Tweet2Vec: Character-Based Distributed Representations for Social Media
This paper addresses the challenge of handling noisy social media text in NLP applications. The contribution is incremental, since it builds on existing embedding methods.
The paper tackles the problem of representing social media text, which is full of informal language and spelling errors, by proposing a character-based model, tweet2vec. The model outperforms a word-level baseline at predicting user-annotated hashtags, especially when the input contains many out-of-vocabulary words or unusual character sequences.
Text from social media provides a set of challenges that can cause traditional NLP approaches to fail. Informal language, spelling errors, abbreviations, and special characters are all commonplace in these posts, leading to a prohibitively large vocabulary size for word-level approaches. We propose a character composition model, tweet2vec, which finds vector-space representations of whole tweets by learning complex, non-local dependencies in character sequences. The proposed model outperforms a word-level baseline at predicting user-annotated hashtags associated with the posts, doing significantly better when the input contains many out-of-vocabulary words or unusual character sequences. Our tweet2vec encoder is publicly available.
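The contrast between word-level and character-level inputs can be illustrated with a minimal sketch. It is not the released tweet2vec encoder: the abstract does not specify the architecture, so the toy recurrent encoder below, its untrained random weights, and all names (`word_oov_rate`, `char_oov_rate`, `CharRNNEncoder`) are assumptions made purely for illustration. The sketch shows that misspelled words fall outside a word vocabulary while their characters remain known, and that a character sequence of any length can be folded into a fixed-size vector.

```python
import math
import random


def word_oov_rate(train, test):
    """Fraction of whitespace-separated test tokens never seen in training."""
    vocab = {w for t in train for w in t.split()}
    tokens = [w for t in test for w in t.split()]
    return sum(w not in vocab for w in tokens) / len(tokens)


def char_oov_rate(train, test):
    """Fraction of test characters never seen in training."""
    vocab = {ch for t in train for ch in t}
    chars = [ch for t in test for ch in t]
    return sum(ch not in vocab for ch in chars) / len(chars)


class CharRNNEncoder:
    """Toy tanh RNN over characters (hypothetical, untrained, random weights)."""

    def __init__(self, train, hidden=8, seed=0):
        rng = random.Random(seed)
        chars = sorted({ch for t in train for ch in t})
        # Id 0 is reserved for characters not seen in training.
        self.vocab = {ch: i + 1 for i, ch in enumerate(chars)}
        self.hidden = hidden
        n = len(self.vocab) + 1
        self.E = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(n)]
        self.W = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(hidden)]

    def encode(self, tweet):
        """Fold the character sequence into one fixed-size hidden vector."""
        h = [0.0] * self.hidden
        for ch in tweet:
            x = self.E[self.vocab.get(ch, 0)]
            h = [
                math.tanh(x[i] + sum(self.W[i][j] * h[j] for j in range(self.hidden)))
                for i in range(self.hidden)
            ]
        return h


train = ["so happy today", "good morning world"]
test = ["sooo happyyy todayyy"]  # informal spellings of known words

word_rate = word_oov_rate(train, test)
char_rate = char_oov_rate(train, test)
vector = CharRNNEncoder(train).encode(test[0])
print(word_rate, char_rate, len(vector))
```

In this toy corpus every test word is out of vocabulary at the word level, yet every test character was seen during training, so the character-level encoder still receives fully known input. A trained model would learn the embedding and recurrence weights from hashtag supervision rather than sampling them at random.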