LG CL, May 11, 2016

Tweet2Vec: Character-Based Distributed Representations for Social Media

Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, William W. Cohen

arXiv:1605.03481v2

Originality: Incremental advance

AI Analysis

The paper addresses the challenge of handling noisy social media text in NLP applications. The contribution is incremental, since it builds on existing embedding methods.

The paper tackles the problem of representing social media text, with its informal language and spelling errors, by proposing a character-based model, tweet2vec. The model outperformed a word-level baseline at predicting user-annotated hashtags, especially on inputs containing many out-of-vocabulary words.

Text from social media provides a set of challenges that can cause traditional NLP approaches to fail. Informal language, spelling errors, abbreviations, and special characters are all commonplace in these posts, leading to a prohibitively large vocabulary size for word-level approaches. We propose a character composition model, tweet2vec, which finds vector-space representations of whole tweets by learning complex, non-local dependencies in character sequences. The proposed model outperforms a word-level baseline at predicting user-annotated hashtags associated with the posts, doing significantly better when the input contains many out-of-vocabulary words or unusual character sequences. Our tweet2vec encoder is publicly available.
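The vocabulary argument in the abstract can be illustrated with a small sketch. This is not the tweet2vec encoder itself; it only compares how often a noisy post falls outside a word-level vocabulary versus a character-level one. The toy tweets and the helper names `build_vocab` and `oov_rate` are assumptions made up for this illustration.

```python
def build_vocab(units):
    """Return the set of distinct units (words or characters) seen in training."""
    return set(units)


def oov_rate(units, vocab):
    """Fraction of units in a sequence that are absent from the vocabulary."""
    units = list(units)
    if not units:
        return 0.0
    return sum(u not in vocab for u in units) / len(units)


# Toy training tweets, invented for illustration only.
train = ["so happy today", "good morning everyone", "see you tomorrow"]

# Word-level vocabulary: whitespace-separated tokens.
word_vocab = build_vocab(w for t in train for w in t.split())
# Character-level vocabulary: every character, including spaces.
char_vocab = build_vocab(c for t in train for c in t)

# A noisy post with elongated spellings and an abbreviation.
noisy = "sooo happyyy 2day"

word_oov = oov_rate(noisy.split(), word_vocab)
char_oov = oov_rate(noisy, char_vocab)

print(f"word-level OOV rate: {word_oov:.2f}")
print(f"char-level OOV rate: {char_oov:.2f}")
```

In this toy setting every word of the noisy post is unseen, while only the digit is unseen at the character level. That gap is the motivation for composing a tweet representation from characters rather than from a fixed word vocabulary.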
