LG, CL | May 11, 2016

Tweet2Vec: Character-Based Distributed Representations for Social Media

arXiv:1605.03481v2 | 180 citations
AI Analysis

This paper addresses the challenge of handling noisy social media text in NLP applications, though the contribution is incremental because it builds on existing embedding methods.

The paper tackles the problem of representing social media text, with its informal language and spelling errors, by proposing a character-based model, tweet2vec. The model outperforms a word-level baseline at predicting user-annotated hashtags, especially when the input contains many out-of-vocabulary words.

Text from social media provides a set of challenges that can cause traditional NLP approaches to fail. Informal language, spelling errors, abbreviations, and special characters are all commonplace in these posts, leading to a prohibitively large vocabulary size for word-level approaches. We propose a character composition model, tweet2vec, which finds vector-space representations of whole tweets by learning complex, non-local dependencies in character sequences. The proposed model outperforms a word-level baseline at predicting user-annotated hashtags associated with the posts, doing significantly better when the input contains many out-of-vocabulary words or unusual character sequences. Our tweet2vec encoder is publicly available.
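
As a rough illustration of the vocabulary problem the abstract describes, the sketch below compares how often a noisy post falls out of vocabulary when it is tokenized by word versus by character. The toy corpus, the example post, and the helper names (`build_vocab`, `encode`, `unk_rate`) are all hypothetical; this is not the tweet2vec model or its released encoder, only a picture of why character-level input sidesteps the large-vocabulary issue.

```python
# Toy illustration only: the corpus, the noisy post, and the helper names
# are hypothetical and are not taken from the tweet2vec code or data.

def build_vocab(tokens):
    """Map each distinct token to an integer id; id 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace every token with its id, falling back to the <unk> id (0)."""
    return [vocab.get(tok, 0) for tok in tokens]


def unk_rate(ids):
    """Fraction of positions that were mapped to <unk>."""
    return sum(1 for i in ids if i == 0) / len(ids)


# Clean "training" text and a noisy post with abbreviations and a misspelling.
train = ["see you tomorrow", "so good to see you", "good morning"]
noisy = "c u tmrw so goood"

# Word-level vocabulary: one entry per whitespace-separated word.
word_vocab = build_vocab(w for t in train for w in t.split())
# Character-level vocabulary: one entry per character, spaces included.
char_vocab = build_vocab(ch for t in train for ch in t)

word_ids = encode(noisy.split(), word_vocab)
char_ids = encode(list(noisy), char_vocab)

print(f"word-level <unk> rate: {unk_rate(word_ids):.2f}")
print(f"char-level <unk> rate: {unk_rate(char_ids):.2f}")
```

In this toy setting most words of the noisy post are unseen, while nearly all of its characters are covered by the character vocabulary. A character composition model of the kind the abstract describes would then have to learn a representation of the whole post from such character id sequences; that learning step is outside this sketch.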

Code Implementations: 1 repo
