Discriminating between similar languages in Twitter using label propagation
This work addresses language identification for similar languages in social media, a prerequisite for downstream linguistic processing tasks. The contribution is incremental: it builds on existing methods and adds one specific improvement.
The paper distinguishes between similar languages in Twitter messages by combining content analysis with the social graph of tweet authors, achieving state-of-the-art shared task performance of 76.63%, which is 1.4% higher than the top existing system.
Identifying the language of social media messages is an important first step in linguistic processing. Existing models for Twitter focus on content analysis, which is successful for dissimilar language pairs. We propose a label propagation approach that takes the social graph of tweet authors into account as well as content to better tease apart similar languages. This results in state-of-the-art shared task performance of $76.63\%$, $1.4\%$ higher than the top system.
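To illustrate the general idea of combining content scores with a social graph, the following is a minimal sketch of iterative label propagation. It is not the system described in this paper: the function name `propagate_labels`, the mixing weight `alpha`, the iteration count, the node IDs, and the language labels `lang_a` and `lang_b` are all hypothetical, and the content scores stand in for the output of some content-based classifier.

```python
from collections import defaultdict


def _normalize(scores, langs):
    """Turn raw per-language scores into a probability distribution."""
    total = sum(scores.get(lang, 0.0) for lang in langs)
    if total == 0:
        # No content evidence: fall back to a uniform distribution.
        return {lang: 1.0 / len(langs) for lang in langs}
    return {lang: scores.get(lang, 0.0) / total for lang in langs}


def propagate_labels(edges, content_scores, alpha=0.5, n_iter=20):
    """Blend each node's content-based prior with its neighbours' beliefs.

    edges          -- iterable of (u, v) pairs, treated as undirected
    content_scores -- dict: node -> {language: score} (assumed classifier output)
    alpha          -- assumed weight given to the neighbourhood average
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    nodes = set(neighbors) | set(content_scores)
    langs = sorted({lang for s in content_scores.values() for lang in s})
    prior = {n: _normalize(content_scores.get(n, {}), langs) for n in nodes}

    current = dict(prior)
    for _ in range(n_iter):
        updated = {}
        for n in nodes:
            nbrs = neighbors.get(n, ())
            if not nbrs:
                # Isolated node: only content evidence is available.
                updated[n] = prior[n]
                continue
            avg = {
                lang: sum(current[m][lang] for m in nbrs) / len(nbrs)
                for lang in langs
            }
            updated[n] = {
                lang: (1 - alpha) * prior[n][lang] + alpha * avg[lang]
                for lang in langs
            }
        current = updated

    # Pick the most probable language for every node.
    return {n: max(langs, key=lambda lang: current[n][lang]) for n in nodes}


# Toy data: two small components of a hypothetical author graph.
edges = [("a", "b"), ("b", "c"), ("d", "e")]
content_scores = {
    "a": {"lang_a": 0.9, "lang_b": 0.1},
    "b": {"lang_a": 0.5, "lang_b": 0.5},    # content alone is undecided
    "c": {"lang_a": 0.8, "lang_b": 0.2},
    "d": {"lang_a": 0.1, "lang_b": 0.9},
    "e": {"lang_a": 0.55, "lang_b": 0.45},  # content weakly favours lang_a
}
labels = propagate_labels(edges, content_scores)
```

In this toy example, node `b` is ambiguous on content alone and node `e` weakly leans toward `lang_a`, but their neighbours pull them to `lang_a` and `lang_b` respectively. This shows how graph evidence can help where content evidence is weak, as is often the case for similar languages.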