CL, LG · Jul 2, 2021

Language Identification of Hindi-English tweets using code-mixed BERT

Mohd Zeeshan Ansari, M M Sufyan Beg, Tanvir Ahmad, Mohd Jazib Khan, Ghazali Wasim

arXiv:2107.01202v10.720 citations

Originality Synthesis-oriented

AI Analysis

This work addresses language identification for social media text in Hindi-English contexts, but it is incremental: it applies existing BERT methods to a specific code-mixed dataset.

The paper tackles language identification in Hindi-English code-mixed tweets by fine-tuning BERT models pre-trained on code-mixed data, showing that these representations outperform their monolingual counterparts.

Language identification of social media text has been an active research problem in recent years. In regions where English is not the primary language, social media messages are predominantly code-mixed. Pre-trained contextual embeddings, which encode prior knowledge, have shown state-of-the-art results on a range of downstream tasks. Recently, models such as BERT have shown that, given a large amount of unlabeled data, pre-trained language models are even more beneficial for learning common language representations. This paper presents extensive experiments that exploit transfer learning and fine-tuning of BERT models to identify language on Twitter. The work uses a collection of Hindi-English-Urdu code-mixed text for language-model pre-training and Hindi-English code-mixed text for subsequent word-level language classification. The results show that representations pre-trained on code-mixed data produce better results than their monolingual counterparts.
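
Word-level language classification with a BERT-style model usually requires mapping one label per word onto the subword pieces the tokenizer produces. The abstract does not describe how the authors handle this, so the following is only a minimal sketch of one common convention: label the first piece of each word and mask the remaining pieces out of the loss. The vocabulary, the example sentence, the label names, and the helper functions `wordpiece` and `align_labels` are all hypothetical and chosen for illustration; a real setup would use the vocabulary shipped with the pre-trained model.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical toy vocabulary (assumption): a real model ships its own.
TOY_VOCAB: Set[str] = {"[UNK]", "main", "kal", "movie", "dekh", "##ne", "ja", "##unga"}

# Hypothetical label set (assumption): "hi" = Hindi, "en" = English.
LABEL2ID = {"hi": 0, "en": 1}

# -100 is the default ignore_index of PyTorch's CrossEntropyLoss, so
# positions carrying this id contribute nothing to the training loss.
IGNORE = -100


def wordpiece(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def align_labels(
    words: Iterable[str], labels: Iterable[str], vocab: Set[str] = TOY_VOCAB
) -> Tuple[List[str], List[int]]:
    """Expand word-level labels so there is one label id per subword piece."""
    tokens: List[str] = []
    label_ids: List[int] = []
    for word, label in zip(words, labels):
        pieces = wordpiece(word.lower(), vocab)
        tokens.extend(pieces)
        # First piece carries the word's label; later pieces are masked.
        label_ids.extend([LABEL2ID[label]] + [IGNORE] * (len(pieces) - 1))
    return tokens, label_ids


# Made-up code-mixed example with one language tag per word.
words = ["main", "kal", "movie", "dekhne", "jaunga"]
labels = ["hi", "hi", "en", "hi", "hi"]
tokens, label_ids = align_labels(words, labels)
print(list(zip(tokens, label_ids)))
```

At prediction time, the same convention reads the language tag of each word from its first piece only, so the output stays at one tag per word regardless of how the tokenizer splits it.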
