CL, LG | Dec 30, 2023

L3Cube-MahaSocialNER: A Social Media based Marathi NER Dataset and BERT models

Harsh Chaudhari, Anuja Patil, Dhanashree Lavekar, Pranav Khairnar, Raviraj Joshi

arXiv:2401.00170v1 | 1.01 citations | h-index: 2 | Has Code

Originality: Synthesis-oriented

AI Analysis

The paper addresses the lack of social media NER resources for Marathi, enabling public opinion analysis and marketing applications, but it is incremental in that it applies existing methods to new data.

This work introduces L3Cube-MahaSocialNER, the first and largest social media dataset for Named Entity Recognition in Marathi, comprising 18,000 labeled sentences across eight entity classes, and shows that deep learning models like CNN, LSTM, BiLSTM, and Transformers achieve accurate entity recognition in informal text.

This work introduces the L3Cube-MahaSocialNER dataset, the first and largest social media dataset specifically designed for Named Entity Recognition (NER) in the Marathi language. The dataset comprises 18,000 manually labeled sentences covering eight entity classes, addressing challenges posed by social media data, including non-standard language and informal idioms. Deep learning models, including CNN, LSTM, BiLSTM, and Transformer models, are evaluated on the dataset with both IOB and non-IOB notations. The results demonstrate the effectiveness of these models in accurately recognizing named entities in informal Marathi text. The L3Cube-MahaSocialNER dataset offers user-centric information extraction and supports real-time applications, providing a valuable resource for public opinion analysis, news, and marketing on social media platforms. We also show that the zero-shot results of the regular NER model are poor on the social NER test set, thus highlighting the need for more social NER datasets. The datasets and models are publicly available at https://github.com/l3cube-pune/MarathiNLP
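The abstract mentions evaluation with both IOB and non-IOB notations. As a rough illustration of how the two tagging schemes relate, here is a minimal Python sketch. The tag strings (`B-PER`, `I-PER`, `O`, `LOC`) and the function names are assumptions chosen for illustration; the source does not name the dataset's eight classes or its exact label format.

```python
def iob_to_non_iob(tags):
    """Drop the B-/I- prefixes so each token keeps only its entity class."""
    return [tag[2:] if tag[:2] in ("B-", "I-") else tag for tag in tags]


def non_iob_to_iob(tags, outside="O"):
    """Rebuild IOB tags from plain class tags.

    Assumption: a run of identical adjacent class tags is one entity.
    Two distinct adjacent entities of the same class cannot be told
    apart in non-IOB form, so this direction is lossy.
    """
    iob = []
    prev = outside
    for tag in tags:
        if tag == outside:
            iob.append(outside)
        elif tag == prev:
            iob.append("I-" + tag)  # continues the current entity
        else:
            iob.append("B-" + tag)  # starts a new entity
        prev = tag
    return iob


# Hypothetical label sequence for a four-token sentence.
iob_tags = ["B-PER", "I-PER", "O", "B-LOC"]
plain_tags = iob_to_non_iob(iob_tags)
round_trip = non_iob_to_iob(plain_tags)
```

The non-IOB form is simpler for a model to predict, while the IOB form preserves entity boundaries, which is one plausible reason to report results under both.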

