CL LG SI Apr 2, 2023

MMT: A Multilingual and Multi-Topic Indian Social Media Dataset

Dwip Dalal, Vivek Srivastava, Mayank Singh

arXiv:2304.00634v227.9265 citations h-index: 14 Has Code

Originality Synthesis-oriented

AI Analysis

The dataset provides a resource for researchers working on NLP in multilingual and code-mixed contexts, particularly for Indian social media, but the contribution is incremental because it focuses on dataset creation.

The authors tackled the challenge of processing code-mixed and multilingual social media data by introducing the MMT dataset, which includes 1.7 million tweets with 13 coarse-grained and 63 fine-grained topics, and they demonstrated that existing tools fail on tasks like topic modeling and language identification.

Social media plays a significant role in cross-cultural communication. A vast amount of this communication occurs in code-mixed and multilingual form, posing a significant challenge to Natural Language Processing (NLP) tools for tasks such as language identification, topic modeling, and named-entity recognition. To address this, we introduce a large-scale multilingual and multi-topic dataset (MMT) collected from Twitter (1.7 million tweets), encompassing 13 coarse-grained and 63 fine-grained topics in the Indian context. We further annotate a subset of 5,346 tweets from the MMT dataset with various Indian languages and their code-mixed counterparts. We also demonstrate that existing tools fail to capture the linguistic diversity in MMT on two downstream tasks, i.e., topic modeling and language identification. To facilitate future research, we have made the anonymized and annotated dataset available at https://huggingface.co/datasets/LingoIITGN/MMT.
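
To illustrate why code-mixed text is difficult for language identification, here is a minimal sketch of a naive script-based heuristic. It is not the method or tooling evaluated in the paper. The function names and the example sentences are hypothetical and are not taken from the MMT dataset. The heuristic flags a text as mixed only when its tokens use more than one Unicode script, so it misses code-mixing written entirely in Latin script (romanized Hindi mixed with English, for instance).

```python
import unicodedata
from collections import Counter


def token_script(token: str) -> str:
    """Return a coarse script label for a token, e.g. 'LATIN' or 'DEVANAGARI'.

    Assumption: the first word of a character's Unicode name is used as a
    proxy for its script. This is a rough heuristic, not a real script lookup.
    """
    counts = Counter()
    for ch in token:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and combining marks
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split(" ")[0]] += 1
    return counts.most_common(1)[0][0] if counts else "OTHER"


def scripts_in_text(text: str) -> set:
    """Collect the script labels of all whitespace-separated tokens."""
    return {token_script(tok) for tok in text.split()} - {"OTHER"}


def looks_script_mixed(text: str) -> bool:
    """True if the text contains tokens from more than one script."""
    return len(scripts_in_text(text)) > 1


if __name__ == "__main__":
    # Made-up examples: the same Hindi-English sentence in two spellings.
    native = "मैं office जा रहा हूँ"
    romanized = "main office ja raha hoon"
    print(scripts_in_text(native), looks_script_mixed(native))
    print(scripts_in_text(romanized), looks_script_mixed(romanized))
```

The second example is code-mixed but uses a single script, so the heuristic reports it as unmixed. Cases like this are one reason language identification on such text needs more than surface cues.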
