CL LG · Feb 23, 2023

MUTANT: A Multi-sentential Code-mixed Hinglish Dataset

Rahul Gupta, Vivek Srivastava, Mayank Singh

Amazon

arXiv:2302.11766v1 · h-index: 20

Originality: Synthesis-oriented

AI Analysis

The paper addresses a gap in resources for code-mixed language research, though the contribution is incremental: it extends existing code-mixing metrics to a new, multi-sentential setting.

The paper tackles the lack of long-sequence datasets for code-mixed languages such as Hinglish by building MUTANT, a multi-sentential code-mixed Hinglish dataset of 67k articles containing 85k identified Hinglish multi-sentential code-mixed texts (MCTs).

Multi-sentential, long-sequence text opens several interesting research directions in natural language processing and generation. Although several high-quality long-sequence datasets exist for English and other monolingual settings, there has been no significant effort to build such resources for code-mixed languages such as Hinglish (code-mixing of Hindi and English). In this paper, we propose a novel task of identifying multi-sentential code-mixed text (MCT) in multilingual articles. As a use case, we leverage multilingual articles from two different data sources and build a first-of-its-kind multi-sentential code-mixed Hinglish dataset, MUTANT. We propose a token-level language-aware pipeline and extend existing metrics that measure the degree of code-mixing to a multi-sentential framework, which lets us automatically identify MCT in multilingual articles. The MUTANT dataset comprises 67k articles with 85k identified Hinglish MCTs. To facilitate future research, we make the dataset publicly available.
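
To make the idea of scoring code-mixing across several sentences concrete, here is a minimal sketch. It is not the authors' pipeline or their extended metric, which the abstract does not specify. It assumes tokens have already been tagged with a language label, uses the widely cited Code-Mixing Index (CMI) as the per-sentence score, and averages it over sentences. The function names, the `"other"` tag for language-neutral tokens, and both thresholds are hypothetical choices made for illustration.

```python
from collections import Counter


def cmi(tags, neutral="other"):
    """Code-Mixing Index for one sentence, given one language tag per token.

    Returns 100 * (1 - max_lang_count / language_tagged_tokens), or 0.0 when
    the sentence has no language-tagged tokens. The `neutral` tag marks
    tokens such as punctuation or named entities (an assumption here).
    """
    counts = Counter(t for t in tags if t != neutral)
    lang_tokens = sum(counts.values())
    if lang_tokens == 0:
        return 0.0
    return 100.0 * (1 - max(counts.values()) / lang_tokens)


def is_mct(sentences, min_sentences=2, threshold=20.0):
    """Hypothetical multi-sentential check: mean per-sentence CMI >= threshold.

    `sentences` is a list of tag lists. Both `min_sentences` and `threshold`
    are illustrative values, not figures from the paper.
    """
    if len(sentences) < min_sentences:
        return False
    scores = [cmi(tags) for tags in sentences]
    return sum(scores) / len(scores) >= threshold


# Example: one evenly mixed sentence followed by one English-only sentence.
passage = [["hi", "hi", "en", "en"], ["en", "en", "en"]]
print(cmi(passage[0]), is_mct(passage))
```

Averaging is only one way to aggregate; a sketch like this would treat a passage with a single heavily mixed sentence the same as one with mild mixing throughout, which may or may not match how the paper's multi-sentential metrics behave.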
