CL | LG | Feb 23, 2023

MUTANT: A Multi-sentential Code-mixed Hinglish Dataset

Amazon
arXiv:2302.11766v1 | 267 citations | h-index: 20
Originality: Synthesis-oriented
AI Analysis

This work addresses a gap in resources for code-mixed language research, though it is incremental: it extends existing methods to a new domain.

The paper tackles the lack of long-sequence datasets for code-mixed languages like Hinglish by building MUTANT, a multi-sentential code-mixed Hinglish dataset of 67k articles containing 85k identified Hinglish multi-sentential code-mixed texts (MCTs).

Multi-sentential, long-sequence textual data opens up several interesting research directions in natural language processing and generation. Although several high-quality long-sequence datasets exist for English and other individual languages, there has been no significant effort to build such resources for code-mixed languages such as Hinglish (code-mixed Hindi-English). In this paper, we propose a novel task of identifying multi-sentential code-mixed text (MCT) in multilingual articles. As a use case, we leverage multilingual articles from two different data sources and build MUTANT, a first-of-its-kind multi-sentential code-mixed Hinglish dataset. We propose a token-level language-aware pipeline, extend existing metrics for measuring the degree of code-mixing to a multi-sentential framework, and automatically identify MCTs in the multilingual articles. The MUTANT dataset comprises 67k articles with 85k identified Hinglish MCTs. To facilitate future research, we make the dataset publicly available.
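To make the idea of measuring code-mixing over several sentences concrete, here is a minimal sketch. The abstract does not say which metrics are extended or how, so this is an assumption: the sketch uses the widely cited Code-Mixing Index (CMI) of Das and Gambäck over token-level language tags. The tag names (`"hi"`, `"en"`, `"other"`) and the token-weighted aggregate `multi_sentence_cmi` are hypothetical choices for illustration, not the paper's definitions.

```python
from collections import Counter


def cmi(tags):
    """Code-Mixing Index (0 to 100) for one sequence of token-level tags.

    `tags` holds one label per token. The label "other" marks
    language-independent tokens (punctuation, numbers, and so on).
    These label names are assumptions made for this sketch.
    """
    n = len(tags)
    lang_counts = Counter(t for t in tags if t != "other")
    u = n - sum(lang_counts.values())  # language-independent tokens
    if n == u:  # no language-specific tokens, so nothing is mixed
        return 0.0
    dominant = max(lang_counts.values())
    return 100.0 * (1.0 - dominant / (n - u))


def multi_sentence_cmi(sentences):
    """Hypothetical multi-sentence aggregate: token-weighted mean CMI.

    This is one simple way to lift a sentence-level score to a span of
    sentences. It is not taken from the paper.
    """
    total = sum(len(s) for s in sentences)
    if total == 0:
        return 0.0
    return sum(len(s) * cmi(s) for s in sentences) / total


# Two tagged sentences: the first mixes Hindi and English, the second does not.
span = [["hi", "hi", "en", "other"], ["en", "en", "en"]]
score = multi_sentence_cmi(span)
```

A span could then be flagged as a candidate MCT when its score exceeds a chosen threshold. Any such threshold would be a tuning decision and is not specified in the abstract.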

