CL | Feb 25

MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification

arXiv:2602.21608v1 | 1 citation | h-index: 3
Originality: Synthesis-oriented
AI Analysis

MixSarc addresses a gap in culturally aware NLP for South Asian social media users, though the contribution is incremental because it centers on dataset creation and benchmarking.

The researchers address the lack of resources for implicit meaning identification in Bangla-English code-mixed social media by creating MixSarc, a corpus of 9,087 manually annotated sentences. Benchmarks show strong performance on humor detection but substantial degradation on sarcasm, offense, and vulgarity, which the authors attribute to class imbalance and pragmatic complexity. Zero-shot models achieve competitive micro-F1 scores but low exact match accuracy.

Bangla-English code-mixing is widespread across South Asian social media, yet resources for implicit meaning identification in this setting remain scarce. Existing sentiment and sarcasm models largely focus on monolingual English or high-resource languages and struggle with transliteration variation, cultural references, and intra-sentential language switching. To address this gap, we introduce MixSarc, the first publicly available Bangla-English code-mixed corpus for implicit meaning identification. The dataset contains 9,087 manually annotated sentences labeled for humor, sarcasm, offensiveness, and vulgarity. We construct the corpus through targeted social media collection, systematic filtering, and multi-annotator validation. We benchmark transformer-based models and evaluate zero-shot large language models under structured prompting. Results show strong performance on humor detection but substantial degradation on sarcasm, offense, and vulgarity due to class imbalance and pragmatic complexity. Zero-shot models achieve competitive micro-F1 scores but low exact match accuracy. Further analysis reveals that over 42% of negative sentiment instances in an external dataset exhibit sarcastic characteristics. MixSarc provides a foundational resource for culturally aware NLP and supports more reliable multi-label modeling in code-mixed environments.
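
The gap between micro-F1 and exact match accuracy in a multi-label setting can be illustrated with a small sketch. This is not the authors' evaluation code: the label order, the helper names `micro_f1` and `exact_match`, and the toy predictions are all assumptions made up for illustration. Micro-F1 pools true positives, false positives, and false negatives over every (sentence, label) pair, while exact match only credits a sentence when all four labels are predicted correctly.

```python
from typing import Sequence

# Assumed label order for illustration; the paper names these four labels
# but this column layout is hypothetical.
LABELS = ("humor", "sarcasm", "offensive", "vulgar")


def micro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Micro-averaged F1: pool TP/FP/FN over all (sentence, label) pairs."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def exact_match(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of sentences whose full label vector is predicted exactly."""
    if not y_true:
        return 0.0
    hits = sum(1 for t, p in zip(y_true, y_pred) if tuple(t) == tuple(p))
    return hits / len(y_true)


# Toy data (invented): the model always finds humor but misses the rarer labels.
y_true = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
]
y_pred = [
    [1, 0, 0, 0],  # misses sarcasm
    [1, 0, 0, 0],  # fully correct
    [0, 1, 0, 0],  # misses offensive
    [1, 0, 0, 0],  # misses vulgar
]

print(f"micro-F1:    {micro_f1(y_true, y_pred):.3f}")
print(f"exact match: {exact_match(y_true, y_pred):.3f}")
```

On this toy data, micro-F1 is 8/11 (about 0.73) while exact match is 0.25. A single missed label per sentence barely dents the pooled score, but it zeroes out that sentence for exact match. This is one way the two metrics can diverge.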
