CL | Oct 18, 2020

hinglishNorm -- A Corpus of Hindi-English Code Mixed Sentences for Text Normalization

arXiv:2010.08974v1 | 11 citations
Originality: Synthesis-oriented
AI Analysis

This paper addresses the lack of resources for text normalization in Hindi-English code-mixed language. The contribution is incremental: it provides a new dataset rather than a novel method.

The authors created hinglishNorm, the first publicly available corpus of 13,494 Hindi-English code-mixed sentences paired with human-annotated normalized forms for text normalization. Their baseline reaches a WER of 15.55, a BLEU score of 71.2, and a METEOR score of 0.50.

We present hinglishNorm -- a human-annotated corpus of Hindi-English code-mixed sentences for the text normalization task. Each sentence in the corpus is aligned to its corresponding human-annotated normalized form. To the best of our knowledge, no corpus of Hindi-English code-mixed sentences for text normalization is publicly available; our work is the first attempt in this direction. The corpus contains 13,494 parallel segments. We also present baseline normalization results on this corpus: a Word Error Rate (WER) of 15.55, a BiLingual Evaluation Understudy (BLEU) score of 71.2, and a Metric for Evaluation of Translation with Explicit ORdering (METEOR) score of 0.50.
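
For readers unfamiliar with WER, here is a minimal sketch of how the metric is commonly computed: the word-level edit distance between the system output and the reference, divided by the number of reference words. It is not the authors' evaluation script. The function name `word_error_rate`, the whitespace tokenization, the percentage scaling (assuming the reported 15.55 is a percentage), and the example sentence pair are all assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length, as a percentage.

    Hypothetical helper: tokenization is plain whitespace splitting, which
    may differ from the tokenization used for the paper's reported scores.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # reference word missing from hypothesis
                curr[j - 1] + 1,                  # extra word in hypothesis
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Invented example pair, not taken from the corpus.
reference = "kya haal hai"
hypothesis = "kya hal hai"
print(round(word_error_rate(reference, hypothesis), 2))
```

In this sketch one of three reference words is substituted, so the score is one third expressed as a percentage. Lower is better for WER, while higher is better for BLEU and METEOR.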

Code Implementations: 2 repos