hinglishNorm -- A Corpus of Hindi-English Code Mixed Sentences for Text Normalization
This work addresses the lack of resources for text normalization in Hindi-English code-mixed language. The contribution is incremental: it provides a new dataset rather than a novel method.
The authors created hinglishNorm, which they describe as the first publicly available corpus of Hindi-English code-mixed sentences for text normalization. It contains 13,494 sentences, each paired with a human-annotated normalized form. They report baseline results of 15.55 WER, 71.2 BLEU, and 0.50 METEOR.
We present hinglishNorm -- a human-annotated corpus of Hindi-English code-mixed sentences for the text normalization task. Each sentence in the corpus is aligned to its corresponding human-annotated normalized form. To the best of our knowledge, no corpus of Hindi-English code-mixed sentences for the text normalization task is publicly available, and our work is the first attempt in this direction. The corpus contains 13,494 parallel segments. Further, we present baseline normalization results on this corpus. We obtain a Word Error Rate (WER) of 15.55, a BiLingual Evaluation Understudy (BLEU) score of 71.2, and a Metric for Evaluation of Translation with Explicit ORdering (METEOR) score of 0.50.
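For readers unfamiliar with the WER metric reported above, the following is a minimal sketch of one common way to compute it: the word-level edit distance between system output and the human-annotated reference, divided by the number of reference words and expressed as a percentage. The abstract does not say how the score was aggregated, so the corpus-level pooling here is an assumption, and the function names and the two example sentence pairs are hypothetical illustrations, not items from hinglishNorm.

```python
def word_edit_distance(ref_tokens, hyp_tokens):
    """Levenshtein distance over word tokens (substitution, insertion, deletion cost 1)."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp_tokens) + 1))
    for i, ref_word in enumerate(ref_tokens, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_tokens, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """WER as a percentage, pooling edits and reference lengths over all pairs.

    Assumption: whitespace tokenization and corpus-level pooling; the paper
    may tokenize or aggregate differently.
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_edits = 0
    total_ref_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        total_edits += word_edit_distance(ref_tokens, hyp_tokens)
        total_ref_words += len(ref_tokens)
    if total_ref_words == 0:
        raise ValueError("references contain no words")
    return 100.0 * total_edits / total_ref_words


# Hypothetical example pairs (not taken from the corpus).
references = ["kya haal hai", "main theek hoon"]   # human-normalized forms
hypotheses = ["kya hal hai", "main theek hoon"]    # system output
print(round(corpus_wer(references, hypotheses), 2))  # 16.67 (1 edit over 6 reference words)
```

Under this definition lower is better, whereas for BLEU and METEOR higher is better, which is why the three reported baseline numbers are not directly comparable to one another.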