CL | Mar 31, 2022

indic-punct: An automatic punctuation restoration and inverse text normalization framework for Indic languages

Anirudh Gupta, Neeraj Chhimwal, Ankur Dhuriya, Rishabh Gaur, Priyanshi Shah, Harveen Singh Chadha, Vivek Raghavan

arXiv:2203.16825v11.45 citations | Has Code

Originality: Synthesis-oriented

AI Analysis

This work targets the readability of ASR output and the performance of downstream NLP tasks for Indic languages. It is incremental, in that it applies existing methods (a pretrained IndicBERT model and hand-written WFST grammars) to new languages.

The authors address missing punctuation in ASR-generated text for Indic languages with a framework for automatic punctuation restoration and inverse text normalization. The tool covers 11 languages, and the code and data are publicly available.

Automatic Speech Recognition (ASR) generates text that is, most of the time, devoid of any punctuation. The absence of punctuation in text can affect readability. Also, downstream NLP tasks such as sentiment analysis and machine translation benefit greatly from punctuation and sentence boundary information. We present an approach for automatic punctuation of text using a pretrained IndicBERT model. Inverse text normalization is done by hand-writing weighted finite state transducer (WFST) grammars. We have developed this tool for 11 Indic languages, namely Hindi, Tamil, Telugu, Kannada, Gujarati, Marathi, Odia, Bengali, Assamese, Malayalam and Punjabi. All code and data are publicly available.
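
Punctuation restoration with a pretrained encoder is commonly framed as token classification: each word receives a label naming the punctuation mark, if any, that should follow it. The sketch below illustrates only the final reconstruction step under that assumption. The label names, the `DEFAULT_SYMBOLS` map, and the `apply_punctuation` helper are hypothetical and not taken from the indic-punct code, and the labels are hard-coded here where a fine-tuned model would normally supply them.

```python
from typing import Dict, List

# Hypothetical label set (an assumption for illustration); the tag
# inventory actually used by the authors is not stated in the abstract.
DEFAULT_SYMBOLS: Dict[str, str] = {
    "O": "",          # no punctuation after this word
    "COMMA": ",",
    "PERIOD": ".",
    "QUESTION": "?",
}


def apply_punctuation(
    words: List[str],
    labels: List[str],
    symbols: Dict[str, str] = DEFAULT_SYMBOLS,
) -> str:
    """Attach the punctuation mark named by each label to its word."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    return " ".join(word + symbols[label] for word, label in zip(words, labels))


# Labels are hard-coded stand-ins for per-word model predictions.
words = ["how", "are", "you", "i", "am", "fine"]
labels = ["O", "O", "QUESTION", "O", "O", "PERIOD"]
restored = apply_punctuation(words, labels)
print(restored)  # how are you? i am fine.

# A language-specific sentence terminator can be swapped in, for
# example the danda used in Hindi.
hindi_symbols = {**DEFAULT_SYMBOLS, "PERIOD": "।"}
hindi = apply_punctuation(["यह", "ठीक", "है"], ["O", "O", "PERIOD"], hindi_symbols)
```

Keeping the label-to-symbol map as a parameter lets one set of labels serve scripts with different punctuation marks, which is one plausible way to cover several languages with a shared classifier.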

