indic-punct: An automatic punctuation restoration and inverse text normalization framework for Indic languages
This work addresses readability and downstream NLP task performance for Indic language users, but the contribution is incremental: it applies existing methods, namely IndicBERT and WFST grammars, to new languages.
The authors address missing punctuation in ASR-generated text for Indic languages with a framework for automatic punctuation restoration and inverse text normalization that covers 11 languages; the code and data are publicly available.
Automatic Speech Recognition (ASR) generates text that is, most of the time, devoid of punctuation. The absence of punctuation in text can affect readability. Downstream NLP tasks such as sentiment analysis and machine translation also benefit greatly from punctuation and sentence boundary information. We present an approach for automatic punctuation of text using a pretrained IndicBERT model. Inverse text normalization is performed with hand-written weighted finite-state transducer (WFST) grammars. We have developed this tool for 11 Indic languages, namely Hindi, Tamil, Telugu, Kannada, Gujarati, Marathi, Odia, Bengali, Assamese, Malayalam and Punjabi. All code and data are publicly available.
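The two stages described above can be illustrated with a minimal sketch. It is not the framework's implementation: the label set (`O`, `COMMA`, `PERIOD`, `QUESTION`), the function names, and the lookup tables are all hypothetical. The first function assumes that a token classifier (such as a fine-tuned IndicBERT model) has already produced one punctuation label per word. The second uses a plain dictionary lookup as a stand-in for WFST grammars, with English number words chosen only for readability.

```python
from typing import Dict, List

# Hypothetical label set; "O" means no punctuation follows the token.
PUNCT_BY_LABEL: Dict[str, str] = {
    "O": "",
    "COMMA": ",",
    "PERIOD": ".",
    "QUESTION": "?",
}


def apply_punctuation(tokens: List[str], labels: List[str]) -> str:
    """Attach the punctuation mark predicted for each token.

    The labels are assumed to come from a token classifier; here they
    are supplied by hand.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return " ".join(tok + PUNCT_BY_LABEL[lab] for tok, lab in zip(tokens, labels))


# Toy spoken-form tables (assumption: English words, for readability only).
UNITS: Dict[str, int] = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
}
TENS: Dict[str, int] = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def inverse_normalize(text: str) -> str:
    """Rewrite spoken number words (0-9 and 20-99) as digits.

    A dictionary-based stand-in for a WFST grammar, not a transducer.
    """
    words = text.split()
    out: List[str] = []
    i = 0
    while i < len(words):
        word = words[i]
        if word in TENS:
            value = TENS[word]
            # Merge a following non-zero unit: "twenty five" -> 25.
            if i + 1 < len(words) and UNITS.get(words[i + 1], 0) > 0:
                value += UNITS[words[i + 1]]
                i += 1
            out.append(str(value))
        elif word in UNITS:
            out.append(str(UNITS[word]))
        else:
            out.append(word)
        i += 1
    return " ".join(out)


if __name__ == "__main__":
    print(apply_punctuation(
        ["how", "are", "you", "i", "am", "fine"],
        ["O", "O", "QUESTION", "O", "O", "PERIOD"],
    ))
    print(inverse_normalize("the train leaves at twenty five past nine"))
```

Treating punctuation restoration as per-token labeling keeps the output aligned with the ASR words, so the text itself is never rewritten by the model. A real WFST grammar would additionally handle language-specific number systems, dates, and currencies, which this lookup does not attempt.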