CL · Mar 31, 2022

indic-punct: An automatic punctuation restoration and inverse text normalization framework for Indic languages

arXiv:2203.16825v1 · 5 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses readability and downstream NLP task performance for Indic language users, but it is incremental as it applies existing methods like IndicBERT and WFST grammars to new languages.

The authors address missing punctuation in ASR-generated text for Indic languages with a framework for automatic punctuation restoration and inverse text normalization. It covers 11 languages, and the code and data are publicly available.

Automatic Speech Recognition (ASR) generates text that is usually devoid of punctuation. The absence of punctuation in text can affect readability. Also, downstream NLP tasks such as sentiment analysis and machine translation benefit greatly from punctuation and sentence boundary information. We present an approach for automatic punctuation of text using a pretrained IndicBERT model. Inverse text normalization is done by hand-writing weighted finite state transducer (WFST) grammars. We have developed this tool for 11 Indic languages, namely Hindi, Tamil, Telugu, Kannada, Gujarati, Marathi, Odia, Bengali, Assamese, Malayalam and Punjabi. All code and data are publicly available.
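
Punctuation restoration is often framed as token-level labeling: a tagger predicts, for each word, which mark (if any) follows it, and the marks are then attached to the unpunctuated word sequence. The abstract does not confirm that this tool uses that formulation, so the sketch below is an assumption. It shows only the final step of attaching predicted marks. The label names, the `LABEL_TO_MARK` table, and the `apply_punctuation` helper are hypothetical, and the labels are hard-coded rather than produced by IndicBERT.

```python
# Hypothetical label set; the abstract does not list the punctuation classes used.
LABEL_TO_MARK = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}


def apply_punctuation(words, labels):
    """Attach the punctuation mark predicted for each word to that word."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    return " ".join(word + LABEL_TO_MARK[label] for word, label in zip(words, labels))


# Labels are hard-coded for illustration; a trained tagger would supply them.
words = ["yes", "i", "can", "hear", "you", "can", "you", "hear", "me"]
labels = ["COMMA", "O", "O", "O", "PERIOD", "O", "O", "O", "QUESTION"]
punctuated = apply_punctuation(words, labels)
print(punctuated)  # yes, i can hear you. can you hear me?
```

For an Indic language, the same table could map the sentence-final label to the danda (।) instead of a full stop; that mapping is likewise an illustration, not a detail taken from the paper.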

Code Implementations: 1 repo
