CL, Oct 18, 2021

ViraPart: A Text Refinement Framework for Automatic Speech Recognition and Natural Language Processing Tasks in Persian

Narges Farokhshad, Milad Molazadeh, Saman Jamalabbasi, Hamed Babaei Giglou, Saeed Bibak

arXiv:2110.09086v30.51 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses text refinement for Persian language processing. Its contribution is incremental: it combines existing techniques into a unified framework.

The authors tackled text refinement in Persian by developing the ViraPart framework, which integrates Zero-Width Non-Joiner recognition, punctuation restoration, and Persian Ezafe construction, achieving averaged F1 macro scores of 96.90%, 92.13%, and 98.50%, respectively.
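For readers unfamiliar with the metric, the following is a minimal sketch of the standard macro-averaged F1 computation: F1 is computed per label and then averaged with equal weight per label. The function name `macro_f1` and the label names are illustrative assumptions. How the paper averages its reported scores (for example, across runs or test sets) is not stated here, so this shows only the textbook definition.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Textbook macro F1: unweighted mean of per-label F1 scores.

    `gold` and `pred` are equal-length sequences of labels.
    This is an illustrative helper, not code from the paper.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it was wrong
            fn[g] += 1  # missed the true label g

    scores = []
    for label in sorted(set(gold) | set(pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        scores.append(2 * precision * recall / pr_sum if pr_sum else 0.0)
    return sum(scores) / len(scores)


# Hypothetical punctuation labels for four tokens.
gold = ["O", "O", "COMMA", "PERIOD"]
pred = ["O", "COMMA", "COMMA", "PERIOD"]
score = macro_f1(gold, pred)
```

Because every label counts equally, a rare label (such as an uncommon punctuation mark) affects the score as much as a frequent one.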

Persian is an inflectional subject-object-verb language, a property that makes Persian text more ambiguous. Techniques such as Zero-Width Non-Joiner (ZWNJ) recognition, punctuation restoration, and Persian Ezafe construction, however, make the text more understandable and precise. Most prior work in Persian addresses these techniques individually, but we believe that all of these tasks are necessary for text refinement in Persian. In this work, we propose ViraPart, a framework that uses embedded ParsBERT at its core for text clarification. First, we use the BERT variant for Persian followed by a classifier layer for the classification procedures. Next, we combine the models' outputs to produce clear text. Finally, the proposed models for ZWNJ recognition, punctuation restoration, and Persian Ezafe construction achieve averaged F1 macro scores of 96.90%, 92.13%, and 98.50%, respectively. Experimental results show that the proposed approach is very effective for text refinement in Persian.
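The abstract does not spell out how the three models' outputs are combined, so the following is only a sketch of one plausible merging step. It assumes each model emits one label per token. The function `combine`, the label names, and the choice to write Ezafe as a kasra (U+0650) are all hypothetical simplifications, not the authors' implementation. In real Persian orthography, Ezafe after vowel-final words is written differently.

```python
ZWNJ = "\u200c"   # Zero-Width Non-Joiner
KASRA = "\u0650"  # simplified Ezafe marker (assumption)

# Hypothetical punctuation label set; the paper's labels may differ.
PUNCT = {"O": "", "COMMA": "\u060c", "PERIOD": ".", "QUESTION": "\u061f"}


def combine(tokens, zwnj_labels, ezafe_labels, punct_labels):
    """Merge per-token predictions from three taggers into one string.

    zwnj_labels[i]  : True if token i joins the next token with a ZWNJ
    ezafe_labels[i] : True if token i takes an Ezafe marker
    punct_labels[i] : key of PUNCT for the mark that follows token i
    """
    n = len(tokens)
    if not (len(zwnj_labels) == len(ezafe_labels) == len(punct_labels) == n):
        raise ValueError("all label sequences must match the token count")

    parts = []
    for i, token in enumerate(tokens):
        piece = token
        if ezafe_labels[i]:
            piece += KASRA
        piece += PUNCT[punct_labels[i]]
        if i < n - 1:
            # A ZWNJ replaces the space between two tokens when predicted.
            piece += ZWNJ if zwnj_labels[i] else " "
        parts.append(piece)
    return "".join(parts)


# Two tokens joined by a ZWNJ, followed by a period.
refined = combine(["می", "روم"], [True, False], [False, False], ["O", "PERIOD"])
```

Keeping the merge as a separate, model-free step means each tagger can be evaluated and replaced independently, which matches the abstract's description of separate classifiers whose outputs are combined afterwards.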
