Automatic Standardization of Colloquial Persian
This work matters to Persian NLP researchers and developers because it tackles colloquial text standardization, a common bottleneck for existing tools, most of which assume standard-form input.
This paper addresses the problem of standardizing colloquial Persian text, which is common in web content, so that it can be processed by NLP tools. The authors frame standardization as sequence-to-sequence translation and train the model on artificially generated parallel data. The model achieves a BLEU score of 62.8, compared with 61.7 for a rule-based model and 46.4 for the original, unstandardized text. The model also improves English-to-Persian machine translation when the training data is colloquial Persian, by 1.4 BLEU on the development data and 0.8 BLEU on the test data.
The Iranian Persian language has two varieties: standard and colloquial. Most natural language processing tools for Persian assume that the text is in standard form; this assumption is wrong in many real applications, especially web content. This paper describes a simple and effective standardization approach based on sequence-to-sequence translation. We design an algorithm for generating artificial parallel colloquial-to-standard data for learning a sequence-to-sequence model. Moreover, we annotate a publicly available evaluation dataset consisting of 1912 sentences from a diverse set of domains. In our intrinsic evaluation, our model achieves a BLEU score of 62.8, compared with 61.7 for an off-the-shelf rule-based standardization model and 46.4 for the original, unstandardized text. We also show that our model improves English-to-Persian machine translation in scenarios where the training data is colloquial Persian, by 1.4 absolute BLEU points on the development data and 0.8 on the test data.
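The abstract does not spell out the data-generation algorithm, so the following is only a minimal sketch of the general idea of producing artificial colloquial-to-standard pairs: start from standard sentences and corrupt them into colloquial-looking ones, then use the corrupted side as the source and the original as the target. The word list, the function names `colloquialize` and `make_parallel`, and the probability parameter `p` are all assumptions made for illustration, not the authors' method. The mappings themselves reflect well-known colloquial variants (for example, "را" becoming "رو" and "خانه" becoming "خونه").

```python
import random

# Hypothetical, hand-picked standard -> colloquial word mappings.
# A real system would need far broader coverage (verb endings, clitics, etc.).
STANDARD_TO_COLLOQUIAL = {
    "را": "رو",
    "خانه": "خونه",
    "نان": "نون",
    "آن": "اون",
}


def colloquialize(sentence, rng, p=1.0):
    """Replace each known standard word with its colloquial form with probability p."""
    out = []
    for token in sentence.split():
        replacement = STANDARD_TO_COLLOQUIAL.get(token)
        if replacement is not None and rng.random() < p:
            out.append(replacement)
        else:
            out.append(token)
    return " ".join(out)


def make_parallel(standard_sentences, seed=0, p=1.0):
    """Return (colloquial_source, standard_target) pairs for seq2seq training."""
    rng = random.Random(seed)
    return [(colloquialize(s, rng, p), s) for s in standard_sentences]


if __name__ == "__main__":
    for source, target in make_parallel(["من نان را به خانه بردم"]):
        print(source, "->", target)
```

Setting `p` below 1.0 leaves some standard words untouched, which yields mixed-register source sentences; whether that resembles real web text is an open assumption of this sketch.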