CL · Aug 10, 2023

Developing an Informal-Formal Persian Corpus

arXiv:2308.05336v1 · 4 citations · h-index: 19
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for informal language processing tools in Persian. The contribution is incremental: it provides a dataset for tasks such as informal-formal conversion and linguistic analysis.

The paper tackles the lack of informal-formal Persian language resources by building a parallel corpus of 50,000 sentence pairs aligned at the word/phrase level, yielding about 530,000 alignments and a dictionary of 49,397 word/phrase pairs.

Informal language is a style of spoken or written language frequently used in casual conversations, social media, weblogs, emails and text messages. In informal writing, a language undergoes lexical and/or syntactic changes, and these changes vary from one language to another. Persian is one of the languages with many differences between its formal and informal styles of writing, so developing informal language processing tools for it seems necessary. One such tool, a converter between informal and formal Persian, needs a large aligned parallel corpus of colloquial-formal sentences; the same corpus can also help linguists extract a regulated grammar and orthography for colloquial Persian, as has been done for the formal language. In this paper we explain our methodology for building a parallel corpus of 50,000 sentence pairs with alignments at the word/phrase level. The sentences were chosen to cover almost all kinds of lexical and syntactic changes between informal and formal Persian. To find as many instances as possible, we applied two methods: exploring and collecting from different sources of informal text, and following the phonological and morphological patterns of change. The resulting corpus has about 530,000 alignments and a dictionary containing 49,397 word and phrase pairs.
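To make the corpus structure concrete, here is a minimal Python sketch of one way a sentence pair with word/phrase alignments could be represented, and how a dictionary of aligned pairs could be derived from it. The class `AlignedPair`, the function `build_dictionary`, and the index-span alignment format are hypothetical and are not taken from the paper. The example sentence is an illustration written for this sketch, not an entry from the corpus, and treating the dictionary as the set of distinct aligned pairs is an assumption the abstract does not confirm.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AlignedPair:
    """One informal-formal sentence pair (hypothetical layout)."""
    informal: list[str]  # informal tokens
    formal: list[str]    # formal tokens
    # Each alignment links informal token indices to formal token indices.
    # More than one index on either side marks a phrase-level alignment.
    alignments: list[tuple[tuple[int, ...], tuple[int, ...]]]


def build_dictionary(pairs: list[AlignedPair]) -> Counter:
    """Count each distinct (informal, formal) word/phrase pair."""
    counts: Counter = Counter()
    for pair in pairs:
        for informal_idx, formal_idx in pair.alignments:
            informal_phrase = " ".join(pair.informal[i] for i in informal_idx)
            formal_phrase = " ".join(pair.formal[j] for j in formal_idx)
            counts[(informal_phrase, formal_phrase)] += 1
    return counts


# Illustrative pair ("I want to go home"); not taken from the corpus.
pair = AlignedPair(
    informal=["می‌خوام", "برم", "خونه"],
    formal=["می‌خواهم", "به", "خانه", "بروم"],
    alignments=[
        ((0,), (0,)),    # word-level: lexical/phonological change
        ((1,), (3,)),    # word-level, with a change in word order
        ((2,), (1, 2)),  # phrase-level: one informal word, two formal words
    ],
)

dictionary = build_dictionary([pair])
for (informal_phrase, formal_phrase), count in dictionary.items():
    print(informal_phrase, "->", formal_phrase, count)
```

Under this assumed layout, the alignment count is the total number of links over all sentence pairs, while the dictionary size is the number of distinct keys. That would be consistent with the corpus having far more alignments than dictionary entries.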
