Challenges in Persian Electronic Text Analysis
This paper addresses problems faced by researchers and developers working with Persian language data, but its contribution is incremental: it focuses on known issues without introducing new solutions.
The paper identifies key challenges in analyzing Persian electronic texts, particularly in transcription and encoding during corpus development, highlighting their crucial impact on processing written corpora.
Farsi, also known as Persian, is the official language of Iran and Tajikistan and one of the two main languages spoken in Afghanistan. Farsi uses a unified Arabic script as its writing system. In this paper we briefly introduce the writing standards of Farsi and highlight the problems one faces when analyzing Farsi electronic texts, especially those concerning the transcription and encoding of Farsi e-texts during corpus development. The points mentioned may sound simple, but they are crucial when developing and processing written corpora of Farsi.
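
As an illustration of the kind of encoding problem meant here, consider a well-known case that the abstract itself does not name: Farsi e-texts typed with Arabic keyboard layouts often contain ARABIC LETTER YEH (U+064A) and ARABIC LETTER KAF (U+0643) where ARABIC LETTER FARSI YEH (U+06CC) and ARABIC LETTER KEHEH (U+06A9) are expected. The words look alike on screen but do not match as strings. The sketch below is a minimal assumption-based example, not the method of this paper; the names `ARABIC_TO_PERSIAN` and `normalize_persian` are hypothetical.

```python
# Minimal sketch: unify two Arabic code points with their Persian counterparts.
# The mapping is an illustrative assumption, not an exhaustive or standard list.
ARABIC_TO_PERSIAN = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}

# Build the translation table once; str.translate applies it per character.
_TABLE = str.maketrans(ARABIC_TO_PERSIAN)


def normalize_persian(text: str) -> str:
    """Return text with the mapped Arabic code points replaced."""
    return text.translate(_TABLE)


# The same word ("ketab", book) typed with Arabic Kaf and with Persian Keheh.
arabic_typed = "\u0643\u062A\u0627\u0628"
persian_typed = "\u06A9\u062A\u0627\u0628"

# The raw strings differ, so a corpus search or frequency count would
# treat them as two separate words until they are normalized.
raw_match = arabic_typed == persian_typed
normalized_match = normalize_persian(arabic_typed) == normalize_persian(persian_typed)
```

Here `raw_match` is false and `normalized_match` is true, which shows why such a step would have to run before tokenization or counting. A real corpus pipeline would need a larger mapping and a policy for mixed Arabic and Farsi material, which this sketch leaves out.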