TajPersLexon: A Tajik-Persian Lexical Resource and Hybrid Model for Cross-Script Low-Resource NLP
This work provides a new resource and benchmark for low-resource Tajik-Persian NLP, but the task is shown to be essentially solvable with existing methods.
The paper introduces TajPersLexon, a Tajik-Persian lexical resource of 40,112 pairs, and benchmarks methods for cross-script lexical retrieval, achieving 98-99% top-1 accuracy with neural and retrieval baselines, while a hybrid model reaches 96.4% accuracy in an OCR post-correction task.
This work introduces TajPersLexon, a curated Tajik-Persian parallel lexical resource of 40,112 word and short-phrase pairs for cross-script lexical retrieval, transliteration, and alignment in low-resource settings. We conduct a comprehensive CPU-only benchmark comparing three methodological families: (i) a lightweight hybrid pipeline, (ii) neural sequence-to-sequence models, and (iii) retrieval methods. Our evaluation establishes that the task is essentially solvable, with neural and retrieval baselines achieving 98-99% top-1 accuracy. Crucially, we demonstrate that while large multilingual sentence transformers fail on this exact lexical matching task, our interpretable hybrid model offers a favorable accuracy-efficiency trade-off for practical applications, achieving 96.4% accuracy in an OCR post-correction task. All experiments use fixed random seeds for full reproducibility. The dataset, code, and models will be publicly released.
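The headline metric above, top-1 accuracy, can be made concrete with a minimal sketch. It assumes each Tajik query has already been mapped to a ranked list of Persian candidates by some retrieval or neural system. The function name `top1_accuracy` and the toy word pairs are illustrative assumptions, not part of the released code or data.

```python
from typing import Sequence


def top1_accuracy(
    ranked_candidates: Sequence[Sequence[str]],
    gold: Sequence[str],
) -> float:
    """Fraction of queries whose highest-ranked candidate equals the gold target."""
    if len(ranked_candidates) != len(gold):
        raise ValueError("ranked_candidates and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty evaluation set")
    # A query counts as a hit only if it has at least one candidate
    # and the first candidate matches the gold form exactly.
    hits = sum(
        1
        for candidates, target in zip(ranked_candidates, gold)
        if candidates and candidates[0] == target
    )
    return hits / len(gold)


# Toy example (hypothetical, not drawn from TajPersLexon).
# Gold Persian forms for the Tajik queries "китоб", "дӯст", "об".
gold = ["کتاب", "دوست", "آب"]
# Hypothetical ranked system outputs; the second query is ranked incorrectly.
ranked = [["کتاب", "کتب"], ["دست", "دوست"], ["آب"]]

score = top1_accuracy(ranked, gold)
print(f"top-1 accuracy: {score:.3f}")
```

Exact string equality is used here because the task is described as exact lexical matching. A real evaluation would likely normalize Unicode variants of Persian letters first, which this sketch leaves out.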