CL · May 7

TajPersLexon: A Tajik-Persian Lexical Resource and Hybrid Model for Cross-Script Low-Resource NLP

arXiv:2605.06886 · 6.8
Predicted impact top 86% in CL · last 90 days
Originality: Synthesis-oriented
AI Analysis

This work provides a new resource and benchmark for low-resource Tajik-Persian NLP, but the task is shown to be essentially solvable with existing methods.

The paper introduces TajPersLexon, a Tajik-Persian lexical resource of 40,112 pairs, and benchmarks methods for cross-script lexical retrieval, achieving 98-99% top-1 accuracy with neural and retrieval baselines, while a hybrid model reaches 96.4% accuracy in an OCR post-correction task.

This work introduces TajPersLexon, a curated Tajik-Persian parallel lexical resource of 40,112 word and short-phrase pairs for cross-script lexical retrieval, transliteration, and alignment in low-resource settings. We conduct a comprehensive CPU-only benchmark comparing three methodological families: (i) a lightweight hybrid pipeline, (ii) neural sequence-to-sequence models, and (iii) retrieval methods. Our evaluation establishes that the task is essentially solvable, with neural and retrieval baselines achieving 98-99% top-1 accuracy. Crucially, we demonstrate that while large multilingual sentence transformers fail on this exact lexical matching, our interpretable hybrid model offers a favorable accuracy-efficiency trade-off for practical applications, achieving 96.4% accuracy in an OCR post-correction task. All experiments use fixed random seeds for full reproducibility. The dataset, code, and models will be publicly released.
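
The abstract reports results as top-1 accuracy on cross-script lexical retrieval. As a rough illustration of what that metric measures, the sketch below scores a retriever on (Tajik, Persian) pairs. Everything in it is an assumption made for illustration: the names `top1_accuracy`, `toy_retrieve`, and `TOY_LEXICON` are hypothetical, the four word pairs are hand-picked examples rather than entries taken from TajPersLexon, and the dictionary lookup merely stands in for the paper's hybrid, neural, or retrieval systems.

```python
from typing import Callable, Sequence, Tuple

Pair = Tuple[str, str]  # (Tajik word in Cyrillic, Persian word in Perso-Arabic script)


def top1_accuracy(pairs: Sequence[Pair], retrieve: Callable[[str], str]) -> float:
    """Fraction of source words whose single best candidate equals the gold target."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    hits = sum(1 for source, gold in pairs if retrieve(source) == gold)
    return hits / len(pairs)


# Hand-picked illustrative pairs (book, water, bread, friend); not drawn from the dataset.
toy_pairs = [
    ("китоб", "کتاب"),
    ("об", "آب"),
    ("нон", "نان"),
    ("дӯст", "دوست"),
]

# Hypothetical stand-in retriever: an exact-match lookup that knows only the
# first three pairs, so the fourth query is a guaranteed miss.
TOY_LEXICON = dict(toy_pairs[:3])


def toy_retrieve(tajik_word: str) -> str:
    """Return the stored Persian form, or an empty string when the word is unknown."""
    return TOY_LEXICON.get(tajik_word, "")


accuracy = top1_accuracy(toy_pairs, toy_retrieve)
print(f"top-1 accuracy: {accuracy:.2f}")
```

Because the metric counts only exact matches on the top candidate, a system that returns a near-miss spelling gets no credit, which is one plausible reading of why the abstract describes this as exact lexical matching.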

