CL LG May 12, 2025

TiSpell: A Semi-Masked Methodology for Tibetan Spelling Correction covering Multi-Level Error with Data Augmentation

Yutong Liu, Feng Xiao, Ziyue Zhang, Yongbin Yu, Cheng Huang, Fan Gao, Xiangxiang Wang, Ma-bao Ban, Manping Fan, Thupten Tsering, Cheng Huang, Gadeng Luosang

arXiv:2505.08037v28.35 citations h-index: 17 Has Code

Originality: Incremental advance

AI Analysis

This work addresses spelling correction for Tibetan text, but the advance is incremental: it extends existing correction methods to cover multi-level (character- and syllable-level) errors.

The paper tackles the problem of multi-level Tibetan spelling correction by proposing TiSpell, a semi-masked model that corrects errors at both character and syllable levels, and it outperforms baselines and matches state-of-the-art performance on simulated and real-world data.

Multi-level Tibetan spelling correction addresses errors at both the character and syllable levels within a unified model. Existing methods focus mainly on single-level correction and lack effective integration of both levels. Moreover, there are no open-source datasets or augmentation methods tailored for this task in Tibetan. To tackle this, we propose a data augmentation approach using unlabeled text to generate multi-level corruptions, and introduce TiSpell, a semi-masked model capable of correcting both character- and syllable-level errors. Although syllable-level correction is more challenging due to its reliance on global context, our semi-masked strategy simplifies this process. We synthesize nine types of corruptions on clean sentences to create a robust training set. Experiments on both simulated and real-world data demonstrate that TiSpell, trained on our dataset, outperforms baseline models and matches the performance of state-of-the-art approaches, confirming its effectiveness.
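The augmentation idea in the abstract (corrupting clean, unlabeled sentences at both the character and syllable level to obtain training pairs) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the specific edit operations, and the 50/50 sampling are hypothetical and are not the nine corruption types used by the authors, which the abstract does not enumerate. The only linguistic fact relied on is that Tibetan syllables are delimited by the tsheg mark (U+0F0B).

```python
import random

TSHEG = "\u0f0b"  # Tibetan syllable delimiter (tsheg)


def split_syllables(sentence):
    """Split a sentence into syllables on the tsheg, dropping empty pieces."""
    return [s for s in sentence.split(TSHEG) if s]


def corrupt_char(syllable, rng, alphabet):
    """Apply one random character-level edit inside a single syllable.

    Hypothetical operations; an edit may occasionally be a no-op
    (e.g. substituting a character with itself).
    """
    op = rng.choice(["delete", "insert", "substitute", "transpose"])
    chars = list(syllable)
    i = rng.randrange(len(chars))
    if op == "delete" and len(chars) > 1:
        del chars[i]
    elif op == "insert":
        chars.insert(i, rng.choice(alphabet))
    elif op == "substitute":
        chars[i] = rng.choice(alphabet)
    elif op == "transpose" and len(chars) > 1:
        j = i + 1 if i + 1 < len(chars) else i - 1
        chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


def corrupt_syllable(syllables, rng):
    """Apply one random syllable-level edit to the syllable sequence."""
    op = rng.choice(["delete", "swap", "duplicate"])
    out = list(syllables)
    i = rng.randrange(len(out))
    if op == "delete" and len(out) > 1:
        del out[i]
    elif op == "swap" and len(out) > 1:
        j = i + 1 if i + 1 < len(out) else i - 1
        out[i], out[j] = out[j], out[i]
    elif op == "duplicate":
        out.insert(i, out[i])
    return out


def make_pair(clean, rng, p_char=0.5):
    """Return a (corrupted, clean) training pair from one clean sentence."""
    syllables = split_syllables(clean)
    if not syllables:
        return clean, clean
    # Sample replacement characters from the sentence itself (an assumption;
    # a real pipeline would more likely use a full Tibetan character inventory).
    alphabet = sorted(set("".join(syllables)))
    if rng.random() < p_char:
        i = rng.randrange(len(syllables))
        syllables[i] = corrupt_char(syllables[i], rng, alphabet)
    else:
        syllables = corrupt_syllable(syllables, rng)
    suffix = TSHEG if clean.endswith(TSHEG) else ""
    return TSHEG.join(syllables) + suffix, clean
```

Calling `make_pair("བཀྲ་ཤིས་བདེ་ལེགས", random.Random(0))` yields one noisy/clean pair; passing a seeded `random.Random` keeps the generated corpus reproducible. A correction model would then be trained to map the first element of each pair back to the second.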
