CL · LG · Jan 23, 2023

Noisy Parallel Data Alignment

CMU
arXiv:2301.09685v2 · 268 citations · h-index: 33
Originality: Incremental advance
AI Analysis

This work addresses the challenge of processing endangered and under-resourced languages by making word alignment more robust to noise. The advance is incremental because it builds on existing alignment models.

The paper tackles the failure of word alignment models under noisy OCR conditions for under-resourced languages. Its noise simulation and structural biasing method reduces the alignment error rate of a state-of-the-art neural alignment model by up to 59.6%.

An ongoing challenge in current natural language processing is that its major advancements tend to disproportionately favor resource-rich languages, leaving a significant number of under-resourced languages behind. Due to the lack of resources required to train and evaluate models, most modern language technologies are either nonexistent or unreliable for processing endangered, local, and non-standardized languages. Optical character recognition (OCR) is often used to convert endangered language documents into machine-readable data. However, such OCR output is typically noisy, and most word alignment models are not built to work under such noisy conditions. In this work, we study existing word-level alignment models under noisy settings and aim to make them more robust to noisy data. Our noise simulation and structural biasing method, tested on multiple language pairs, reduces the alignment error rate of a state-of-the-art neural alignment model by up to 59.6%.
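For readers unfamiliar with the metric, the sketch below shows how alignment error rate (AER) is commonly computed, following the standard Och and Ney definition, and how a relative reduction such as "up to 59.6%" would be derived from two AER values. The function names, the toy alignments, and the two AER values are illustrative assumptions and are not taken from the paper or its code.

```python
def alignment_error_rate(predicted, sure, possible):
    """Standard AER: 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|).

    predicted, sure, possible are sets of (source_index, target_index) links.
    By convention, every sure link also counts as a possible link.
    """
    possible = possible | sure
    denominator = len(predicted) + len(sure)
    if denominator == 0:
        return 0.0  # nothing predicted and nothing to find
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / denominator


def relative_reduction(baseline_aer, improved_aer):
    """Fraction by which AER drops relative to the baseline."""
    return (baseline_aer - improved_aer) / baseline_aer


# Toy gold standard and system output (hypothetical data).
sure = {(0, 0), (1, 1), (2, 2)}
possible = {(1, 2)}
predicted = {(0, 0), (1, 1), (1, 2), (2, 3)}

aer = alignment_error_rate(predicted, sure, possible)
print(f"AER = {aer:.4f}")

# Made-up AER values, chosen only to show the arithmetic.
print(f"relative reduction = {relative_reduction(0.50, 0.202):.3f}")
```

Lower AER is better, so the reported figure is a relative drop with respect to the baseline model's AER on noisy data, not an absolute difference in percentage points.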

Code Implementations: 1 repo