CL · Dec 12, 2024

Neural Text Normalization for Luxembourgish using Real-Life Variation Data

arXiv:2412.09383v223 citations · h-index: 6 · COLING Workshops
Originality: Synthesis-oriented
AI Analysis

This paper addresses the lack of NLP tools for Luxembourgish, which stems from limited annotated data and ongoing standardization. The contribution is incremental: it applies existing methods to a new language domain.

The paper tackles orthographic variation in Luxembourgish texts by developing the first sequence-to-sequence normalization models for the language, using the ByT5 and mT5 architectures trained on real-life variation data, and shows that this is an effective approach for tailor-made normalization.

Orthographic variation is very common in Luxembourgish texts due to the absence of a fully-fledged standard variety. Additionally, developing NLP tools for Luxembourgish is a difficult task given the lack of annotated and parallel data, which is exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models using the ByT5 and mT5 architectures with training data obtained from word-level real-life variation data. We perform a fine-grained, linguistically-motivated evaluation to test byte-based, word-based and pipeline-based models for their strengths and weaknesses in text normalization. We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.
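
To make the data setup in the abstract more concrete, here is a minimal sketch of one way word-level variation data could be turned into sentence-level source/target pairs for a sequence-to-sequence normalizer. This is an assumption for illustration, not the paper's actual pipeline: the lexicon `VARIANTS`, the functions `make_pair` and `to_byte_ids`, and the replacement probability `p` are all hypothetical, and the lexicon entries are toy examples rather than items from the paper's data. The second helper shows byte-level encoding in the style of ByT5 (UTF-8 byte value plus an offset reserved for special tokens), which is why a byte-based model sees spelling variants as sequences that differ in only a few positions.

```python
import random

# Hypothetical toy lexicon: standard form -> observed non-standard variants.
# Illustrative entries only; they are not taken from the paper's data.
VARIANTS = {
    "Lëtzebuerg": ["Letzebuerg", "Letzeburg"],
    "sinn": ["sin"],
}


def make_pair(standard_sentence, variants, p=0.5, rng=None):
    """Return a (noisy_source, standard_target) pair.

    Each word that has known variants is replaced by one of them
    with probability p; all other words are copied unchanged.
    """
    rng = rng or random.Random(0)
    noisy = []
    for word in standard_sentence.split():
        options = variants.get(word)
        if options and rng.random() < p:
            noisy.append(rng.choice(options))
        else:
            noisy.append(word)
    return " ".join(noisy), standard_sentence


def to_byte_ids(text, offset=3):
    """Encode text as UTF-8 bytes shifted by an offset for special tokens.

    The default offset of 3 mirrors the ByT5 convention (assumed here).
    """
    return [b + offset for b in text.encode("utf-8")]


if __name__ == "__main__":
    source, target = make_pair("Mir sinn zu Lëtzebuerg", VARIANTS, p=1.0)
    print(source, "->", target)
    print(to_byte_ids("ë"))  # a single character can span several bytes
```

A word-based or subword model such as mT5 would tokenize the same pairs differently, so the pair-construction step is independent of which architecture is fine-tuned on the result.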

