CL | Apr 30, 2024

Modeling Orthographic Variation in Occitan's Dialects

arXiv:2404.19315v114.428 citations | h-index: 2 | VarDial

Originality: Incremental advance

AI Analysis

This work addresses orthographic variation in low-resource languages and offers a method that reduces pre-processing needs. The advance is incremental because it builds on existing multilingual models.

The study addresses the challenge of normalizing text in low-resource languages, using Occitan dialects as its case, by fine-tuning a multilingual model. It found that surface similarity between dialects strengthened the model's embeddings, and that the model remained robust across dialects in part-of-speech tagging and parsing without spelling normalization.

Effectively normalizing textual data poses a considerable challenge, especially for low-resource languages lacking standardized writing systems. In this study, we fine-tuned a multilingual model with data from several Occitan dialects and conducted a series of experiments to assess the model's representations of these dialects. For evaluation purposes, we compiled a parallel lexicon encompassing four Occitan dialects. Intrinsic evaluations of the model's embeddings revealed that surface similarity between the dialects strengthened representations. When the model was further fine-tuned for part-of-speech tagging and Universal Dependency parsing, its performance was robust to dialectal variation, even when trained solely on part-of-speech data from a single dialect. Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.
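
The abstract does not say how surface similarity between dialect forms was measured. One common proxy is normalized edit distance, and the sketch below illustrates that idea only. The function names `levenshtein` and `surface_similarity` are hypothetical, and the word pair in the usage line is an illustrative spelling pair, not an entry from the paper's lexicon.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (two-row dynamic programme)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            # Cost of turning a[:i] into b[:j] by substituting the last char.
            substitution = previous[j - 1] + (char_a != char_b)
            # Compare against deleting from a or inserting into a.
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def surface_similarity(a: str, b: str) -> float:
    """Assumed proxy: 1 minus edit distance over the longer length.

    Returns 1.0 for identical strings; lower values mean more edits.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    # Illustrative pair of spelling variants (an assumption, not paper data).
    print(surface_similarity("nuech", "nuèch"))
```

Under this assumption, a parallel lexicon would yield one score per dialect pair and word, which could then be compared against the similarity of the model's embeddings for the same word pairs.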
