CL · Oct 31, 2024

Don't Touch My Diacritics

arXiv:2410.24140v2 · 10.815 citations · h-index: 17

Originality: Synthesis-oriented

AI Analysis

This paper addresses a preprocessing issue relevant to multilingual NLP practitioners, but its contribution is incremental: it argues for improving existing practices rather than introducing new methods.

The paper tackles the problem of inconsistent preprocessing of diacritics in multilingual NLP, demonstrating adverse effects on model performance and calling for community adoption of better practices to improve equity.

The common practice of preprocessing text before feeding it into NLP models introduces many decision points which have unintended consequences on model performance. In this opinion piece, we focus on the handling of diacritics in texts originating in many languages and scripts. We demonstrate, through several case studies, the adverse effects of inconsistent encoding of diacritized characters and of removing diacritics altogether. We call on the community to adopt simple but necessary steps across all models and toolkits in order to improve handling of diacritized text and, by extension, increase equity in multilingual NLP.
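The two issues named in the abstract, inconsistent encoding of diacritized characters and outright diacritic removal, can be illustrated with a minimal sketch using Python's standard `unicodedata` module. This is not taken from the paper: the helper names `normalize_text` and `strip_diacritics` are hypothetical, and the choice of NFC as the target form is an assumption made only for illustration.

```python
import unicodedata


def normalize_text(text: str, form: str = "NFC") -> str:
    """Map text to a single Unicode normalization form (NFC assumed here)."""
    return unicodedata.normalize(form, text)


def strip_diacritics(text: str) -> str:
    """Lossy step shown for contrast: drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The same visible word, encoded two different ways.
precomposed = "caf\u00e9"   # 'é' as one code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (U+0301)

# Without normalization the strings differ, so a tokenizer or vocabulary
# lookup may treat them as unrelated inputs.
same_before = precomposed == decomposed

# After normalization both encodings collapse to one representation.
same_after = normalize_text(precomposed) == normalize_text(decomposed)

# Stripping removes information: the accent is gone from the result.
stripped = strip_diacritics(precomposed)
```

Here `same_before` is false while `same_after` is true, which is the encoding inconsistency in miniature. `strip_diacritics` shows why removal is lossy: words that differ only in their diacritics become indistinguishable once the marks are dropped.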
