CL | Feb 16

Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography

arXiv:2602.14675v1 | 0.6 | h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses the evaluation of LLMs on endangered languages with non-standard orthography. The advance is incremental because it is limited to a single language and a single 145-sentence dataset.

The authors evaluate large language models on non-standard orthography using a crowdsourced Piedmontese dataset of 145 parallel sentences with manual word alignment. Their analysis shows a tokenization penalty for Piedmontese relative to higher-resource Romance languages, while classification performance approaches that of Italian, French, and English. Machine translation results are asymmetric: models translate adequately from Piedmontese but struggle to generate into it.

We present a crowdsourced dataset for Piedmontese, an endangered Romance language of northwestern Italy. The dataset comprises 145 Italian-Piedmontese parallel sentences derived from Flores+, with translations produced by speakers writing in their natural orthographic style rather than adhering to standardized conventions, along with manual word alignment. We use this resource to benchmark several large language models on tokenization parity, topic classification, and machine translation. Our analysis reveals that Piedmontese incurs a tokenization penalty relative to higher-resource Romance languages, yet LLMs achieve classification performance approaching that of Italian, French, and English. Machine translation results are asymmetric: models translate adequately from Piedmontese into high-resource languages, but generation into Piedmontese remains challenging. The dataset and code are publicly released.
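The abstract does not say how tokenization parity is computed. A common reading is the ratio of token counts for the same sentence in two languages, averaged over a parallel corpus, where a value above 1 means the first language needs more tokens. The sketch below illustrates that reading only. The names `toy_tokenize` and `tokenization_parity` are hypothetical, the toy tokenizer is a stand-in for a real model tokenizer, and the strings are placeholders rather than dataset entries.

```python
from statistics import mean
from typing import Callable, Iterable, List, Tuple


def toy_tokenize(text: str, max_piece: int = 4) -> List[str]:
    """Hypothetical stand-in for a subword tokenizer.

    Splits on whitespace, then cuts each word into pieces of at most
    `max_piece` characters. A real study would call the model's tokenizer.
    """
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + max_piece] for i in range(0, len(word), max_piece))
    return pieces


def tokenization_parity(
    pairs: Iterable[Tuple[str, str]],
    tokenize: Callable[[str], List[str]] = toy_tokenize,
) -> float:
    """Mean ratio of token counts (first language / second language).

    Assumed definition, not taken from the paper: each pair holds the same
    sentence in two languages, and the per-sentence ratios are averaged.
    """
    ratios = []
    for first, second in pairs:
        denominator = len(tokenize(second))
        if denominator == 0:
            raise ValueError("reference sentence produced no tokens")
        ratios.append(len(tokenize(first)) / denominator)
    return mean(ratios)


# Placeholder strings, not real Piedmontese or Italian sentences.
example_pairs = [
    ("abcdefgh ij", "abcd ij"),  # 3 pieces vs 2 pieces
    ("ab", "ab"),                # 1 piece vs 1 piece
]
parity = tokenization_parity(example_pairs)
```

To use this with an actual model, pass that model's tokenizer as the `tokenize` argument, wrapped so that it returns a list of tokens for a string.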
