CL · Aug 4, 2025

A French Version of the OLDI Seed Corpus

Malik Marmonier, Benoît Sagot, Rachel Bawden

arXiv:2508.02290v16.72 citations · h-index: 18 · Proceedings of the Tenth Conference on Machine Translation

Originality: Synthesis-oriented

AI Analysis

This work provides a pivot resource to help collect parallel corpora for under-resourced regional languages in France, but it is incremental as it adapts an existing corpus to a new language.

The authors created the first French version of the OLDI Seed Corpus by using machine translation and post-editing to address translation challenges from technical and user-generated Wikipedia content, resulting in a resource submitted to WMT 2025.

We present the first French partition of the OLDI Seed Corpus, our submission to the WMT 2025 Open Language Data Initiative (OLDI) shared task. We detail its creation process, which involved using multiple machine translation systems and a custom-built interface for post-editing by qualified native speakers. We also highlight the unique translation challenges presented by the source data, which combines highly technical, encyclopedic terminology with the stylistic irregularities characteristic of user-generated content taken from Wikipedia. This French corpus is not an end in itself, but is intended as a crucial pivot resource to facilitate the collection of parallel corpora for the under-resourced regional languages of France.
