CL | Jan 10, 2025

Finnish SQuAD: A Simple Approach to Machine Translation of Span Annotations

arXiv:2501.05963v1 | 12 citations | h-index: 4 | NoDaLiDa/Baltic-HLT
Originality: Synthesis-oriented
AI Analysis

This work offers researchers and practitioners a practical way to adapt span-annotated datasets to new languages, though the contribution is incremental because it builds on existing MT services.

The authors address the problem of machine translating datasets with span-level annotations by applying a simple method built on the DeepL MT service. They produce a Finnish version of SQuAD2.0 and train QA models on it. Their evaluations report consistently better translated data and good performance in downstream tasks.

We apply a simple method to machine translate datasets with span-level annotation using the DeepL MT service and its ability to translate formatted documents. Using this method, we produce a Finnish version of the SQuAD2.0 question answering dataset and train QA retriever models on this new dataset. We evaluate the quality of the dataset, and more generally of the MT method, through direct evaluation, indirect comparison to other similar datasets, a backtranslation experiment, and the performance of downstream trained QA models. In all these evaluations, we find that the method of transfer is not only simple to use but also produces consistently better translated data. Given its good performance on the SQuAD dataset, it is likely that the method can be used to translate other similar span-annotated datasets for other tasks and languages as well. All code and data are available under an open license: data at HuggingFace TurkuNLP/squad_v2_fi, code on GitHub TurkuNLP/squad2-fi, and model at HuggingFace TurkuNLP/bert-base-finnish-cased-squad2.
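The idea of carrying a span annotation through translation as document formatting can be illustrated with a minimal sketch. It is not the authors' implementation: the `<b>` marker tags and the helper names `mark_span` and `recover_span` are assumptions made for illustration, and the MT call itself is left out, so a hand-written Finnish string stands in for the translated output.

```python
import re

# Hypothetical marker tags; the markup the authors actually used is not stated here.
OPEN, CLOSE = "<b>", "</b>"


def mark_span(context: str, answer_start: int, answer_text: str) -> str:
    """Wrap the answer span in marker tags before the text is sent to MT."""
    end = answer_start + len(answer_text)
    if context[answer_start:end] != answer_text:
        raise ValueError("answer_start does not point at answer_text")
    return context[:answer_start] + OPEN + answer_text + CLOSE + context[end:]


def recover_span(translated: str):
    """Strip the marker tags from translated text.

    Returns (clean_context, answer_start, answer_text), or None when the
    markers did not survive translation.
    """
    pattern = re.escape(OPEN) + r"(.*?)" + re.escape(CLOSE)
    match = re.search(pattern, translated, flags=re.DOTALL)
    if match is None:
        return None
    answer_text = match.group(1)
    # Text before the opening tag is unchanged, so its offset is the new start.
    answer_start = match.start()
    clean = translated[:match.start()] + answer_text + translated[match.end():]
    return clean, answer_start, answer_text


if __name__ == "__main__":
    marked = mark_span("The capital of Finland is Helsinki.", 26, "Helsinki")
    print(marked)
    # Stand-in for the MT output; no translation service is called here.
    translated = "Suomen pääkaupunki on <b>Helsinki</b>."
    print(recover_span(translated))
```

Returning `None` when the markers are missing lets the caller discard or flag examples whose annotation was lost in translation rather than guess at an answer position.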

