CL | Mar 1, 2024

A Bit of a Problem: Measurement Disparities in Dataset Sizes Across Languages

arXiv:2403.00686v1 | 83 citations | h-index: 11 | SIGUL
Originality: Synthesis-oriented
AI Analysis

This work addresses disparities in how text dataset sizes are measured across languages, a concern for researchers and practitioners in multilingual AI. Its contribution is incremental, since it focuses on one specific encoding issue.

The paper tackles the problem of comparing text dataset sizes across languages. It defines the byte premium, the ratio of bytes needed to encode content-matched (parallel) text in two languages, computes it for 1155 languages, and releases a tool that returns the byte premium for any two languages to support more equitable multilingual model development.

How should text dataset sizes be compared across languages? Even for content-matched (parallel) corpora, UTF-8 encoded text can require a dramatically different number of bytes for different languages. In our work, we define the byte premium between two languages as the ratio of bytes used to encode content-matched text in those languages. We compute byte premiums for 1155 languages, and we use linear regressions to estimate byte premiums for other languages. We release a tool to obtain byte premiums for any two languages, enabling comparisons of dataset sizes across languages for more equitable multilingual model development and data practices.
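The definition in the abstract can be illustrated with a minimal sketch. This is not the authors' released tool: the function names `utf8_bytes` and `byte_premium` are hypothetical, and the one-word "corpus" is a toy example chosen only to show the arithmetic, not data from the paper.

```python
def utf8_bytes(texts):
    """Total number of bytes needed to encode the given strings in UTF-8."""
    return sum(len(t.encode("utf-8")) for t in texts)


def byte_premium(texts_a, texts_b):
    """Ratio of UTF-8 bytes for language A to language B.

    Assumes texts_a and texts_b are content-matched (parallel):
    texts_a[i] is a translation of texts_b[i].
    """
    if len(texts_a) != len(texts_b):
        raise ValueError("parallel corpora must have the same number of segments")
    bytes_b = utf8_bytes(texts_b)
    if bytes_b == 0:
        raise ValueError("reference corpus is empty")
    return utf8_bytes(texts_a) / bytes_b


# Toy parallel pair: each Cyrillic letter takes 2 bytes in UTF-8,
# each ASCII letter takes 1 byte.
russian = ["привет"]  # 6 characters, 12 bytes
english = ["hello"]   # 5 characters, 5 bytes
premium = byte_premium(russian, english)
print(premium)  # 12 / 5 = 2.4
```

Under this definition, a premium above 1 means language A needs more bytes than language B to express the same content, so equal byte counts in the two languages do not represent equal amounts of text.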

Code Implementations: 1 repo
