Less is More: Adapting Text Embeddings for Low-Resource Languages with Small Scale Noisy Synthetic Data
This work addresses the challenge of building high-performance text embeddings for resource-constrained language communities, such as speakers of Armenian, by demonstrating that a small amount of noisy synthetic data suffices. The contribution is incremental but impactful for democratizing access to such models.
The paper tackles the problem of training effective text embedding models for low-resource languages. It shows that fine-tuning on a small set of noisy synthetic data (10,000 pairs) yields 11-12% average improvements across the benchmark, with a 20%+ relative improvement in retrieval performance, and matches the performance of models trained on much larger datasets.
Low-resource languages (LRLs) often lack high-quality, large-scale datasets for training effective text embedding models, hindering their application in tasks like retrieval-augmented generation (RAG) and semantic search. In this work, we challenge the prevailing assumption that effective semantic alignment requires massive datasets or pristine, human-verified translations. Focusing on Armenian (an LRL with a unique script), we introduce a cost-effective adaptation strategy using small scale noisy synthetic data generated by translating English Reddit title-body pairs with open-weights models. We establish a comprehensive evaluation benchmark comprising existing datasets, translated data, and a manually curated dataset. Our experiments reveal a surprising "Less is More" phenomenon: fine-tuning a multilingual encoder (mE5) on just 10,000 noisy synthetic pairs yields 11-12% average improvements across the benchmark with a 20%+ relative improvement in retrieval performance, matching the performance of models trained on ~1 million examples. Furthermore, we demonstrate that neither increasing data scale, improving translation quality via state-of-the-art LLMs, nor diversifying data domains yields significant gains over this minimal baseline. We validate the generalizability of these findings on another LRL with a unique script. Our results suggest that semantic alignment for LRLs saturates early and is highly robust to noise, democratizing high-performance embedding creation for resource-constrained communities. We release the model, data, and the benchmark at https://metric-ai-lab.github.io/less-is-more-embeddings/ to facilitate further research.
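The abstract does not state the training objective used to fine-tune on title-body pairs. As an illustration only, the sketch below assumes a contrastive loss with in-batch negatives (InfoNCE), a common choice for paired embedding data: each title should score highest against its own body, and the other bodies in the batch act as negatives. The function names, the toy 2-dimensional vectors, and the temperature of 0.05 are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(title_vecs, body_vecs, temperature=0.05):
    """Mean in-batch-negatives contrastive loss.

    title_vecs[i] and body_vecs[i] are assumed to embed the same post;
    every other body in the batch serves as a negative for title i.
    The temperature value is an assumption, not the paper's setting.
    """
    assert len(title_vecs) == len(body_vecs)
    losses = []
    for i, title in enumerate(title_vecs):
        logits = [cosine(title, body) / temperature for body in body_vecs]
        # Numerically stable log-sum-exp over the whole batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the matching body.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings standing in for encoder outputs (hypothetical values).
titles = [[1.0, 0.0], [0.0, 1.0]]
matching_bodies = [[0.9, 0.1], [0.1, 0.9]]
shuffled_bodies = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce_loss(titles, matching_bodies)
loss_shuffled = info_nce_loss(titles, shuffled_bodies)
print(loss_aligned < loss_shuffled)
```

Under this assumed objective, well-aligned pairs give a loss near zero while mismatched pairs are penalized heavily, so the gradient signal comes from the pairing itself rather than from the exact wording. That is one plausible reading of why noisy translations could still be useful, though the abstract itself does not make this argument.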