CL · Nov 27, 2019

word2word: A Collection of Bilingual Lexicons for 3,564 Language Pairs

arXiv:1911.12019v1 · 999 citations · Has Code
Originality: Synthesis-oriented
AI Analysis

This work provides a practical resource for researchers and developers who need bilingual lexicons, though its contribution is incremental: it builds on existing parallel corpora and extraction methods.

The authors address cross-lingual word translation by releasing word2word, a dataset and Python package that provides top-k word translations for 3,564 directed language pairs across 62 languages. The resulting lexicons have high coverage and attain competitive translation quality for several language pairs.

We present word2word, a publicly available dataset and an open-source Python package for cross-lingual word translations extracted from sentence-level parallel corpora. Our dataset provides top-k word translations in 3,564 (directed) language pairs across 62 languages in OpenSubtitles2018 (Lison et al., 2018). To obtain this dataset, we use a count-based bilingual lexicon extraction model based on the observation that not only source and target words but also source words themselves can be highly correlated. We illustrate that the resulting bilingual lexicons have high coverage and attain competitive translation quality for several language pairs. We wrap our dataset and model in an easy-to-use Python library, which supports downloading and retrieving top-k word translations in any of the supported language pairs as well as computing top-k word translations for custom parallel corpora.
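To make the idea of count-based lexicon extraction from sentence-level parallel data concrete, here is a minimal sketch. It is not the authors' model: it ranks target words by a plain Dice coefficient over sentence-level co-occurrence counts and does not correct for correlations among source words, which the abstract names as the key observation behind their method. The function name `build_lexicon`, the whitespace tokenization, and the toy English-French corpus are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def build_lexicon(sentence_pairs, top_k=2):
    """Rank candidate translations by Dice score over sentence co-occurrence.

    Hypothetical helper for illustration; NOT the word2word implementation.
    """
    src_count = Counter()          # number of sentence pairs containing each source word
    tgt_count = Counter()          # number of sentence pairs containing each target word
    cooc = defaultdict(Counter)    # cooc[x][y]: pairs where x and y appear together

    for src, tgt in sentence_pairs:
        # Naive whitespace tokenization (an assumption); use sets so each
        # word is counted at most once per sentence pair.
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        for x in src_words:
            cooc[x].update(tgt_words)

    lexicon = {}
    for x, targets in cooc.items():
        # Dice coefficient: 2 * |x and y| / (|x| + |y|)
        scored = {
            y: 2 * c / (src_count[x] + tgt_count[y])
            for y, c in targets.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scored, key=lambda y: (-scored[y], y))
        lexicon[x] = ranked[:top_k]
    return lexicon


# Toy parallel corpus (made up for this sketch).
TOY_PAIRS = [
    ("the cat sleeps", "le chat dort"),
    ("the dog sleeps", "le chien dort"),
    ("the cat eats", "le chat mange"),
    ("a dog eats", "un chien mange"),
]

if __name__ == "__main__":
    lexicon = build_lexicon(TOY_PAIRS, top_k=2)
    for word in ("cat", "dog", "sleeps", "eats"):
        print(word, "->", lexicon[word])
```

On real subtitle data, a raw co-occurrence score like this tends to be distracted by words that frequently accompany the source word, which is the kind of confound the abstract's remark about correlated source words points to.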

Code Implementations: 2 repos