CLLGMay 23, 2023

Accessing Higher Dimensions for Unsupervised Word Translation

arXiv:2305.14200v1
Originality Highly original
AI Analysis

This work addresses the challenge of domain mismatch and data efficiency in unsupervised word translation, offering a robust alternative to low-dimensional vector methods.

The paper tackles the problem of unsupervised word translation across languages and domains by proposing coocmap, a method that uses high-dimensional co-occurrence counts, achieving over 50% accuracy with less than 80MB of data and minutes of CPU time for languages like English to Finnish, Hungarian, and Chinese.

The striking ability of unsupervised word translation has been demonstrated with the help of word vectors / pretraining; however, they require large amounts of data and usually fails if the data come from different domains. We propose coocmap, a method that can use either high-dimensional co-occurrence counts or their lower-dimensional approximations. Freed from the limits of low dimensions, we show that relying on low-dimensional vectors and their incidental properties miss out on better denoising methods and useful world knowledge in high dimensions, thus stunting the potential of the data. Our results show that unsupervised translation can be achieved more easily and robustly than previously thought -- less than 80MB and minutes of CPU time is required to achieve over 50\% accuracy for English to Finnish, Hungarian, and Chinese translations when trained on similar data; even under domain mismatch, we show coocmap still works fully unsupervised on English NewsCrawl to Chinese Wikipedia and English Europarl to Spanish Wikipedia, among others. These results challenge prevailing assumptions on the necessity and superiority of low-dimensional vectors, and suggest that similarly processed co-occurrences can outperform dense vectors on other tasks too.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes