The word entropy of natural languages
This work addresses the need for reliable cross-linguistic comparisons in computational linguistics. Its contribution is incremental: it builds on established entropy concepts and applies them with new data and methods.
The study estimates word entropy across languages in two steps. It first uses parallel texts from 21 languages to determine the number of tokens needed for entropy values to stabilize, and then extends the estimates to more than 1000 languages. The results help to establish quantitative language comparisons, to understand the performance of multilingual translation systems, and to normalize semantic similarity measures.
The average uncertainty associated with words is an information-theoretic concept at the heart of quantitative and computational linguistics. Entropy has been established as a measure of this average uncertainty, also called average information content. Here we use parallel texts of 21 languages to establish the number of tokens at which word entropies converge to stable values. These convergence points are then used to select texts from a massively parallel corpus, and to estimate word entropies across more than 1000 languages. Our results help to establish quantitative language comparisons, to understand the performance of multilingual translation systems, and to normalize semantic similarity measures.
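The idea of a token count at which word entropy stabilizes can be illustrated with a minimal sketch. It assumes the simplest possible setup: the plug-in (maximum likelihood) estimator of unigram entropy, and a convergence rule that stops when the estimate changes by less than a tolerance between consecutive prefixes. Neither is confirmed by the abstract as the authors' estimator or criterion, and the names `word_entropy`, `entropy_curve`, `convergence_point`, `step`, and `tol` are hypothetical.

```python
import math
from collections import Counter


def word_entropy(tokens):
    """Plug-in (maximum likelihood) estimate of unigram word entropy, in bits.

    Assumption: relative frequencies stand in for word probabilities.
    This estimator is known to underestimate entropy for small samples.
    """
    n = len(tokens)
    if n == 0:
        raise ValueError("need at least one token")
    counts = Counter(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_curve(tokens, step):
    """Entropy of the first k tokens for k = step, 2*step, ... up to len(tokens)."""
    return [(k, word_entropy(tokens[:k]))
            for k in range(step, len(tokens) + 1, step)]


def convergence_point(curve, tol):
    """First token count whose entropy differs from the previous prefix by < tol.

    Hypothetical criterion chosen for illustration only.
    Returns None if the curve never stabilizes within the given text.
    """
    for (_, h_prev), (k, h) in zip(curve, curve[1:]):
        if abs(h - h_prev) < tol:
            return k
    return None


if __name__ == "__main__":
    # Toy text; a real study would use tokenized parallel corpora.
    text = "the cat sat on the mat and the dog sat on the log".split()
    curve = entropy_curve(text, step=2)
    for k, h in curve:
        print(f"{k:3d} tokens: {h:.3f} bits")
    print("converged at:", convergence_point(curve, tol=0.05))
```

Computing the curve on prefixes of a single text keeps the sketch short. Applied to parallel texts, the same procedure would be run per language so that the resulting token counts are comparable across languages.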