Unsupervised Word Polysemy Quantification with Multiresolution Grids of Contextual Embeddings
The method gives linguists and NLP researchers a data-driven, language-agnostic tool for estimating how many senses a word has, though the contribution is incremental in that it mainly automates rankings already available from human-built resources.
The paper tackles the subjective problem of quantifying word polysemy by proposing a novel unsupervised method based on simple geometry in the contextual embedding space. Its rankings correlate, with strong statistical significance, with six rankings derived from human-constructed resources, across six standard metrics.
The number of senses of a given word, or polysemy, is a very subjective notion, which varies widely across annotators and resources. We propose a novel method to estimate polysemy, based on simple geometry in the contextual embedding space. Our approach is fully unsupervised and purely data-driven. We show through rigorous experiments that our rankings are well correlated (with strong statistical significance) with 6 different rankings derived from famous human-constructed resources such as WordNet, OntoNotes, Oxford, Wikipedia, etc., for 6 different standard metrics. We also visualize and analyze the correlation between the human rankings. A valuable by-product of our method is the ability to sample, at no extra cost, sentences containing different senses of a given word. Finally, the fully unsupervised nature of our method makes it applicable to any language. Code and data are publicly available at https://github.com/ksipos/polysemy-assessment. The paper was accepted as a long paper at EACL 2021.
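The abstract does not spell out the scoring rule, so the following is only a minimal sketch of the general idea suggested by the title: overlay grids of increasing resolution on a word's contextual embeddings and measure how much of the space the word occupies. It assumes the embeddings have already been reduced to a few dimensions and rescaled to [0, 1]. The function name `grid_coverage_score`, the `max_level` parameter, and the plain averaging across levels are all illustrative assumptions, not the authors' formulation.

```python
from typing import Sequence


def grid_coverage_score(points: Sequence[Sequence[float]], max_level: int = 4) -> float:
    """Hypothetical polysemy proxy (an assumption, not the paper's exact score).

    points: low-dimensional contextual embeddings of one word, each
            coordinate already rescaled to the interval [0, 1].
    Returns the fraction of occupied grid cells, averaged over resolutions.
    """
    if not points:
        raise ValueError("need at least one embedding")
    dim = len(points[0])
    fractions = []
    for level in range(1, max_level + 1):
        bins = 2 ** level  # each axis is split into 2^level intervals
        # Map every point to its cell; clamp so that 1.0 lands in the last bin.
        occupied = {
            tuple(min(int(x * bins), bins - 1) for x in point)
            for point in points
        }
        fractions.append(len(occupied) / bins ** dim)
    return sum(fractions) / len(fractions)


# Toy data: one word whose occurrences all sit at the same spot, and one
# whose occurrences are spread over four distant regions.
single_region = [[0.3, 0.3]] * 50
four_regions = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

print(grid_coverage_score(single_region))
print(grid_coverage_score(four_regions))
```

Under these assumptions, the word spread over four regions scores higher than the word confined to one, which matches the intuition that a more polysemous word covers more of the embedding space. The set of occupied cells also indicates which occurrences fall in different regions, which is one way the sentence-sampling by-product mentioned above could be realized.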