CL, May 17, 2023

Solving Cosine Similarity Underestimation between High Frequency Words by L2 Norm Discounting

arXiv:2305.10610v1, 8 citations
Originality: Incremental advance
AI Analysis

This paper addresses a specific technical issue for NLP researchers and practitioners who use cosine similarity with MLM embeddings. It is an incremental advance: it builds on prior observations without introducing a new paradigm.

The paper tackles the underestimation of cosine similarity between high-frequency words in contextualized embeddings from models such as BERT. It proposes discounting the L2 norm of an embedding according to word frequency, and reports experimental results on a contextualized word similarity dataset showing that the method corrects the underestimation.

Cosine similarity between two words, computed using their contextualised token embeddings obtained from masked language models (MLMs) such as BERT, has been shown to underestimate the actual similarity between those words (Zhou et al., 2022). This similarity underestimation problem is particularly severe for highly frequent words. Although this problem has been noted in prior work, no solution has been proposed thus far. We observe that the L2 norm of contextualised embeddings of a word correlates with its log-frequency in the pretraining corpus. Consequently, the larger L2 norms associated with the highly frequent words reduce the cosine similarity values measured between them, thus underestimating the similarity scores. To solve this issue, we propose a method to discount the L2 norm of a contextualised word embedding by the frequency of that word in a corpus when measuring the cosine similarities between words. We show that the so-called stop words behave differently from the rest of the words and require special consideration during the discounting process. Experimental results on a contextualised word similarity dataset show that our proposed discounting method accurately solves the similarity underestimation problem.
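
The idea of discounting the L2 norm by word frequency can be illustrated with a minimal sketch. The abstract does not state the exact discounting function, so the log-frequency form below, the `alpha` parameter, and all function names are assumptions made for illustration only and are not the authors' implementation. The sketch also ignores the special handling of stop words mentioned above.

```python
import math


def l2_norm(vec):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def cosine_similarity(x, y):
    """Standard cosine similarity: inner product over the product of L2 norms."""
    return dot(x, y) / (l2_norm(x) * l2_norm(y))


def discounted_norm(vec, freq, alpha=0.1):
    """L2 norm shrunk by a factor that grows with log-frequency.

    ASSUMPTION: this discount form is hypothetical; the source only says
    the norm is discounted by the word's corpus frequency.
    """
    if freq < 1:
        raise ValueError("freq must be a corpus count of at least 1")
    return l2_norm(vec) / (1.0 + alpha * math.log(freq))


def discounted_cosine_similarity(x, y, freq_x, freq_y, alpha=0.1):
    """Cosine similarity computed with frequency-discounted norms.

    Because the denominator shrinks, the score rises for frequent words
    and is no longer guaranteed to stay within [-1, 1].
    """
    return dot(x, y) / (
        discounted_norm(x, freq_x, alpha) * discounted_norm(y, freq_y, alpha)
    )
```

With `freq_x = freq_y = 1` or `alpha = 0`, the discounted score reduces to the standard cosine similarity; for larger frequencies and a positive inner product, the score increases, which is the direction of correction the abstract describes.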

Code Implementations: 1 repo
