LG · CL · IR · May 30, 2021

Re-evaluating Word Mover's Distance

arXiv:2105.14403v3 · 25 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses a potential evaluation issue in document similarity measurement for researchers and practitioners, but it is incremental as it revisits and refines existing methods.

The paper re-evaluates Word Mover's Distance (WMD) and finds that classical baselines like bag-of-words and TF-IDF are competitive with WMD when using appropriate preprocessing such as L1 normalization, challenging the original study's claims of significant outperformance.

The word mover's distance (WMD) is a fundamental technique for measuring the similarity of two documents. The crux of WMD is that it can take advantage of the underlying geometry of the word space by employing an optimal transport formulation. The original study on WMD reported that WMD outperforms classical baselines such as bag-of-words (BOW) and TF-IDF by significant margins on various datasets. In this paper, we point out that the evaluation in the original study could be misleading. We re-evaluate the performance of WMD and the classical baselines and find that the classical baselines are competitive with WMD if we employ appropriate preprocessing, i.e., L1 normalization. In addition, we introduce an analogy between WMD and L1-normalized BOW and find that not only the performance of WMD but also its distance values resemble those of BOW in high-dimensional spaces.
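To make the preprocessing step concrete, here is a minimal sketch of L1-normalized BOW compared with the L1 distance. It is an illustration only, not the paper's code: the helper names (`raw_bow`, `l1_normalized_bow`, `l1_distance`, `unit_cost_transport`) and the toy documents are assumptions made for this example. The last helper relies on a standard optimal transport fact: when moving mass between any two distinct words costs 1 and leaving it in place costs 0, the optimal transport cost equals half the L1 distance between the normalized histograms. The paper's own analogy between WMD and BOW may be stated differently.

```python
from collections import Counter

def raw_bow(tokens, vocab):
    """Unnormalized bag-of-words counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def l1_normalized_bow(tokens, vocab):
    """Bag-of-words counts divided by their sum, so the entries add up to 1.
    Assumes the document contains at least one vocabulary word."""
    counts = raw_bow(tokens, vocab)
    total = sum(counts)
    return [c / total for c in counts]

def l1_distance(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def unit_cost_transport(x, y):
    """Optimal transport cost between two normalized histograms under a
    hypothetical ground cost of 1 between distinct words and 0 otherwise:
    mass shared by both histograms stays put, the rest moves at cost 1."""
    return 1.0 - sum(min(a, b) for a, b in zip(x, y))

# Toy documents (illustrative only).
doc_a = "the cat sat".split()
doc_b = "the cat sat the cat sat".split()  # doc_a repeated twice
doc_c = "the dog ran".split()
vocab = sorted(set(doc_a) | set(doc_b) | set(doc_c))

# Without normalization, document length alone separates doc_a and doc_b.
raw_ab = l1_distance(raw_bow(doc_a, vocab), raw_bow(doc_b, vocab))

# After L1 normalization, the two documents have the same histogram.
norm_ab = l1_distance(l1_normalized_bow(doc_a, vocab),
                      l1_normalized_bow(doc_b, vocab))

# Under the unit ground cost, transport cost is half the L1 distance.
a, c = l1_normalized_bow(doc_a, vocab), l1_normalized_bow(doc_c, vocab)
norm_ac = l1_distance(a, c)
transport_ac = unit_cost_transport(a, c)

print(raw_ab, norm_ab, norm_ac, transport_ac)
```

WMD operates on normalized word histograms by construction, so comparing it against unnormalized BOW mixes the effect of the distance with the effect of document length. Normalizing the baseline, as sketched above, removes that difference.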

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
