LG, CL, IR, ML · Jun 26, 2019

Hierarchical Optimal Transport for Document Representation

arXiv:1906.10827v2 · 106 citations
Originality: Incremental advance
AI Analysis

This paper addresses the need for scalable, interpretable document similarity measures for analyzing large corpora. It is an incremental improvement over existing methods.

The paper measures document similarity with hierarchical optimal transport, which models documents as distributions over topics and topics as distributions over words. It reports performance comparable to current methods, with better interpretability and scalability at reduced cost.

The ability to measure similarity between documents enables intelligent summarization and analysis of large corpora. Past distances between documents suffer from either an inability to incorporate semantic similarities between words or from scalability issues. As an alternative, we introduce hierarchical optimal transport as a meta-distance between documents, where documents are modeled as distributions over topics, which themselves are modeled as distributions over words. We then solve an optimal transport problem on the smaller topic space to compute a similarity score. We give conditions on the topics under which this construction defines a distance, and we relate it to the word mover's distance. We evaluate our technique for k-NN classification and show better interpretability and scalability with comparable performance to current methods at a fraction of the cost.
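The two-level construction described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the four 2-D word vectors, the three topics, the document topic proportions, and the helper names `ot_cost`, `topic_cost_matrix`, and `hott` are hypothetical and not taken from the paper or its repository. The transport problems are solved with SciPy's general-purpose `linprog` solver, which is a stand-in rather than the authors' solver. The cost between two topics is taken to be the optimal transport cost between their word distributions under Euclidean distance between word vectors, which is one reading of how the abstract relates the construction to the word mover's distance.

```python
import numpy as np
from scipy.optimize import linprog


def ot_cost(a, b, cost):
    """Exact optimal transport cost between histograms a and b (each sums to 1)."""
    n, m = cost.shape
    # The transport plan is flattened row-major.
    # Its row sums must equal a and its column sums must equal b.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.fun


def topic_cost_matrix(topics, word_cost):
    """Pairwise transport cost between topics, each a distribution over words."""
    k = len(topics)
    C = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            C[i, j] = C[j, i] = ot_cost(topics[i], topics[j], word_cost)
    return C


# Toy vocabulary of 4 words with made-up 2-D embeddings (assumption).
word_vecs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
word_cost = np.linalg.norm(word_vecs[:, None, :] - word_vecs[None, :, :], axis=-1)

# Three made-up topics; each row is a distribution over the 4 words (assumption).
topics = np.array([
    [0.7, 0.3, 0.0, 0.0],
    [0.0, 0.2, 0.8, 0.0],
    [0.0, 0.0, 0.1, 0.9],
])

# Level 1: cost between topics, computed once and reused for every document pair.
topic_cost = topic_cost_matrix(topics, word_cost)


def hott(doc_a, doc_b):
    """Level 2: transport between document topic proportions on the topic space."""
    return ot_cost(doc_a, doc_b, topic_cost)


# Two made-up documents expressed as topic proportions (assumption).
doc_a = np.array([0.6, 0.4, 0.0])
doc_b = np.array([0.1, 0.3, 0.6])
print(hott(doc_a, doc_b))
```

In this sketch the per-pair problem has only as many variables as the square of the number of topics, independent of vocabulary size, which is the sense in which solving on the smaller topic space is cheaper than transporting words directly.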

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
