CL · AT · Mar 30, 2024

The Shape of Word Embeddings: Quantifying Non-Isometry With Topological Data Analysis

arXiv:2404.00500v2 · 27 citations · h-index: 3 · EMNLP
Originality: Incremental advance
AI Analysis

This work analyzes relationships between languages through the shape of their embedding clouds. The approach is new to computational linguistics, but the contribution is incremental: it applies TDA to a new domain.

The paper quantifies non-isometry in word embeddings using persistent homology from topological data analysis, measuring distances between languages from the shape of their embedding clouds. The phylogenetic trees reconstructed from these distances for 81 Indo-European languages show strong, statistically significant similarity to reference trees.

Word embeddings represent language vocabularies as clouds of $d$-dimensional points. We investigate how information is conveyed by the general shape of these clouds, instead of representing the semantic meaning of each token. Specifically, we use the notion of persistent homology from topological data analysis (TDA) to measure the distances between language pairs from the shape of their unlabeled embeddings. These distances quantify the degree of non-isometry of the embeddings. To distinguish whether these differences are random training errors or capture real information about the languages, we use the computed distance matrices to construct language phylogenetic trees over 81 Indo-European languages. Careful evaluation shows that our reconstructed trees exhibit strong and statistically-significant similarities to the reference.
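The idea of comparing unlabeled point clouds by shape can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses only 0-dimensional persistent homology (whose death times in a Vietoris-Rips filtration equal the edge lengths of a Euclidean minimum spanning tree), and it compares two clouds with a simplified stand-in distance (the largest gap between sorted death times) rather than a proper diagram distance. The function names `h0_death_times` and `shape_distance` are hypothetical, and the toy clouds are random 2-D points, not real embeddings.

```python
import math
import random


def h0_death_times(points):
    """Death times of 0-dimensional persistence classes for a point cloud.

    Assumption: an edge enters the Vietoris-Rips filtration at a scale equal
    to its Euclidean length. Under that convention the death times are the
    edge lengths of a minimum spanning tree (Kruskal with union-find).
    """
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # two components merge: one class dies
            deaths.append(length)
    return deaths  # n - 1 values in ascending order


def shape_distance(cloud_a, cloud_b):
    """Simplified stand-in for a persistence-diagram distance.

    Compares sorted H0 death times position by position and returns the
    largest gap. Requires clouds of equal size.
    """
    deaths_a = h0_death_times(cloud_a)
    deaths_b = h0_death_times(cloud_b)
    if len(deaths_a) != len(deaths_b):
        raise ValueError("this sketch only compares clouds of equal size")
    return max(abs(a - b) for a, b in zip(deaths_a, deaths_b))


# Toy data: a random cloud, a rotated copy (isometric), a stretched copy (not).
rng = random.Random(0)
cloud = [(rng.random(), rng.random()) for _ in range(30)]
angle = 0.7
rotated = [
    (x * math.cos(angle) - y * math.sin(angle),
     x * math.sin(angle) + y * math.cos(angle))
    for x, y in cloud
]
stretched = [(2.0 * x, y) for x, y in cloud]

print("rotated copy:  ", shape_distance(cloud, rotated))
print("stretched copy:", shape_distance(cloud, stretched))
```

Because the summary depends only on pairwise distances, the rotated copy is indistinguishable from the original up to floating-point error, while the stretched copy yields a positive distance. That is the sense in which such a distance measures a degree of non-isometry. Computing it for every language pair would give a distance matrix of the kind the abstract feeds into tree construction.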

Code Implementations: 1 repo
