CL AT, Mar 30, 2024

The Shape of Word Embeddings: Quantifying Non-Isometry With Topological Data Analysis

arXiv:2404.00500v214.127 citations, h-index: 3, Has Code, EMNLP

Originality: Incremental advance

AI Analysis

This work analyzes relationships between languages through the shape of their embedding clouds. The approach is new to computational linguistics, but the advance is incremental: it applies existing topological data analysis (TDA) tools to a new domain.

The paper quantifies non-isometry in word embeddings using persistent homology from topological data analysis, measuring distances between languages from the shape of their embedding clouds. The resulting distance matrices are used to reconstruct phylogenetic trees for 81 Indo-European languages, and these trees show strong, statistically significant similarities to reference trees.

Word embeddings represent language vocabularies as clouds of $d$-dimensional points. We investigate how information is conveyed by the general shape of these clouds, instead of representing the semantic meaning of each token. Specifically, we use the notion of persistent homology from topological data analysis (TDA) to measure the distances between language pairs from the shape of their unlabeled embeddings. These distances quantify the degree of non-isometry of the embeddings. To distinguish whether these differences are random training errors or capture real information about the languages, we use the computed distance matrices to construct language phylogenetic trees over 81 Indo-European languages. Careful evaluation shows that our reconstructed trees exhibit strong and statistically-significant similarities to the reference.
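To make the idea of a shape-based, label-free distance concrete, here is a minimal sketch in pure Python. It is not the paper's implementation. It assumes only 0-dimensional persistent homology of a Vietoris-Rips filtration, whose death times equal the edge lengths of a minimum spanning tree. It also swaps in a simplified comparison (the largest gap between sorted death times) for a proper persistence-diagram metric such as the bottleneck distance. The function names `h0_death_times`, `shape_distance`, and `distance_matrix`, and the toy clouds, are hypothetical.

```python
import math


def h0_death_times(points):
    """Sorted death times of 0-dimensional persistence classes.

    For a Vietoris-Rips filtration these equal the edge lengths of a
    minimum spanning tree, computed here with Prim's algorithm.
    """
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge from the tree to each point
    best[0] = 0.0
    deaths = []
    for step in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if step > 0:  # the starting point joins the tree without an edge
            deaths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return sorted(deaths)


def shape_distance(cloud_a, cloud_b):
    """Simplified proxy (assumption): largest gap between sorted death times.

    The shorter list is padded with zeros. This is NOT the bottleneck or
    Wasserstein distance; it only illustrates comparing unlabeled shapes.
    """
    da, db = h0_death_times(cloud_a), h0_death_times(cloud_b)
    size = max(len(da), len(db))
    da = [0.0] * (size - len(da)) + da
    db = [0.0] * (size - len(db)) + db
    return max((abs(x - y) for x, y in zip(da, db)), default=0.0)


def distance_matrix(clouds):
    """Pairwise shape distances for a dict mapping a name to a point cloud."""
    names = sorted(clouds)
    return {a: {b: shape_distance(clouds[a], clouds[b]) for b in names}
            for a in names}


def rotate(points, theta):
    """Rotate 2-D points about the origin (an isometry)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


# Toy "embeddings": a unit square, a rotated copy, and a stretched copy.
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
rotated = rotate(square, 0.7)
stretched = [(2.0 * x, y) for x, y in square]
matrix = distance_matrix({"square": square, "rotated": rotated,
                          "stretched": stretched})
```

In this sketch the rotated copy sits at essentially zero distance from the original, because rotation is an isometry and preserves all pairwise distances, while the stretched copy does not. A matrix like `matrix` could then be passed to a standard tree-building method (for example hierarchical clustering) to obtain a tree over the inputs, which is the role the language distance matrices play in the abstract.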

