CLAP Oct 23, 2023

Characterizing how 'distributional' NLP corpora distance metrics are

IBM
arXiv:2310.14829v1
h-index: 18
Originality: Incremental advance
AI Analysis

This work addresses a methodological gap for NLP and machine learning researchers who need to compare corpora, though the advance is incremental, refining how existing metrics are assessed.

The paper asks how well a distance metric captures the overall distributional difference between two text corpora. It proposes a method to quantify this 'distributionality' and identifies Average Hausdorff Distance as a representative non-distributional metric and energy distance as a representative distributional one.

A corpus of vector-embedded text documents has some empirical distribution. Given two corpora, we want to calculate a single metric of distance (e.g., Mauve, Fréchet Inception Distance) between them. We describe an abstract quality, called 'distributionality', of such metrics. A non-distributional metric tends to use very local measurements, or uses global measurements in a way that does not fully reflect the distributions' true distance. For example, if individual pairwise nearest-neighbor distances are low, it may judge the two corpora to have low distance, even if their two distributions are in fact far from each other. A more distributional metric will, in contrast, better capture the distributions' overall distance. We quantify this quality by constructing a Known-Similarity Corpora set from two paraphrase corpora and calculating the distance between paired corpora from it. The shape of the trend in these distances, as the separation between set elements increases, should quantify the distributionality of the metric. We propose that Average Hausdorff Distance and energy distance between corpora are representative examples of non-distributional and distributional distance metrics, to which other metrics can be compared to evaluate how distributional they are.
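The contrast drawn in the abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes Euclidean distance between embeddings, one common variant of Average Hausdorff Distance (the mean of the two directed average nearest-neighbor distances), and the squared, V-statistic form of energy distance. The function names and the mixing scheme in `known_similarity_corpora` are hypothetical and chosen only for illustration.

```python
import numpy as np

def pairwise_dist(a, b):
    # Euclidean distance between every row of a and every row of b.
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def average_hausdorff(a, b):
    # Assumed variant: mean of the two directed average
    # nearest-neighbor distances (a -> b and b -> a).
    d = pairwise_dist(a, b)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

def energy_distance_sq(a, b):
    # Squared energy distance, V-statistic form:
    # 2 E|X - Y| - E|X - X'| - E|Y - Y'|.
    return (2 * pairwise_dist(a, b).mean()
            - pairwise_dist(a, a).mean()
            - pairwise_dist(b, b).mean())

def known_similarity_corpora(a, b, k, n, rng):
    # Hypothetical mixing scheme: corpus i holds round(n * i / k)
    # documents drawn from b and the rest from a, for i = 0..k.
    # Requires n <= len(a) and n <= len(b).
    corpora = []
    for i in range(k + 1):
        n_b = round(n * i / k)
        idx_a = rng.permutation(len(a))[: n - n_b]
        idx_b = rng.permutation(len(b))[:n_b]
        corpora.append(np.vstack([a[idx_a], b[idx_b]]))
    return corpora

# Toy 1-D "corpora" with the same support but very different mass:
# nine points at 0 and one at 10, versus one at 0 and nine at 10.
toy_a = np.array([[0.0]] * 9 + [[10.0]])
toy_b = np.array([[0.0]] + [[10.0]] * 9)

ahd = average_hausdorff(toy_a, toy_b)      # 0.0: every point has an exact neighbor
energy = energy_distance_sq(toy_a, toy_b)  # approximately 12.8

# Distance trend over a small Known-Similarity Corpora set.
rng = np.random.default_rng(0)
src_a = rng.normal(0.0, 1.0, size=(200, 8))
src_b = rng.normal(1.0, 1.0, size=(200, 8))
ksc = known_similarity_corpora(src_a, src_b, k=4, n=100, rng=rng)
trend = [energy_distance_sq(ksc[0], ksc[j]) for j in range(len(ksc))]
```

In the toy example every point has an exact nearest neighbor in the other corpus, so the Average Hausdorff variant reports zero distance even though the mass sits at opposite ends, while the energy distance is clearly positive. The `trend` list is the kind of distance-versus-separation curve whose shape the abstract proposes to examine.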

Code Implementations: 1 repo
