AS CL Jun 4

Revisiting Lexicon Evaluation in Unsupervised Word Discovery

Simon Malan, Danel Slabbert, Herman Kamper

arXiv:2606.0618356.2

AI Analysis

For researchers in zero-resource speech processing, this work provides more reliable evaluation metrics for lexicon quality, addressing a methodological flaw in existing evaluations.

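The paper identifies a bias in normalized edit distance, the metric commonly used to evaluate unsupervised word discovery lexicons: it favors the quality of large clusters and ignores how true words are spread across clusters. It proposes two new metrics that address these shortcomings. Experiments show that, combined, the proposed metrics correlate more closely with how similar a lexicon is to the ground-truth distribution and are more robust to evaluation biases.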

Building a lexicon from discovered word-like units is a central goal in zero-resource speech processing. But do our evaluations provide a trustworthy indication of lexicon quality? A common metric, normalized edit distance, averages the phoneme edit distances between discovered units in each cluster. We show that this metric has an inherent bias toward the quality of large clusters, inhibiting fair evaluation. Moreover, it ignores how well true classes are distributed across clusters. Based on established theory in clustering literature, we propose two metrics that address these shortcomings: a modified metric that weighs cluster size when assessing within-cluster consistency, and an inverse metric that assesses how true words are spread across clusters. Through experiments on synthetic and real-world lexicons, we demonstrate that combined, these metrics are: (1) more closely correlated with how similar a lexicon is to the ground-truth distribution, and (2) more robust to biases that skew lexicon evaluations.
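The large-cluster bias can be seen in a small sketch. The code below is an illustration under stated assumptions, not the paper's definitions: it assumes the common metric pools all within-cluster pairs before averaging, and it uses a hypothetical size-weighted variant (each cluster's mean distance weighted by its number of units) as a stand-in for the modified metric. The function names and the toy phoneme sequences are invented for the example.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution or match
        prev = curr
    return prev[-1]


def ned(a, b):
    """Edit distance normalized by the length of the longer sequence."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def pair_distances(cluster):
    """Normalized edit distances for every pair of units in one cluster."""
    return [ned(a, b) for a, b in combinations(cluster, 2)]


def pooled_ned(clusters):
    """Assumed form of the common metric: average over all within-cluster pairs."""
    distances = [d for c in clusters for d in pair_distances(c)]
    return sum(distances) / len(distances)


def size_weighted_ned(clusters):
    """Hypothetical variant: per-cluster mean, weighted by cluster size."""
    scored = [(len(c), pair_distances(c)) for c in clusters if len(c) > 1]
    total = sum(n for n, _ in scored)
    return sum(n * sum(d) / len(d) for n, d in scored) / total


# Toy lexicon: one large, perfectly consistent cluster and one small,
# completely inconsistent cluster.
big = [("k", "ae", "t")] * 10
small = [("d", "ao", "g"), ("k", "ae", "t")]
clusters = [big, small]

pooled = pooled_ned(clusters)            # 45 pairs at 0.0, 1 pair at 1.0
weighted = size_weighted_ned(clusters)   # (10 * 0.0 + 2 * 1.0) / 12
```

A cluster of n units contributes n(n-1)/2 pairs, so in the pooled average the ten-unit cluster supplies 45 of the 46 pairs and the bad two-unit cluster is nearly invisible (pooled score 1/46). Weighting by cluster size instead gives 2/12, so the small cluster's errors register in proportion to the units it holds.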
