CL | AI | Jan 9

On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation

MIT
arXiv:2601.06329v1 | 1 citation | h-index: 11
Originality: Incremental advance
AI Analysis

This work addresses a critical evaluation issue for researchers and developers in spoken language modeling, though it is incremental as it focuses on improving existing metrics rather than introducing new models.

The authors show that global token perplexity, the metric commonly used to evaluate generative spoken language models, fails to capture speech-specific characteristics and can therefore misjudge model quality. They propose likelihood- and generation-based evaluation methods that correlate more strongly with human-rated mean opinion scores (MOS). Under these metrics, the gap between the best-performing model and the human topline is significantly smaller than global token perplexity suggests.

Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preserving attributes like speaker and emotion, serving as foundation models for spoken dialogue. In prior literature, these models are often evaluated using "global token perplexity", which directly applies the text perplexity formulation to speech tokens. However, this practice overlooks fundamental differences between speech and text modalities, possibly leading to an underestimation of the speech characteristics. In this work, we propose a variety of likelihood- and generative-based evaluation methods that serve in place of naive global token perplexity. We demonstrate that the proposed evaluations more faithfully reflect perceived generation quality, as evidenced by stronger correlations with human-rated mean opinion scores (MOS). When assessed under the new metrics, the relative performance landscape of spoken language models is reshaped, revealing a significantly reduced gap between the best-performing model and the human topline. Together, these results suggest that appropriate evaluation is critical for accurately assessing progress in spoken language modeling.
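
To make the criticized baseline concrete, here is a minimal sketch of what "directly applying the text perplexity formulation to speech tokens" could look like. It assumes that "global" means pooling the per-token log-probabilities of every utterance into a single average before exponentiating. The abstract does not spell out the exact definition, so the function name `global_token_perplexity` and the input layout (one list of natural-log probabilities per utterance) are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Sequence


def global_token_perplexity(token_logprobs: Sequence[Sequence[float]]) -> float:
    """Standard text-style perplexity, pooled over all speech tokens.

    token_logprobs: one sequence per utterance, each holding the natural-log
    probability a model assigned to every discrete speech token (hypothetical
    input layout, chosen for illustration).
    """
    total_logprob = 0.0
    total_tokens = 0
    for utterance in token_logprobs:
        total_logprob += sum(utterance)
        total_tokens += len(utterance)
    if total_tokens == 0:
        raise ValueError("need at least one token to compute perplexity")
    # exp of the negative mean log-probability over every token in the corpus
    return math.exp(-total_logprob / total_tokens)
```

Because every token carries equal weight in the pooled average, a formulation like this has no notion of which tokens encode content and which encode speaker or emotion, which is the kind of modality mismatch the abstract points to.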

