CL | AS | Jun 3

Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention

arXiv:2606.04474
AI Analysis

Identifies and resolves a specific bottleneck in speech LLM reasoning, enabling parity with text LLMs on logical tasks.

Speech LLMs underperform text LLMs on logical reasoning because of entity binding failures, in which continuous speech features disrupt entity-property associations. The proposed Entity-Aware Chain-of-Thought (EA-CoT) intervention bridges this gap, yielding an absolute accuracy improvement of up to 24.4%.

Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this modality gap is not a uniform cognitive deficit. Evaluating three diverse SLLMs, we show speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, syntactic, and factual tasks. However, on logical tasks requiring entity tracking, S2T accuracy collapses to chance. We diagnose this localized degradation as an entity binding failure: continuous speech features cause models to lose precise entity-property associations during implicit reasoning. To resolve this, we propose Entity-Aware Chain-of-Thought (EA-CoT), forcing SLLMs to explicitly enumerate entities and bind them to claims before reasoning. Strikingly, EA-CoT bridges the gap, even when spoken names are misrecognized, yielding up to a 24.4% absolute accuracy improvement. Ablations confirm these gains stem entirely from explicit semantic binding, reframing the gap as a resolvable bottleneck.
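To make the EA-CoT idea concrete, here is a minimal sketch of an instruction that asks a model to enumerate entities and bind them to claims before reasoning. The function name `build_ea_cot_prompt`, the step wording, and the `Answer:` convention are illustrative assumptions and are not taken from the paper. For a speech LLM the question would arrive as audio; a text question is used here only to keep the example self-contained.

```python
def build_ea_cot_prompt(question: str) -> str:
    """Wrap a question in a hypothetical entity-aware chain-of-thought instruction.

    The step wording is an assumption made for illustration; the paper's
    actual prompt is not reproduced here.
    """
    steps = [
        "List every entity mentioned in the input.",
        "For each entity, write down the claims or properties attached to it.",
        "Reason step by step using only the bindings listed above.",
        "State the final answer on its own line, prefixed with 'Answer:'.",
    ]
    # Number the steps so the model handles binding before any reasoning.
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f"{numbered}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_ea_cot_prompt("Alice says Bob is lying. Bob says Carol is lying. Who tells the truth?"))
```

Placing the enumeration and binding steps ahead of the reasoning step mirrors the abstract's claim that the gains come from explicit semantic binding rather than from implicit reasoning over speech features.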

