SD AI CL MA AS
Jan 14

Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception

Zhen Wan, Chao-Han Huck Yang, Jinchuan Tian, Hanrong Ye, Ankita Pasad, Szu-wei Fu, Arushi Goel, Ryo Hachiuma, Shizhe Diao, Kunal Dhawan, Sreyan Ghosh, Yusuke Hirota

arXiv:2601.09413v14.01 citations
h-index: 18

Originality: Highly original

AI Analysis

This work addresses the reliability of audio intelligence systems by preventing performance degradation in multi-task learning, offering a practical solution for more resilient speech and audio reasoning applications.

The paper tackles the performance degradation that occurs when an omni-model is fine-tuned on both speech recognition and external sound understanding tasks. It introduces a voice-agentic framework that learns self-reflection: deciding when to trust its own output and when to rely on external audio perception. It reports outperforming strong baselines by 12.1% WER across seven OpenASR leaderboard benchmarks and 77.37% accuracy on audio QA decisions, indicating robust generalization.

We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective in preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. Across the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines by 12.1% WER on seven benchmarks. The model also achieves 77.37% accuracy and high F1 on audio QA decisions, showing robust generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, our work offers a practical path toward more reliable and resilient audio intelligence.
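The abstract's two central ideas, an explicit choice between the model's own hypothesis and an external candidate, and evaluation by word error rate (WER), can be illustrated with a minimal sketch. This is not the authors' implementation: the action labels `trust_self` and `trust_external` and the function `select_transcript` are hypothetical names chosen for illustration, and the example sentences are made up. Only the WER definition (word-level edit distance divided by reference length) is standard.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def select_transcript(action: str, internal: str, external: str) -> str:
    """Return the transcript chosen by a reflection decision.

    The action labels are hypothetical placeholders, not the paper's tokens.
    """
    if action == "trust_self":
        return internal
    if action == "trust_external":
        return external
    raise ValueError(f"unknown action: {action!r}")


# Made-up example: the external candidate is noisy, so trusting it hurts WER.
reference = "turn the lights off"
internal = "turn the lights off"
external = "turn the light of"

chosen = select_transcript("trust_self", internal, external)
print(wer(reference, chosen))    # WER when the model trusts itself
print(wer(reference, external))  # WER if it had followed the noisy candidate
```

In this toy case the decision is passed in by hand; in the framework described above it would be predicted by the model itself, which is the part the sketch deliberately leaves out.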
