IRAF: Interference-Resilient Adaptive Fusion for Noise-Robust End-to-End Full-Duplex Spoken Dialogue Systems

Tao Zhong, Jiajun Deng, Nikita Kuzmin, Yinke Zhu, Tianxiang Cao, Tristan Tsoi, Zhili Tan, Simon Lui, Xunying Liu

arXiv:2606.06559

Originality: Incremental advance

AI Analysis

For developers of voice agents, this work improves robustness to acoustic interference in real-time spoken dialogue, but the gains are incremental over existing fusion approaches.

The paper addresses interfering speech from other speakers, which degrades end-to-end full-duplex spoken dialogue systems, and proposes IRAF, a lightweight module that adaptively gates the contribution of user audio to the LLM. Experiments show consistent gains in response quality and full-duplex interaction under interfering-speaker conditions.

Full-duplex spoken dialogue models allow voice agents to listen and speak concurrently, enabling natural interaction with real-time overlap. However, end-to-end dual-channel models that jointly encode user and agent streams may degrade in realistic acoustic environments: interfering speakers leaking into the user microphone can be encoded as part of the user query, corrupting the LLM's conditioning and causing unstable turn-taking and reduced response quality. We propose Interference-Resilient Adaptive Fusion (IRAF), a lightweight, streaming-compatible module that modulates the contribution of user audio to the LLM frame by frame. IRAF predicts a scalar reliability gate from target-speaker and user audio embeddings and rescales user representations before fusion with agent embeddings. Experiments on MS-MARCO and InstructS2S-200K show consistent gains in response quality and full-duplex interaction under interfering-speaker conditions.
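
The gating step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract states only that a scalar reliability gate is predicted per frame from target-speaker and user audio embeddings and used to rescale the user representations before fusion with agent embeddings. The linear-plus-sigmoid gate predictor, the additive fusion, and all names (`iraf_gate`, `fuse`, `w`, `b`) and dimensions below are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def iraf_gate(user_emb, spk_emb, w, b):
    """Predict one scalar gate in (0, 1) per frame.

    user_emb: (T, D) user audio embeddings, one row per frame
    spk_emb:  (S,)   target-speaker embedding, shared across frames
    w, b:     weights of a linear gate predictor (assumed form)
    """
    num_frames = user_emb.shape[0]
    # Repeat the speaker embedding so it can be paired with every frame.
    spk = np.broadcast_to(spk_emb, (num_frames, spk_emb.shape[0]))
    feats = np.concatenate([user_emb, spk], axis=1)   # (T, D + S)
    return sigmoid(feats @ w + b)[:, None]            # (T, 1)

def fuse(user_emb, agent_emb, gate):
    """Rescale user frames by the gate, then fuse with agent frames.

    Additive fusion is an assumption; the abstract does not specify
    how the two streams are combined.
    """
    return gate * user_emb + agent_emb

# Toy inputs standing in for encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
T, D, S = 5, 8, 4
user = rng.normal(size=(T, D))
agent = rng.normal(size=(T, D))
spk = rng.normal(size=S)
w = rng.normal(size=D + S) * 0.1
b = 0.0

gate = iraf_gate(user, spk, w, b)
fused = fuse(user, agent, gate)
```

Because each gate value in this sketch depends only on the current frame and the fixed speaker embedding, it can be computed as frames arrive, which matches the streaming-compatible property claimed for the module. A gate near 0 suppresses a user frame judged unreliable, leaving the fused input dominated by the agent stream.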
