60.8 SD Apr 20
VoxSafeBench: Not Just What Is Said, but Who, How, and Where
Yuxiang Wang, Hongyu Liu, Yijiang Xu et al.
For researchers and developers of speech language models, this benchmark exposes a pervasive speech grounding gap where models recognize social norms in text but fail to apply them when cues are grounded in speech.
60.6 SD Apr 13 Code
Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music
Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar et al.
This work advances open-source audio-language models for researchers and practitioners needing robust understanding of speech, sound, and music, with strong real-world generalization.
59.5 AS May 7 Code
WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling
Guanrou Yang, Tian Tan, Qian Chen et al.
For speech AI researchers, WavCube provides a unified representation that bridges the gap between semantic and acoustic features, enabling a single model for both understanding and generation tasks.
57.7 AS Mar 19
How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation
Ke-Han Lu, Szu-Wei Fu, Chao-Han Huck Yang et al.
This work provides empirical grounding for how the auditory knowledge of LLM backbones shapes audio language models, addressing a knowledge gap for researchers and practitioners in audio AI.
49.0 AS Jun 2 Code
WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling
Wenxi Chen, Dongya Jia, Yushen Chen et al.
This work addresses the information loss and non-end-to-end training issues in latent-based TTS by directly modeling raw waveforms, offering a new direction for end-to-end speech generation.
48.6 CL Apr 1 Code
OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
Han Zhu, Lingxuan Ye, Wei Kang et al.
This work addresses the challenge of building a massively multilingual TTS system with broad language coverage, representing a significant advancement rather than an incremental improvement.
48.0 CV Apr 26
Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling
Zhen Ye, Xu Tan, Aoxiong Yin et al.
This work improves the quality and efficiency of talking head synthesis by addressing the suboptimal entanglement of high-level semantics and low-level details in existing joint generation models.
46.3 AS Mar 16
Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness
Jingyu Lu, Yuhan Wang, Fan Zhuo et al.
This work addresses evaluation challenges for spoken dialogue systems, offering a novel benchmark and model to better assess conversational quality, though it is incremental in advancing existing reward modeling approaches.
45.7 SD Jun 3
Audio Interaction Model
Zhifei Xie, Zihang Liu, Ze An et al.
This work addresses the need for a single model that can handle multiple streaming audio tasks (e.g., voice chatting, ASR) in real time, unifying capabilities that were previously separate.
43.1 CL Mar 23 Code
TiCo: Time-Controllable Training for Spoken Dialogue Models
Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu et al.
This addresses a practical limitation for real-world spoken language systems like voice assistants, where controlling response duration can enhance interaction quality, though it is an incremental improvement.
43.1 SD Mar 12
Audio-Language Models for Audio-Centric Tasks: A Systematic Survey
Yi Su, Jisheng Bai, Qisheng Xu et al.
This is an incremental survey that helps researchers and practitioners in audio-centric AI by summarizing existing technologies and providing references for practical applications.
42.2 AS Mar 25 Code
ACAVCaps: Enabling large-scale training for fine-grained and diverse audio understanding
Yadong Niu, Tianzi Wang, Heinrich Dinkel et al.
This addresses the need for better training data to develop versatile large audio-language models for general audio understanding, representing an incremental improvement through a new dataset.
41.9 CV May 28
Benchmarking Single-Factor Physical Video-to-Audio Generation
Tingle Li, Siddharth Gururani, Kevin J. Shih et al.
For researchers in video-to-audio generation, this work highlights the need to move beyond perceptual quality toward learning physical processes from pixels, though it is incremental in proposing a new evaluation benchmark.
41.2 SD May 31 Code
SegTune: Structured and Fine-Grained Control for Song Generation
Yuejiao Wang, Zihao Ji, Pengfei Cai et al.
This work addresses the lack of temporally varying control in neural song generation, providing a method for users to specify local musical attributes aligned with song segments.
40.5 AS May 29
A Unified and Reproducible Experimentation Framework for Speech Understanding
Jing Peng, Junhao Du, Chenghao Wang et al.
This framework significantly improves comparability and reproducibility for researchers and developers working on deployment-oriented speech understanding systems.
39.9 SD Mar 25 Code
Semantic-Aware Interruption Detection in Spoken Dialogue Systems: Benchmark, Metric, and Model
Kangxiang Xia, Bingshen Mu, Xian Shi et al.
This work addresses the problem of natural full-duplex interaction for spoken dialogue systems, offering a significant improvement over existing methods.
38.9 SD Mar 26 Code
MiDashengLM: Efficient Audio Understanding with General Audio Captions
Heinrich Dinkel, Gang Li, Jizhong Liu et al.
This work addresses the need for transparent and reproducible audio-language models for researchers and practitioners, though it is incremental as it builds on existing open-source components and datasets.
38.5 CL Mar 30
An Empirical Recipe for Universal Phone Recognition
Shikhar Bharadwaj, Chin-Jou Li, Kwanghee Choi et al.
This work provides an empirical recipe for universal phone recognition, benefiting multilingual and low-resource speech processing.
37.8 AS Mar 16 Code
SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis
Jiale Qian, Hao Meng, Tian Zheng et al.
This work addresses the need for practical, flexible singing voice synthesis (SVS) systems in real-world production workflows, though it appears incremental in building on existing SVS methods.
37.6 AS May 31 Code
SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models
Chenshuang Zhang, Kyeong Seon Kim, Chengxin Liu et al.
For researchers developing audio-visual LLMs, this work identifies a fundamental limitation in cross-modality understanding between speech and vision, highlighting the need for speech-grounded video comprehension.