eess.AS · Electrical Engineering

Audio & Speech Processing

Speech recognition, audio signal processing

60.8 · SD · Apr 20

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

Yuxiang Wang, Hongyu Liu, Yijiang Xu et al.

For researchers and developers of speech language models, this benchmark exposes a pervasive speech grounding gap where models recognize social norms in text but fail to apply them when cues are grounded in speech.

60.6 · SD · Apr 13 · Code

Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar et al.

This work advances open-source audio-language models for researchers and practitioners needing robust understanding of speech, sound, and music, with strong real-world generalization.

59.5 · AS · May 7 · Code

WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling

Guanrou Yang, Tian Tan, Qian Chen et al.

For speech AI researchers, WavCube provides a unified representation that bridges the gap between semantic and acoustic features, enabling a single model for both understanding and generation tasks.

57.7 · AS · Mar 19

How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation

Ke-Han Lu, Szu-Wei Fu, Chao-Han Huck Yang et al.

For researchers and practitioners in audio AI, this work provides an empirical evaluation of how the auditory knowledge in an LLM backbone shapes the resulting audio language model.

49.0 · AS · Jun 2 · Code

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

Wenxi Chen, Dongya Jia, Yushen Chen et al.

This work addresses the information loss and non-end-to-end training issues in latent-based TTS by directly modeling raw waveforms, offering a new direction for end-to-end speech generation.

48.6 · CL · Apr 1 · Code

OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

Han Zhu, Lingxuan Ye, Wei Kang et al.

This work addresses the challenge of creating a massive multilingual TTS system for broad language coverage, representing a significant advancement rather than an incremental improvement.

48.0 · CV · Apr 26

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

Zhen Ye, Xu Tan, Aoxiong Yin et al.

This work improves the quality and efficiency of talking head synthesis by addressing the suboptimal entanglement of high-level semantics and low-level details in existing joint generation models.

46.3 · AS · Mar 16

Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness

Jingyu Lu, Yuhan Wang, Fan Zhuo et al.

This work addresses evaluation challenges for spoken dialogue systems, offering a novel benchmark and model to better assess conversational quality, though it is incremental in advancing existing reward modeling approaches.

45.7 · SD · Jun 3

Audio Interaction Model

Zhifei Xie, Zihang Liu, Ze An et al.

This work addresses the need for a single model that can handle multiple streaming audio tasks (e.g., voice chatting, ASR) in real time, unifying capabilities that were previously separate.

43.1 · CL · Mar 23 · Code

TiCo: Time-Controllable Training for Spoken Dialogue Models

Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu et al.

This work addresses a practical limitation of real-world spoken language systems such as voice assistants, where controlling response duration can improve interaction quality, though it is an incremental improvement.

43.1 · SD · Mar 12

Audio-Language Models for Audio-Centric Tasks: A Systematic Survey

Yi Su, Jisheng Bai, Qisheng Xu et al.

This is an incremental survey that helps researchers and practitioners in audio-centric AI by summarizing existing technologies and providing references for practical applications.

42.2 · AS · Mar 25 · Code

ACAVCaps: Enabling large-scale training for fine-grained and diverse audio understanding

Yadong Niu, Tianzi Wang, Heinrich Dinkel et al.

This work addresses the need for better training data to develop versatile large audio-language models for general audio understanding, representing an incremental improvement through a new dataset.

41.9 · CV · May 28

Benchmarking Single-Factor Physical Video-to-Audio Generation

Tingle Li, Siddharth Gururani, Kevin J. Shih et al.

For researchers in video-to-audio generation, this work highlights the need to move beyond perceptual quality toward learning physical processes from pixels, though it is incremental in proposing a new evaluation benchmark.

41.2 · SD · May 31 · Code

SegTune: Structured and Fine-Grained Control for Song Generation

Yuejiao Wang, Zihao Ji, Pengfei Cai et al.

This work addresses the lack of temporally varying control in neural song generation, providing a method for users to specify local musical attributes aligned with song segments.

40.5 · AS · May 29

A Unified and Reproducible Experimentation Framework for Speech Understanding

Jing Peng, Junhao Du, Chenghao Wang et al.

This framework significantly improves comparability and reproducibility for researchers and developers working on deployment-oriented speech understanding systems.

39.9 · SD · Mar 25 · Code

Semantic-Aware Interruption Detection in Spoken Dialogue Systems: Benchmark, Metric, and Model

Kangxiang Xia, Bingshen Mu, Xian Shi et al.

This work addresses the problem of natural full-duplex interaction for spoken dialogue systems, offering a significant improvement over existing methods.

38.9 · SD · Mar 26 · Code

MiDashengLM: Efficient Audio Understanding with General Audio Captions

Heinrich Dinkel, Gang Li, Jizhong Liu et al.

This work addresses the need for transparent and reproducible audio-language models for researchers and practitioners, though it is incremental as it builds on existing open-source components and datasets.

38.5 · CL · Mar 30

An Empirical Recipe for Universal Phone Recognition

Shikhar Bharadwaj, Chin-Jou Li, Kwanghee Choi et al.

This work provides an empirical recipe for universal phone recognition, benefiting multilingual and low-resource speech processing.

37.8 · AS · Mar 16 · Code

SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis

Jiale Qian, Hao Meng, Tian Zheng et al.

This work addresses the need for practical, flexible singing voice synthesis (SVS) systems in real-world production workflows, though it appears incremental because it builds on existing SVS methods.

37.6 · AS · May 31 · Code

SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models

Chenshuang Zhang, Kyeong Seon Kim, Chengxin Liu et al.

For researchers developing audio-visual LLMs, this work identifies a fundamental limitation in cross-modality understanding between speech and vision, highlighting the need for speech-grounded video comprehension.