cs.SD · Computer Science

Sound

Audio processing, speech, music

100.0 · SD · Apr 13 · Code

Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar et al.

This work advances open-source audio-language models for researchers and practitioners needing robust understanding of speech, sound, and music, with strong real-world generalization.

99.8 · SD · Mar 16 · Code

EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning

Dingdong Wang, Shujie Liu, Tianhua Zhang et al.

This work addresses the problem of explainable emotion understanding in speech for applications in multimodal AI, representing a novel approach rather than an incremental improvement.

99.6 · SD · Mar 26 · Code

DashengTokenizer: One layer is enough for unified audio understanding and generation

Heinrich Dinkel, Xingwei Sun, Gang Li et al. · apple-ml

This work addresses the need for a single model for both audio understanding and generation, offering a novel paradigm that could simplify audio processing pipelines.

100.0 · AS · Jun 2 · Code

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

Wenxi Chen, Dongya Jia, Yushen Chen et al.

This work addresses the information loss and non-end-to-end training issues in latent-based TTS by directly modeling raw waveforms, offering a new direction for end-to-end speech generation.

99.4 · SD · Mar 25 · Code

Semantic-Aware Interruption Detection in Spoken Dialogue Systems: Benchmark, Metric, and Model

Kangxiang Xia, Bingshen Mu, Xian Shi et al.

This work addresses the problem of natural full-duplex interaction for spoken dialogue systems, offering a significant improvement over existing methods.

99.3 · CV · Mar 25

AVControl: Efficient Framework for Training Audio-Visual Controls

Matan Ben-Yosef, Tavi Halperin, Naomi Ken Korem et al.

This work addresses the need for modular, efficient control in audio-visual generation, offering a significant improvement over monolithic or costly methods.

99.2 · SD · Mar 18

MOSS-TTS Technical Report

Yitian Gong, Botian Jiang, Yiwei Zhao et al.

This work addresses the need for efficient and controllable text-to-speech systems, though it appears incremental as it builds on existing tokenization and transformer methods.

99.1 · SD · Apr 20

Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models

Xiang He, Chenxing Li, Jinting Wang et al.

For researchers in audio-language models, this work addresses the lack of explicit reasoning processes by enabling emergent chain-of-thought reasoning through reinforcement learning, outperforming existing methods on multiple benchmarks.

98.9 · SD · Apr 12

Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing

Zeyue Tian, Binxin Yang, Zhaoyang Liu et al.

This work addresses the lack of a unified framework for audio generation, editing, and understanding, providing a versatile solution that matches specialized models across multiple domains.

98.7 · SD · Mar 29 · Code

EvA: An Evidence-First Audio Understanding Paradigm for LALMs

Xinyuan Xie, Shunian Chen, Zhiheng Liu et al.

For researchers and practitioners in audio understanding, EvA demonstrates that preserving acoustic evidence before reasoning is critical for LALM performance, offering a new paradigm to address the evidence bottleneck.

98.5 · SD · Mar 10

ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA

Aviad Dahan, Moran Yanuka, Noa Kraicer et al. · apple-ml

This work addresses the challenge of synchronizing personalized audio with video for content creators, offering a novel integrated approach rather than an incremental improvement.

98.7 · CL · Apr 10

Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition

Peng Wang, Yanqiao Zhu, Zixuan Jiang et al.

This work addresses the problem of semantic evaluation and human-like interaction in ASR for researchers and practitioners, representing a novel integration of agentic frameworks rather than an incremental improvement.

99.1 · CR · Mar 14 · Code

Sirens' Whisper: Inaudible Near-Ultrasonic Jailbreaks of Speech-Driven LLMs

Zijian Ling, Pingyi Hu, Xiuyong Gao et al.

This work addresses a critical security problem for users of speech-driven LLMs by demonstrating practical black-box attacks that are perceptually undetectable, though it is incremental in that it applies known acoustic techniques to a new domain.

100.0 · ET · May 29

GaMi: Geometry-Agnostic Material Identification via Cross-Modal Subtractive Disentanglement

Zhiwei Chen, Yijie Li, Yimo Zhang et al.

This system addresses the challenge of robust material identification for embodied intelligence by mitigating the effect of geometric variations, an incremental improvement for robotics and human-computer interaction.

98.3 · SD · Mar 31

Audio Hallucination Attacks: Probing the Reliability of Large Audio Language Models

Ashish Seth, Sonal Kumar, Ramaneswaran Selvakumar et al.

This work addresses a critical reliability gap for users of audio-language AI systems, exposing vulnerabilities that standard benchmarks miss, though it is incremental in that it builds on existing attack and mitigation frameworks.

98.4 · AS · Mar 30 · Code

ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining

Anuj Diwan, Eunsol Choi, David Harwath

This work addresses the need for richer stylistic language-audio pretraining in speech processing, improving on existing models that handle only a narrow set of style descriptors, though it appears incremental in that it extends contrastive learning to more style dimensions.

98.1 · SD · May 9 · Code

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

Tao Yu, Yiming Ding, Shenghua Chai et al.

For researchers developing multimodal agents, this benchmark highlights a critical gap in audio-driven cross-modal search and reasoning capabilities.

97.9 · SD · Apr 20

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

Yuxiang Wang, Hongyu Liu, Yijiang Xu et al.

For researchers and developers of speech language models, this benchmark exposes a pervasive speech grounding gap where models recognize social norms in text but fail to apply them when cues are grounded in speech.

98.3 · CR · May 18

Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models

Yanyun Wang, Yu Huang, Zi Liang et al.

This work identifies a fundamental vulnerability in the cross-modal safety alignment of LALMs that enables universal jailbreaks without instance-specific optimization, a finding critical to the security of multimodal AI systems.

97.7 · SD · Apr 17 · Code

VoxMind: An End-to-End Agentic Spoken Dialogue System

Tianle Liang, Yifu Chen, Shengpeng Ji et al.

This work addresses the need for tool-augmented reasoning in end-to-end spoken dialogue models, enabling them to handle complex real-world tasks.