Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar et al.
This work advances open-source audio-language models for researchers and practitioners needing robust understanding of speech, sound, and music, with strong real-world generalization.
Audio processing, speech, music
Dingdong Wang, Shujie Liu, Tianhua Zhang et al.
This work addresses the problem of explainable emotion understanding in speech for applications in multimodal AI, representing a novel approach rather than an incremental improvement.
Heinrich Dinkel, Xingwei Sun, Gang Li et al. · apple-ml
This work addresses the need for a single model for both audio understanding and generation, offering a novel paradigm that could simplify audio processing pipelines.
Wenxi Chen, Dongya Jia, Yushen Chen et al.
This work addresses the information loss and non-end-to-end training issues in latent-based TTS by directly modeling raw waveforms, offering a new direction for end-to-end speech generation.
Kangxiang Xia, Bingshen Mu, Xian Shi et al.
This work addresses the problem of natural full-duplex interaction for spoken dialogue systems, offering a significant improvement over existing methods.
Matan Ben-Yosef, Tavi Halperin, Naomi Ken Korem et al.
This addresses the need for modular and efficient control in audio-visual generation for researchers and practitioners, offering a significant improvement over monolithic or costly methods.
Yitian Gong, Botian Jiang, Yiwei Zhao et al.
This work addresses the need for efficient and controllable text-to-speech systems, though it appears incremental as it builds on existing tokenization and transformer methods.
Xiang He, Chenxing Li, Jinting Wang et al.
For researchers in audio-language models, this work addresses the lack of explicit reasoning processes by enabling emergent chain-of-thought reasoning through reinforcement learning, outperforming existing methods on multiple benchmarks.
Zeyue Tian, Binxin Yang, Zhaoyang Liu et al.
This work addresses the lack of a unified framework for audio generation, editing, and understanding, providing a versatile solution that matches specialized models across multiple domains.
Xinyuan Xie, Shunian Chen, Zhiheng Liu et al.
For researchers and practitioners in audio understanding, EvA demonstrates that preserving acoustic evidence before reasoning is critical for LALM performance, offering a new paradigm to address the evidence bottleneck.
Aviad Dahan, Moran Yanuka, Noa Kraicer et al. · apple-ml
It addresses the challenge of synchronizing personalized audio with video for content creators, offering a novel integrated approach rather than incremental improvements.
Peng Wang, Yanqiao Zhu, Zixuan Jiang et al.
This work addresses the problem of semantic evaluation and human-like interaction in ASR for researchers and practitioners, representing a novel integration of agentic frameworks rather than an incremental improvement.
Zijian Ling, Pingyi Hu, Xiuyong Gao et al.
This addresses a critical security problem for users of speech-driven LLMs by demonstrating practical, black-box attacks that are perceptually undetectable, though it is incremental in applying known acoustic techniques to a new domain.
Zhiwei Chen, Yijie Li, Yimo Zhang et al.
This system addresses the challenge of robust material identification for embodied intelligence by mitigating geometric variations, which is an incremental improvement for robotics and human-computer interaction.
Ashish Seth, Sonal Kumar, Ramaneswaran Selvakumar et al.
This addresses a critical reliability gap for users of audio-language AI systems, exposing vulnerabilities that standard benchmarks miss, though it is incremental as it builds on existing attack and mitigation frameworks.
Anuj Diwan, Eunsol Choi, David Harwath
This addresses the need for richer stylistic language-audio pretraining in speech processing, offering improvements over existing models that handle only a narrow set of descriptors, though it appears incremental in extending contrastive learning to more style dimensions.
Tao Yu, Yiming Ding, Shenghua Chai et al.
For researchers developing multimodal agents, this benchmark highlights a critical gap in audio-driven cross-modal search and reasoning capabilities.
Yuxiang Wang, Hongyu Liu, Yijiang Xu et al.
For researchers and developers of speech language models, this benchmark exposes a pervasive speech grounding gap where models recognize social norms in text but fail to apply them when cues are grounded in speech.
Yanyun Wang, Yu Huang, Zi Liang et al.
This work identifies a fundamental vulnerability in cross-modal safety alignment of LALMs, enabling universal jailbreak without instance-specific optimization, which is critical for security of multimodal AI systems.
Tianle Liang, Yifu Chen, Shengpeng Ji et al.
This work addresses the need for tool-augmented reasoning in end-to-end spoken dialogue models, enabling them to handle complex real-world tasks.