42.7AIApr 21
UAF: A Unified Audio Front-end LLM for Full-Duplex Speech InteractionYadong Li, Guoxin Wu, Haiping Hou et al.
Full-duplex speech interaction, as the most natural and intuitive mode of human communication, is driving artificial intelligence toward more human-like conversational systems. Traditional cascaded speech processing pipelines suffer from critical limitations, including accumulated latency, information loss, and error propagation across modules. To address these issues, recent efforts focus on the end-to-end audio large language models (LLMs) like GPT-4o, which primarily unify speech understanding and generation task. However, most of these models are inherently half-duplex, and rely on a suite of separate, task-specific front-end components, such as voice activity detection (VAD) and turn-taking detection (TD). In our development of speech assistant, we observed that optimizing the speech front-end is equally crucial as advancing the back-end unified model for achieving seamless, responsive interactions. To bridge this gap, we propose the first unified audio front-end LLM (UAF) tailored for full-duplex speech systems. Our model reformulates diverse audio front-end tasks into a single auto-regressive sequence prediction problem, including VAD, TD, speaker recognition (SR), automatic speech recognition (ASR) and question answer (QA). It takes streaming fixed-duration audio chunk (e.g., 600 ms) as input, leverages a reference audio prompt to anchor the target speaker at the beginning, and regressively generates discrete tokens encoding both semantic content and system-level state controls (e.g., interruption signals). Experiments demonstrate that our model achieves leading performance across multiple audio front-end tasks and significantly enhances response latency and interruption accuracy in real-world interaction scenarios.
CVJul 8, 2025
AdaptaGen: Domain-Specific Image Generation through Hierarchical Semantic Optimization FrameworkSuoxiang Zhang, Xiaxi Li, Hongrui Chang et al.
Domain-specific image generation aims to produce high-quality visual content for specialized fields while ensuring semantic accuracy and detail fidelity. However, existing methods exhibit two critical limitations: First, current approaches address prompt engineering and model adaptation separately, overlooking the inherent dependence between semantic understanding and visual representation in specialized domains. Second, these techniques inadequately incorporate domain-specific semantic constraints during content synthesis, resulting in generation outcomes that exhibit hallucinations and semantic deviations. To tackle these issues, we propose AdaptaGen, a hierarchical semantic optimization framework that integrates matrix-based prompt optimization with multi-perspective understanding, capturing comprehensive semantic relationships from both global and local perspectives. To mitigate hallucinations in specialized domains, we design a cross-modal adaptation mechanism, which, when combined with intelligent content synthesis, enables preserving core thematic elements while incorporating diverse details across images. Additionally, we introduce a two-phase caption semantic transformation during the generation phase. This approach maintains semantic coherence while enhancing visual diversity, ensuring the generated images adhere to domain-specific constraints. Experimental results confirm our approach's effectiveness, with our framework achieving superior performance across 40 categories from diverse datasets using only 16 images per category, demonstrating significant improvements in image quality, diversity, and semantic consistency.