Qun Yang

IT
h-index: 46
4 papers
123 citations
Novelty: 45%
AI Score: 35

4 Papers

IT | Dec 14, 2023
LLMind: Orchestrating AI and IoT with LLM for Complex Task Execution

Hongwei Cui, Yuyang Du, Qun Yang et al.

Task-oriented communications are an important element in future intelligent IoT systems. Existing IoT systems, however, are limited in their capacity to handle complex tasks, particularly in their interactions with humans to accomplish these tasks. In this paper, we present LLMind, an LLM-based task-oriented AI agent framework that enables effective collaboration among IoT devices, with humans communicating high-level verbal instructions, to perform complex tasks. Inspired by the functional specialization theory of the brain, our framework integrates an LLM with domain-specific AI modules, enhancing its capabilities. Complex tasks, which may involve collaborations of multiple domain-specific AI modules and IoT devices, are executed through a control script generated by the LLM using a Language-Code transformation approach, which first converts language descriptions to an intermediate finite-state machine (FSM) before final precise transformation to code. Furthermore, the framework incorporates a novel experience accumulation mechanism to enhance response speed and effectiveness, allowing the framework to evolve and become progressively sophisticated through continuing user and machine interactions.
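The intermediate finite-state machine mentioned in the abstract can be pictured with a small sketch. This is only an illustration of the general idea of expressing a verbal instruction as an FSM before turning it into a control script; the states, events, and action names below are hypothetical and are not taken from LLMind.

```python
from typing import Callable, Dict, List, Tuple

# (current_state, event) -> (next_state, action). All names are hypothetical.
Transition = Tuple[str, Callable[[], str]]
FSM = Dict[Tuple[str, str], Transition]


def make_fsm() -> FSM:
    """FSM for the illustrative instruction 'turn on the light when someone enters'."""
    return {
        ("idle", "motion_detected"): ("identifying", lambda: "camera.capture"),
        ("identifying", "person_confirmed"): ("light_on", lambda: "light.on"),
        ("identifying", "no_person"): ("idle", lambda: "noop"),
        ("light_on", "timeout"): ("idle", lambda: "light.off"),
    }


def run_fsm(fsm: FSM, start: str, events: List[str]) -> Tuple[str, List[str]]:
    """Feed events to the FSM and return the final state and the actions taken."""
    state = start
    actions: List[str] = []
    for event in events:
        key = (state, event)
        if key not in fsm:
            continue  # events not defined for the current state are ignored
        state, action = fsm[key]
        actions.append(action())
    return state, actions


final_state, log = run_fsm(
    make_fsm(), "idle", ["motion_detected", "person_confirmed", "timeout"]
)
```

An explicit transition table like this is easy to check for missing or unreachable states before any device-facing code is produced, which is one plausible reason to use an FSM as the intermediate form.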

ASAug 23, 2025

HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation

Sizhe Shan, Qiulin Li, Yutao Cui et al.

Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in video-to-audio generation, including multimodal data scarcity, modality imbalance and limited audio quality in existing methods, we propose HunyuanVideo-Foley, an end-to-end text-video-to-audio framework that synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context. Our approach incorporates three core innovations: (1) a scalable data pipeline curating 100k-hour multimodal datasets through automated annotation; (2) a representation alignment strategy using self-supervised audio features to guide latent diffusion training, efficiently improving audio quality and generation stability; (3) a novel multimodal diffusion transformer resolving modal competition, containing dual-stream audio-video fusion through joint attention, and textual semantic injection via cross-attention. Comprehensive evaluations demonstrate that HunyuanVideo-Foley achieves new state-of-the-art performance across audio fidelity, visual-semantic alignment, temporal alignment and distribution matching. The demo page is available at: https://szczesnys.github.io/hunyuanvideo-foley/.
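The abstract does not state the form of the representation alignment objective. As a rough sketch only, one common way to align a model's intermediate features with features from a self-supervised encoder is a per-frame cosine-similarity loss; the function names and the choice of cosine similarity below are assumptions, not the paper's stated method.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_loss(
    hidden: List[Sequence[float]], target: List[Sequence[float]]
) -> float:
    """Assumed alignment loss: 1 minus the mean per-frame cosine similarity.

    hidden: projected diffusion-model features, one vector per audio frame.
    target: self-supervised audio features for the same frames.
    """
    if not hidden or len(hidden) != len(target):
        raise ValueError("hidden and target need the same non-zero frame count")
    sims = [cosine(h, t) for h, t in zip(hidden, target)]
    return 1.0 - sum(sims) / len(sims)
```

Under this assumption the loss is 0 when every frame's features point in the same direction as the target features and grows toward 2 as they point in opposite directions; in training it would be added to the diffusion objective with a weighting factor.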

SD | May 24, 2025
MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt

Zhichao Wu, Yueteng Kang, Songjun Cao et al.

Most existing Zero-Shot Text-To-Speech (ZS-TTS) systems generate unseen speech based on a single prompt, such as reference speech or a text description, which limits their flexibility. We propose a customized emotion ZS-TTS system based on multi-modal prompts. The system disentangles speech into content, timbre, emotion and prosody, allowing emotion prompts to be provided as text, image or speech. To extract emotion information from different prompts, we propose a multi-modal prompt emotion encoder. Additionally, we introduce a prosody predictor to fit the distribution of prosody and propose an emotion consistency loss to preserve emotion information in the predicted prosody. A diffusion-based acoustic model is employed to generate the target mel-spectrogram. Both objective and subjective experiments demonstrate that our system outperforms existing systems in terms of naturalness and similarity. The samples are available at https://mpetts-demo.github.io/mpetts_demo/.

OH | Apr 25, 2019
A Method for Expressing and Displaying the Vehicle Behavior Distribution in Maintenance Work Zones

Qun Yang, Zhepu Xu, Saravanan Gurupackiam et al.

Maintenance work zones on the road network affect the normal travel of vehicles, which increases the risk of traffic accidents. Traffic characteristic analysis in maintenance work zones is the basis for related research such as layout design, traffic control and safety assessment. Because vehicle microscopic behaviour data are difficult to acquire, traditional traffic characteristic analysis mainly focuses on macroscopic characteristics. With the development of data acquisition technology, it has become much easier to obtain a large amount of microscopic behaviour data, which lays a good foundation for analysing traffic characteristics from a new point of view. This paper puts forward a method for expressing and displaying the vehicle behaviour distribution in maintenance work zones. Large amounts of data can be obtained using portable vehicle microscopic behaviour data acquisition devices. Based on these data, an endpoint detection technique is used to automatically extract the segments of the behaviour data with violent fluctuations, which are the segments where vehicles perform behaviours such as acceleration or turning. Using the support vector machine classification method, the specific behaviour types of the extracted segments can be identified, and together with a data combination method, a total of ten types of behaviours can be identified. Kernel density analysis is then used to cluster the different types of behaviours of all passing vehicles and show their distribution on maps. With this method, how vehicles travel through maintenance work zones and how different vehicle behaviours are distributed within them can be displayed intuitively on maps. This is a novel traffic characteristic and can shed light on maintenance work zone related research such as safety assessment and design methods.
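The endpoint detection step can be illustrated with a minimal sketch. The abstract does not specify the detector, so the version below assumes a simple frame-variance threshold: the signal is cut into fixed-length frames, and runs of consecutive frames whose variance exceeds a threshold are reported as fluctuating segments. The function names, frame length, and threshold are hypothetical.

```python
from typing import List, Sequence, Tuple


def frame_variance(signal: Sequence[float], frame_len: int) -> List[float]:
    """Variance of each non-overlapping frame; a trailing partial frame is dropped."""
    variances: List[float] = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        mean = sum(frame) / frame_len
        variances.append(sum((x - mean) ** 2 for x in frame) / frame_len)
    return variances


def detect_segments(
    signal: Sequence[float], frame_len: int, threshold: float
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices, end exclusive, of high-variance runs."""
    segments: List[Tuple[int, int]] = []
    seg_start = None
    variances = frame_variance(signal, frame_len)
    for i, var in enumerate(variances):
        if var > threshold and seg_start is None:
            seg_start = i * frame_len            # a fluctuating run begins
        elif var <= threshold and seg_start is not None:
            segments.append((seg_start, i * frame_len))  # the run ends
            seg_start = None
    if seg_start is not None:                    # run continues to the last frame
        segments.append((seg_start, len(variances) * frame_len))
    return segments


# Synthetic trace: quiet, then a burst of fluctuation, then quiet again.
trace = [0.0] * 8 + [1.0, -1.0, 1.0, -1.0] * 2 + [0.0] * 8
segments = detect_segments(trace, frame_len=4, threshold=0.5)
```

In a pipeline like the one described, each extracted segment would then be passed to a classifier to label the behaviour type. Note that a variance threshold reacts to fluctuation within a frame, so a sustained constant offset would need a different feature.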