h-index5
17papers
279citations
Novelty54%
AI Score55

17 Papers

SESep 30, 2024Code
LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation

Ziyao Zhang, Yanlin Wang, Chong Wang et al.

Code generation aims to automatically generate code from input requirements, significantly enhancing development efficiency. Recent large language models (LLMs) based approaches have shown promising results and revolutionized code generation task. Despite the promising performance, LLMs often generate contents with hallucinations, especially for the code generation scenario requiring the handling of complex contextual dependencies in practical development process. Although previous study has analyzed hallucinations in LLM-powered code generation, the study is limited to standalone function generation. In this paper, we conduct an empirical study to study the phenomena, mechanism, and mitigation of LLM hallucinations within more practical and complex development contexts in repository-level generation scenario. First, we manually examine the code generation results from six mainstream LLMs to establish a hallucination taxonomy of LLM-generated code. Next, we elaborate on the phenomenon of hallucinations, analyze their distribution across different models. We then analyze causes of hallucinations and identify four potential factors contributing to hallucinations. Finally, we propose an RAG-based mitigation method, which demonstrates consistent effectiveness in all studied LLMs. The replication package including code, data, and experimental results is available at https://github.com/DeepSoftwareAnalytics/LLMCodingHallucination

IVAug 5, 2022
Multimodal Brain Disease Classification with Functional Interaction Learning from Single fMRI Volume

Wei Dai, Ziyao Zhang, Lixia Tian et al.

In neuroimaging analysis, fMRI can well assess the function changes for brain diseases with no obvious structural lesions. To date, most deep-learning-based fMRI studies have employed functional connectivity (FC) as the basic feature for disease classification. However, FC is calculated on time series of predefined regions of interest and neglects detailed information contained in each voxel. Another drawback of using FC is the limited sample size for the training of deep models. The low representation ability of FC leads to poor performance in clinical practice, especially when dealing with multimodal medical data involving multiple types of visual signals and textual records for brain diseases. To overcome this bottleneck problem in the fMRI feature modality, we propose BrainFormer, an end-to-end functional interaction learning method for brain disease classification with single fMRI volume. Unlike traditional deep learning methods that construct convolution and transformers on FC, BrainFormer learns the functional interaction from fMRI signals, by modeling the local cues within each voxel with 3D convolutions and capturing the global correlations among distant regions with specially designed global attention mechanisms from shallow layers to deep layers. Meanwhile, BrainFormer can deal with multimodal medical data including fMRI volume, structural MRI, FC features and phenotypic data to achieve more comprehensive brain disease diagnosis. We evaluate BrainFormer on five independent multi-site datasets on autism, Alzheimer's disease, depression, attention deficit hyperactivity disorder and headache disorders. The results demonstrate its effectiveness and generalizability for multiple brain diseases diagnosis with multimodal features. BrainFormer may promote precision of neuroimaging-based diagnosis in clinical practice and motivate future studies on fMRI analysis.

CVFeb 23Code
ApET: Approximation-Error Guided Token Compression for Efficient VLMs

Qiankun Ma, Ziyao Zhang, Haofei Wang et al.

Recent Vision-Language Models (VLMs) have demonstrated remarkable multimodal understanding capabilities, yet the redundant visual tokens incur prohibitive computational overhead and degrade inference efficiency. Prior studies typically relies on [CLS] attention or text-vision cross-attention to identify and discard redundant visual tokens. Despite promising results, such solutions are prone to introduce positional bias and, more critically, are incompatible with efficient attention kernels such as FlashAttention, limiting their practical deployment for VLM acceleration. In this paper, we step away from attention dependencies and revisit visual token compression from an information-theoretic perspective, aiming to maximally preserve visual information without any attention involvement. We present ApET, an Approximation-Error guided Token compression framework. ApET first reconstructs the original visual tokens with a small set of basis tokens via linear approximation, then leverages the approximation error to identify and drop the least informative tokens. Extensive experiments across multiple VLMs and benchmarks demonstrate that ApET retains 95.2% of the original performance on image-understanding tasks and even attains 100.4% on video-understanding tasks, while compressing the token budgets by 88.9% and 87.5%, respectively. Thanks to its attention-free design, ApET seamlessly integrates with FlashAttention, enabling further inference acceleration and making VLM deployment more practical. Code is available at https://github.com/MaQianKun0/ApET.

ASJul 4, 2022
Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS)

Ariadna Sanchez, Alessio Falai, Ziyao Zhang et al.

An essential design decision for multilingual Neural Text-To-Speech (NTTS) systems is how to represent input linguistic features within the model. Looking at the wide variety of approaches in the literature, two main paradigms emerge, unified and separate representations. The former uses a shared set of phonetic tokens across languages, whereas the latter uses unique phonetic tokens for each language. In this paper, we conduct a comprehensive study comparing multilingual NTTS systems models trained with both representations. Our results reveal that the unified approach consistently achieves better cross-lingual synthesis with respect to both naturalness and accent. Separate representations tend to have an order of magnitude more tokens than unified ones, which may affect model capacity. For this reason, we carry out an ablation study to understand the interaction of the representation type with the size of the token embedding. We find that the difference between the two paradigms only emerges above a certain threshold embedding size. This study provides strong evidence that unified representations should be the preferred paradigm when building multilingual NTTS systems.

ASJul 4, 2022
Mix and Match: An Empirical Study on Training Corpus Composition for Polyglot Text-To-Speech (TTS)

Ziyao Zhang, Alessio Falai, Ariadna Sanchez et al.

Training multilingual Neural Text-To-Speech (NTTS) models using only monolingual corpora has emerged as a popular way for building voice cloning based Polyglot NTTS systems. In order to train these models, it is essential to understand how the composition of the training corpora affects the quality of multilingual speech synthesis. In this context, it is common to hear questions such as "Would including more Spanish data help my Italian synthesis, given the closeness of both languages?". Unfortunately, we found existing literature on the topic lacking in completeness in this regard. In the present work, we conduct an extensive ablation study aimed at understanding how various factors of the training corpora, such as language family affiliation, gender composition, and the number of speakers, contribute to the quality of Polyglot synthesis. Our findings include the observation that female speaker data are preferred in most scenarios, and that it is not always beneficial to have more speakers from the target language variant in the training corpus. The findings herein are informative for the process of data procurement and corpora building.

ASSep 15, 2023
Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech

Dariusz Piotrowski, Renard Korzeniowski, Alessio Falai et al.

In this work, we introduce a framework for cross-lingual speech synthesis, which involves an upstream Voice Conversion (VC) model and a downstream Text-To-Speech (TTS) model. The proposed framework consists of 4 stages. In the first two stages, we use a VC model to convert utterances in the target locale to the voice of the target speaker. In the third stage, the converted data is combined with the linguistic features and durations from recordings in the target language, which are then used to train a single-speaker acoustic model. Finally, the last stage entails the training of a locale-independent vocoder. Our evaluations show that the proposed paradigm outperforms state-of-the-art approaches which are based on training a large multilingual TTS model. In addition, our experiments demonstrate the robustness of our approach with different model architectures, languages, speakers and amounts of data. Moreover, our solution is especially beneficial in low-resource settings.

84.4SEApr 16
SWE-TRACE: Optimizing Long-Horizon SWE Agents Through Rubric Process Reward Models and Heuristic Test-Time Scaling

Hao Han, Jin Xie, Xuehao Ma et al.

Resolving real-world software engineering (SWE) issues with autonomous agents requires complex, long-horizon reasoning. Current pipelines are bottlenecked by unoptimized demonstration data, sparse execution rewards, and computationally prohibitive inference scaling, which collectively exacerbate token bloat, reward hacking, and policy degradation. We present SWE-TRACE (Trajectory Reduction and Agentic Criteria Evaluation), a unified framework optimizing the SWE agent lifecycle across data curation, reinforcement learning (RL), and test-time inference. First, we introduce an LLM multi-task cascading method, utilizing stepwise oracle verification to distill a 60K-instance Supervised Fine-Tuning (SFT) corpus strictly biased toward token-efficient, shortest-path trajectories. Second, to overcome the instability of sparse outcome rewards, we design a MemoryAugmented Agentic RL pipeline featuring a Rubric-Based Process Reward Model (PRM). An auxiliary Rubric-Agent provides dense, fine-grained heuristic feedback on intermediate steps, guiding the model through long-horizon tasks. Finally, we bridge training and inference by repurposing the PRM for heuristic-guided Test-Time Scaling (TTS). By dynamically evaluating and pruning action candidates at each step, SWE-TRACE achieves superior search efficiency without the latency overhead of standard parallel sampling. Extensive experiments on standard SWE benchmarks demonstrate that SWE-TRACE significantly advances the state-of-the-art, maximizing resolution rates while drastically reducing both token consumption and inference latency.

43.1HCMar 29
PACEE: Parent-Centered AI Scaffolding for Emotion Education in Early Childhood Conversations

Yu Mei, Xutong Wang, Ziyao Zhang et al.

Emotion education is critical for children aged 3 to 6. However, existing technologies largely focus on children's direct interaction with AI, overlooking the central role of parents in guiding early emotional development at home. To address this gap, we conducted co-design sessions with five kindergarten teachers and five parents to identify key parental challenges and opportunities for AI support in family emotion education. Based on these insights, we developed PACEE, an LLM-based assistant designed to support parents in guiding children's emotional development through conversations, rather than directly interacting with children. PACEE provides parent-centered AI scaffolding that supports parents in real-time conversation through personalized guidance, post-hoc reflection through trackable feedback, and understanding children's emotional states through modeling. We evaluated PACEE with 16 families. Results show that PACEE enhances parent-child engagement, fosters deeper emotional communication, and improves parents' expertise and overall experience in guiding their children. Our findings extend emotion coaching practices to the context of generative AI and offer design insights for building AI systems that support parent-centered family education.

44.4HCMar 29
Adapting AI to the Moment: Understanding the Dynamics of Parent-AI Collaboration Modes in Real-Time Conversations with Children

Yu Mei, Ziyao Zhang, Qingyang Wan et al.

Parent-AI collaboration to support real-time conversations with children is challenging due to the sensitivity and open-ended nature of such interactions. Existing systems often simplify collaboration into static modes, providing limited support for adapting AI to continuously evolving conversational contexts. To address this gap, we systematically investigate the dynamics of parent-AI collaboration modes in real-time conversations with children. We conducted a co-design study with eight parents and developed COMPASS, a research probe that enables flexible combinations of parental support functions during conversations. Using COMPASS, we conducted a lab-based study with 21 parent-child pairs. We show that parent-AI collaboration unfolds through evolving modes that adapt systematically to contextual factors. We further identify three types of parental strategies--parent-oriented, child-oriented, and relationship-oriented--that shape how parents engage with AI. These findings advance the understanding of dynamic human-AI collaboration in relational, high-stakes settings and inform the design of flexible, context-adaptive parental support systems.

CVFeb 11
Multimodal Priors-Augmented Text-Driven 3D Human-Object Interaction Generation

Yin Wang, Ziyao Zhang, Zhiying Leng et al.

We address the challenging task of text-driven 3D human-object interaction (HOI) motion generation. Existing methods primarily rely on a direct text-to-HOI mapping, which suffers from three key limitations due to the significant cross-modality gap: (Q1) sub-optimal human motion, (Q2) unnatural object motion, and (Q3) weak interaction between humans and objects. To address these challenges, we propose MP-HOI, a novel framework grounded in four core insights: (1) Multimodal Data Priors: We leverage multimodal data (text, image, pose/object) from large multimodal models as priors to guide HOI generation, which tackles Q1 and Q2 in data modeling. (2) Enhanced Object Representation: We improve existing object representations by incorporating geometric keypoints, contact features, and dynamic properties, enabling expressive object representations, which tackles Q2 in data representation. (3) Multimodal-Aware Mixture-of-Experts (MoE) Model: We propose a modality-aware MoE model for effective multimodal feature fusion paradigm, which tackles Q1 and Q2 in feature fusion. (4) Cascaded Diffusion with Interaction Supervision: We design a cascaded diffusion framework that progressively refines human-object interaction features under dedicated supervision, which tackles Q3 in interaction refinement. Comprehensive experiments demonstrate that MP-HOI outperforms existing approaches in generating high-fidelity and fine-grained HOI motions.

CVFeb 8, 2025
Fg-T2M++: LLMs-Augmented Fine-Grained Text Driven Human Motion Generation

Yin Wang, Mu Li, Jiapeng Liu et al.

We address the challenging problem of fine-grained text-driven human motion generation. Existing works generate imprecise motions that fail to accurately capture relationships specified in text due to: (1) lack of effective text parsing for detailed semantic cues regarding body parts, (2) not fully modeling linguistic structures between words to comprehend text comprehensively. To tackle these limitations, we propose a novel fine-grained framework Fg-T2M++ that consists of: (1) an LLMs semantic parsing module to extract body part descriptions and semantics from text, (2) a hyperbolic text representation module to encode relational information between text units by embedding the syntactic dependency graph into hyperbolic space, and (3) a multi-modal fusion module to hierarchically fuse text and motion features. Extensive experiments on HumanML3D and KIT-ML datasets demonstrate that Fg-T2M++ outperforms SOTA methods, validating its ability to accurately generate motions adhering to comprehensive text semantics.

ASAug 25, 2025
Unseen Speaker and Language Adaptation for Lightweight Text-To-Speech with Adapters

Alessio Falai, Ziyao Zhang, Akos Gangoly

In this paper we investigate cross-lingual Text-To-Speech (TTS) synthesis through the lens of adapters, in the context of lightweight TTS systems. In particular, we compare the tasks of unseen speaker and language adaptation with the goal of synthesising a target voice in a target language, in which the target voice has no recordings therein. Results from objective evaluations demonstrate the effectiveness of adapters in learning language-specific and speaker-specific information, allowing pre-trained models to learn unseen speaker identities or languages, while avoiding catastrophic forgetting of the original model's speaker or language information. Additionally, to measure how native the generated voices are in terms of accent, we propose and validate an objective metric inspired by mispronunciation detection techniques in second-language (L2) learners. The paper also provides insights into the impact of adapter placement, configuration and the number of speakers used.

CVJul 18, 2025
Training-free Token Reduction for Vision Mamba

Qiankun Ma, Ziyao Zhang, Chi Su et al.

Vision Mamba has emerged as a strong competitor to Vision Transformers (ViTs) due to its ability to efficiently capture long-range dependencies with linear computational complexity. While token reduction, an effective compression technique in ViTs, has rarely been explored in Vision Mamba. Exploring Vision Mamba's efficiency is essential for enabling broader applications. However, we find that directly applying existing token reduction techniques for ViTs to Vision Mamba leads to significant performance degradation. This is primarily because Mamba is a sequence model without attention mechanisms, whereas most token reduction techniques for ViTs rely on attention mechanisms for importance measurement and overlook the order of compressed tokens. In this paper, we investigate a Mamba structure-aware importance score to evaluate token importance in a simple and effective manner. Building on this score, we further propose MTR, a training-free \textbf{M}amba \textbf{T}oken \textbf{R}eduction framework. Without the need for training or additional tuning parameters, our method can be seamlessly integrated as a plug-and-play component across various Mamba models. Extensive experiments demonstrate that our approach significantly reduces computational workload while minimizing performance impact across various tasks and multiple backbones. Notably, MTR reduces FLOPs by approximately 40\% on the Vim-B backbone, with only a 1.6\% drop in ImageNet performance without retraining.

IVFeb 3, 2022
PARCEL: Physics-based Unsupervised Contrastive Representation Learning for Multi-coil MR Imaging

Shanshan Wang, Ruoyou Wu, Cheng Li et al.

With the successful application of deep learning to magnetic resonance (MR) imaging, parallel imaging techniques based on neural networks have attracted wide attention. However, in the absence of high-quality, fully sampled datasets for training, the performance of these methods is limited. And the interpretability of models is not strong enough. To tackle this issue, this paper proposes a Physics-bAsed unsupeRvised Contrastive rEpresentation Learning (PARCEL) method to speed up parallel MR imaging. Specifically, PARCEL has a parallel framework to contrastively learn two branches of model-based unrolling networks from augmented undersampled multi-coil k-space data. A sophisticated co-training loss with three essential components has been designed to guide the two networks in capturing the inherent features and representations for MR images. And the final MR image is reconstructed with the trained contrastive networks. PARCEL was evaluated on two vivo datasets and compared to five state-of-the-art methods. The results show that PARCEL is able to learn essential representations for accurate MR reconstruction without relying on fully sampled datasets.

LGJun 5, 2020
State Action Separable Reinforcement Learning

Ziyao Zhang, Liang Ma, Kin K. Leung et al.

Reinforcement Learning (RL) based methods have seen their paramount successes in solving serial decision-making and control problems in recent years. For conventional RL formulations, Markov Decision Process (MDP) and state-action-value function are the basis for the problem modeling and policy evaluation. However, several challenging issues still remain. Among most cited issues, the enormity of state/action space is an important factor that causes inefficiency in accurately approximating the state-action-value function. We observe that although actions directly define the agents' behaviors, for many problems the next state after a state transition matters more than the action taken, in determining the return of such a state transition. In this regard, we propose a new learning paradigm, State Action Separable Reinforcement Learning (sasRL), wherein the action space is decoupled from the value function learning process for higher efficiency. Then, a light-weight transition model is learned to assist the agent to determine the action that triggers the associated state transition. In addition, our convergence analysis reveals that under certain conditions, the convergence time of sasRL is $O(T^{1/k})$, where $T$ is the convergence time for updating the value function in the MDP-based formulation and $k$ is a weighting factor. Experiments on several gaming scenarios show that sasRL outperforms state-of-the-art MDP-based RL algorithms by up to $75\%$.

NIJan 9, 2020
Neural Network Tomography

Liang Ma, Ziyao Zhang, Mudhakar Srivatsa

Network tomography, a classic research problem in the realm of network monitoring, refers to the methodology of inferring unmeasured network attributes using selected end-to-end path measurements. In the research community, network tomography is generally investigated under the assumptions of known network topology, correlated path measurements, bounded number of faulty nodes/links, or even special network protocol support. The applicability of network tomography is considerably constrained by these strong assumptions, which therefore frequently position it in the theoretical world. In this regard, we revisit network tomography from the practical perspective by establishing a generic framework that does not rely on any of these assumptions or the types of performance metrics. Given only the end-to-end path performance metrics of sampled node pairs, the proposed framework, NeuTomography, utilizes deep neural network and data augmentation to predict the unmeasured performance metrics via learning non-linear relationships between node pairs and underlying unknown topological/routing properties. In addition, NeuTomography can be employed to reconstruct the original network topology, which is critical to most network planning tasks. Extensive experiments using real network data show that comparing to baseline solutions, NeuTomography can predict network characteristics and reconstruct network topologies with significantly higher accuracy and robustness using only limited measurement data.

NISep 19, 2019
MACS: Deep Reinforcement Learning based SDN Controller Synchronization Policy Design

Ziyao Zhang, Liang Ma, Konstantinos Poularakis et al.

In distributed software-defined networks (SDN), multiple physical SDN controllers, each managing a network domain, are implemented to balance centralised control, scalability, and reliability requirements. In such networking paradigms, controllers synchronize with each other, in attempts to maintain a logically centralised network view. Despite the presence of various design proposals for distributed SDN controller architectures, most existing works only aim at eliminating anomalies arising from the inconsistencies in different controllers' network views. However, the performance aspect of controller synchronization designs with respect to given SDN applications are generally missing. To fill this gap, we formulate the controller synchronization problem as a Markov decision process (MDP) and apply reinforcement learning techniques combined with deep neural networks (DNNs) to train a smart, scalable, and fine-grained controller synchronization policy, called the Multi-Armed Cooperative Synchronization (MACS), whose goal is to maximise the performance enhancements brought by controller synchronizations. Evaluation results confirm the DNN's exceptional ability in abstracting latent patterns in the distributed SDN environment, rendering significant superiority to MACS-based synchronization policy, which are 56% and 30% performance improvements over ONOS and greedy SDN controller synchronization heuristics.