Haoyu Gao

h-index: 3

4 papers

96 citations

Novelty: 45%

AI Score: 43

Ranked #55,244 of 194,257 authors (top 28%); #524 in SE (top 17%)

4 Papers

17.4 · CL · Sep 22, 2023

Self-Explanation Prompting Improves Dialogue Understanding in Large Language Models

Haoyu Gao, Ting-En Lin, Hangyu Li et al.

Task-oriented dialogue (TOD) systems facilitate users in executing various activities via multi-turn dialogues, but Large Language Models (LLMs) often struggle to comprehend these intricate contexts. In this study, we propose a novel "Self-Explanation" prompting strategy to enhance the comprehension abilities of LLMs in multi-turn dialogues. This task-agnostic approach requires the model to analyze each dialogue utterance before task execution, thereby improving performance across various dialogue-centric tasks. Experimental results from six benchmark datasets confirm that our method consistently outperforms other zero-shot prompts and matches or exceeds the efficacy of few-shot prompts, demonstrating its potential as a powerful tool in enhancing LLMs' comprehension in complex dialogue tasks.
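
The prompting idea in this abstract (have the model analyze each utterance before it executes the task) can be sketched as a small prompt builder. This is only an illustration under stated assumptions: the function name `build_self_explanation_prompt`, the instruction wording, and the turn-numbering format are hypothetical and are not taken from the paper.

```python
def build_self_explanation_prompt(dialogue, task_instruction):
    """Assemble a zero-shot prompt that asks for per-utterance explanations first.

    dialogue: list of (speaker, utterance) pairs, in order.
    task_instruction: the downstream task, stated in plain text.
    The wording below is an assumption made for illustration only.
    """
    # Number each turn so the model can refer back to it in its explanations.
    turns = [
        f"{i}. {speaker}: {utterance}"
        for i, (speaker, utterance) in enumerate(dialogue, start=1)
    ]
    return (
        "Dialogue:\n" + "\n".join(turns) + "\n\n"
        "First, explain each utterance above in order, one explanation per numbered turn.\n"
        "Then, using those explanations, complete the task.\n\n"
        f"Task: {task_instruction}\n"
    )


if __name__ == "__main__":
    example = [
        ("user", "I need a taxi to the airport."),
        ("system", "What time would you like to leave?"),
    ]
    print(build_self_explanation_prompt(example, "Summarize the user's goal."))
```

Because the explanation step sits ahead of the task statement and does not depend on it, the same builder could be reused across different dialogue tasks by changing only `task_instruction`.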

9.8 · SE · Apr 30

AI Failures in the Eyes of the Downstream Developer: A First Look at Concerns, Practices, and Challenges

Haoyu Gao, Mansooreh Zahedi, Wenxin Jiang et al.

With the advancement of AI models, more software systems are adopting AI as a component to facilitate automation. Pre-trained models (PTMs) have become a cornerstone of AI-based software, allowing for rapid integration and development with lower training cost. However, their adoption also introduces failure modes, such as data leakage and biased outputs, that may require careful handling by downstream developers. While previous research has proposed taxonomies of these technical concerns and various mitigation strategies, how downstream developers address these issues during the development of general AI-based software when reusing PTMs remains unexplored. Understanding downstream developers' perspectives is essential because they directly influence how these potential failure concerns translate into practice, such as determining whether immediate risks like data leakage or model bias are recognised, mitigated, or inadvertently overlooked in real-world deployments. This study investigates downstream developers' concerns, practices, and perceived challenges regarding practical AI failures during the development of AI-based software. To achieve this, we conducted a mixed-method study, including interviews with 16 participants, a survey of 86 practitioners,

12.6 · SE · Mar 20, 2025

CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models

Hong Yi Lin, Chunhua Liu, Haoyu Gao et al.

State-of-the-art large language models (LLMs) have demonstrated impressive code generation capabilities but struggle with real-world software engineering tasks, such as revising source code to address code reviews, hindering their practical use. Code review comments are often implicit, ambiguous, and colloquial, requiring models to grasp both code and human intent. This challenge calls for evaluating large language models' ability to bridge both technical and conversational contexts. While existing work has employed the automated code refinement (ACR) task to resolve these comments, current evaluation methods fall short, relying on text matching metrics that provide limited insight into model failures and remain susceptible to training data contamination. To address these limitations, we introduce a novel evaluation benchmark, CodeReviewQA, that enables us to conduct fine-grained assessment of model capabilities and mitigate data contamination risks. In CodeReviewQA, we decompose the generation task of code refinement into three essential reasoning steps: change type recognition (CTR), change localisation (CL), and solution identification (SI). Each step is reformulated as multiple-choice questions with varied difficulty levels, enabling precise assessment of model capabilities while mitigating data contamination risks. Our comprehensive evaluation spans 72 recently released large language models on 900 manually curated, high-quality examples across nine programming languages. Our results show that CodeReviewQA is able to expose specific model weaknesses in code review comprehension, disentangled from their generative automated code refinement results.
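
Since each of the three steps is posed as a multiple-choice question, one simple way to report results is accuracy per step. The sketch below assumes a hypothetical record layout (`step`, `predicted`, `answer`); it is not the benchmark's actual data format or scoring code.

```python
from collections import defaultdict

# Step abbreviations as given in the abstract.
STEPS = ("CTR", "CL", "SI")


def per_step_accuracy(records):
    """Return {step: accuracy} for multiple-choice records.

    records: iterable of dicts with the hypothetical keys
    'step' (one of STEPS), 'predicted', and 'answer' (option labels).
    Steps with no records are left out of the result.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["step"]] += 1
        correct[rec["step"]] += int(rec["predicted"] == rec["answer"])
    return {step: correct[step] / total[step] for step in STEPS if total[step]}


if __name__ == "__main__":
    demo = [
        {"step": "CTR", "predicted": "A", "answer": "A"},
        {"step": "CTR", "predicted": "B", "answer": "C"},
        {"step": "CL", "predicted": "D", "answer": "D"},
    ]
    print(per_step_accuracy(demo))
```

Reporting the steps separately is what lets a weakness in, say, change localisation show up even when the other two steps score well.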

3.7 · CV · May 22, 2024

QGait: Toward Accurate Quantization for Gait Recognition

Senmao Tian, Haoyu Gao, Gangyi Hong et al.

Existing deep learning methods have made significant progress in gait recognition. Quantization can facilitate the application of gait models as a model-agnostic general compression technique. Typically, appearance-based models binarize inputs into silhouette sequences. However, mainstream quantization methods prioritize minimizing task loss over quantization error, which is detrimental to gait recognition with binarized inputs. To address this, we propose a differentiable soft quantizer, which better simulates the gradient of the round function during backpropagation. This enables the network to learn from subtle input perturbations. However, our theoretical analysis and empirical studies reveal that directly applying the soft quantizer can hinder network convergence. We addressed this issue by adopting a two-stage training strategy, introducing a soft quantizer during the fine-tuning phase. However, in the first stage of training, we observed a significant change in the output distribution of different samples in the feature space compared to the full-precision network. It is this change that led to a loss in performance. Based on this, we propose an Inter-class Distance-guided Calibration (IDC) strategy to preserve the relative distance between the embeddings of samples with different labels. Extensive experiments validate the effectiveness of our approach, demonstrating state-of-the-art accuracy across various settings and datasets.
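
To illustrate what a differentiable stand-in for rounding can look like, the sketch below uses a well-known tanh-based soft-rounding function with a sharpness parameter `alpha`. This is an assumption chosen for illustration; the abstract does not give the form of the paper's soft quantizer, and the function names here are hypothetical.

```python
import math


def soft_round(x, alpha):
    """Tanh-based soft rounding (illustrative; not the paper's quantizer).

    Large alpha approaches hard rounding; alpha near zero approaches
    the identity function.
    """
    m = math.floor(x)
    r = x - m - 0.5  # offset from the midpoint of the unit cell, in [-0.5, 0.5)
    return m + 0.5 + 0.5 * math.tanh(alpha * r) / math.tanh(alpha / 2.0)


def soft_round_grad(x, alpha):
    """Analytic derivative of soft_round with respect to x.

    Unlike the hard round function, whose gradient is zero almost
    everywhere, this gradient is strictly positive.
    """
    r = x - math.floor(x) - 0.5
    return 0.5 * alpha * (1.0 - math.tanh(alpha * r) ** 2) / math.tanh(alpha / 2.0)


if __name__ == "__main__":
    for a in (1.0, 5.0, 50.0):
        print(a, soft_round(2.3, a), soft_round_grad(2.3, a))
```

With a smooth function like this, small input perturbations still produce a nonzero gradient, which is the property the abstract attributes to a soft quantizer. The sharpness could plausibly be increased over training, though that schedule is likewise an assumption and not something the abstract states.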