Hongyu Kuang

SE
h-index11
6papers
76citations
Novelty45%
AI Score48

6 Papers

SEJun 1
Improving LLM-Based Go Code Review through Issue-List Generation and Context Augmentation

Kexin Sun, Yucong Guan, Jiaqi Sun et al.

LLMs have shown strong potential for automating code review, yet their practical utility depends heavily on the design of generation and context strategies. In this paper, we investigate how to improve LLM-based code review through generation strategy and contextual augmentation. We first propose an issue-list review paradigm, in which LLMs enumerate all potential issues rather than reporting only the single most important one (i.e., primary-issue review). We then systematically compare three types of code context augmentation -- neighboring, LSP-based semantics, and IR-based similar co-change context -- and study how they influence issue discovery. Finally, we integrate candidates from no-context and context-enhanced generation to improve review coverage, and introduce refinement-guided pruning to keep the candidate list at a practical size. We evaluate our approach on 1,438 Go review instances using downstream code refinement as the main metric, i.e., how often the candidate list contains at least one comment inducing the same code change as the final human revision. For comparison, we evaluate comments by CodeReviewer, a model trained specifically for review comment generation, as well as ground-truth human review comments (as a practical upper bound), under the same refinement-based evaluation. The results show that our best configuration, combining issue-list review, neighboring and similar co-change context, and candidate integration, reaches 28.00% refinement exact match, a statistically significant gain of +10.85 percentage points over primary-issue review without any additional context (17.15%), substantially outperforming CodeReviewer (15.02%) and approaching the human-oracle ceiling of 36.09%. Our refinement-guided pruning reduces the average candidate count from 7.2 to 3.1 at top-5 while retaining nearly the full benefit, making the candidate list easier to inspect.

SEJan 27Code
AACR-Bench: Evaluating Automatic Code Review with Holistic Repository-Level Context

Lei Zhang, Yongda Yu, Minghui Yu et al.

High-quality evaluation benchmarks are pivotal for deploying Large Language Models (LLMs) in Automated Code Review (ACR). However, existing benchmarks suffer from two critical limitations: first, the lack of multi-language support in repository-level contexts, which restricts the generalizability of evaluation results; second, the reliance on noisy, incomplete ground truth derived from raw Pull Request (PR) comments, which constrains the scope of issue detection. To address these challenges, we introduce AACR-Bench a comprehensive benchmark that provides full cross-file context across multiple programming languages. Unlike traditional datasets, AACR-Bench employs an "AI-assisted, Expert-verified" annotation pipeline to uncover latent defects often overlooked in original PRs, resulting in a 285\% increase in defect coverage. Extensive evaluations of mainstream LLMs on AACR-Bench reveal that previous assessments may have either misjudged or only partially captured model capabilities due to data limitations. Our work establishes a more rigorous standard for ACR evaluation and offers new insights on LLM based ACR, i.e., the granularity/level of context and the choice of retrieval methods significantly impact ACR performance, and this influence varies depending on the LLM, programming language, and the LLM usage paradigm e.g., whether an Agent architecture is employed. The code, data, and other artifacts of our evaluation set are available at https://github.com/alibaba/aacr-bench .

SEApr 25
Does AI Code Review Lead to Code Changes? A Case Study of GitHub Actions

Kexin Sun, Hongyu Kuang, Sebastian Baltes et al.

AI-based code review tools automatically review and comment on pull requests to improve code quality. Despite their growing presence, little is known about their actual impact. We present a large-scale empirical study of 16 popular AI-based code review actions for GitHub workflows, analyzing more than 22,000 review comments in 178 repositories. We investigate (1) how these tools are adopted and configured, (2) whether their comments lead to code changes, and (3) which factors influence their effectiveness. We develop a two-stage LLM-assisted framework to determine whether review comments are addressed, and use interpretable machine learning to identify influencing factors. Our findings show that, while adoption is growing, effectiveness varies widely. Comments that are concise, contain code snippets, and are manually triggered, particularly those from hunk-level review tools, are more likely to result in code changes. These results highlight the importance of careful tool design and suggest directions for improving AI-based code review systems.

CRAug 12, 2024
Audit-LLM: Multi-Agent Collaboration for Log-based Insider Threat Detection

Chengyu Song, Linru Ma, Jianming Zheng et al.

Log-based insider threat detection (ITD) detects malicious user activities by auditing log entries. Recently, large language models (LLMs) with strong common sense knowledge have emerged in the domain of ITD. Nevertheless, diverse activity types and overlong log files pose a significant challenge for LLMs in directly discerning malicious ones within myriads of normal activities. Furthermore, the faithfulness hallucination issue from LLMs aggravates its application difficulty in ITD, as the generated conclusion may not align with user commands and activity context. In response to these challenges, we introduce Audit-LLM, a multi-agent log-based insider threat detection framework comprising three collaborative agents: (i) the Decomposer agent, breaking down the complex ITD task into manageable sub-tasks using Chain-of-Thought (COT) reasoning;(ii) the Tool Builder agent, creating reusable tools for sub-tasks to overcome context length limitations in LLMs; and (iii) the Executor agent, generating the final detection conclusion by invoking constructed tools. To enhance conclusion accuracy, we propose a pair-wise Evidence-based Multi-agent Debate (EMAD) mechanism, where two independent Executors iteratively refine their conclusions through reasoning exchange to reach a consensus. Comprehensive experiments conducted on three publicly available ITD datasets-CERT r4.2, CERT r5.2, and PicoDomain-demonstrate the superiority of our method over existing baselines and show that the proposed EMAD significantly improves the faithfulness of explanations generated by LLMs.

SEMar 24, 2021
Exploiting the Unique Expression for Improved Sentiment Analysis in Software Engineering Text

Kexin Sun, Hui Gao, Hongyu Kuang et al.

Sentiment analysis on software engineering (SE) texts has been widely used in the SE research, such as evaluating app reviews or analyzing developers sentiments in commit messages. To better support the use of automated sentiment analysis for SE tasks, researchers built an SE-domain-specified sentiment dictionary to further improve the accuracy of the results. Unfortunately, recent work reported that current mainstream tools for sentiment analysis still cannot provide reliable results when analyzing the sentiments in SE texts. We suggest that the reason for this situation is because the way of expressing sentiments in SE texts is largely different from the way in social network or movie comments. In this paper, we propose to improve sentiment analysis in SE texts by using sentence structures, a different perspective from building a domain dictionary. Specifically, we use sentence structures to first identify whether the author is expressing her sentiment in a given clause of an SE text, and to further adjust the calculation of sentiments which are confirmed in the clause. An empirical evaluation based on four different datasets shows that our approach can outperform two dictionary-based baseline approaches, and is more generalizable compared to a learning-based baseline approach.

AIDec 2, 2018
Macro action selection with deep reinforcement learning in StarCraft

Sijia Xu, Hongyu Kuang, Zhi Zhuang et al.

StarCraft (SC) is one of the most popular and successful Real Time Strategy (RTS) games. In recent years, SC is also widely accepted as a challenging testbed for AI research because of its enormous state space, partially observed information, multi-agent collaboration, and so on. With the help of annual AIIDE and CIG competitions, a growing number of SC bots are proposed and continuously improved. However, a large gap remains between the top-level bot and the professional human player. One vital reason is that current SC bots mainly rely on predefined rules to select macro actions during their games. These rules are not scalable and efficient enough to cope with the enormous yet partially observed state space in the game. In this paper, we propose a deep reinforcement learning (DRL) framework to improve the selection of macro actions. Our framework is based on the combination of the Ape-X DQN and the Long-Short-Term-Memory (LSTM). We use this framework to build our bot, named as LastOrder. Our evaluation, based on training against all bots from the AIIDE 2017 StarCraft AI competition set, shows that LastOrder achieves an 83% winning rate, outperforming 26 bots in total 28 entrants.