CLMar 22, 2023Code
RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and GenerationFengji Zhang, Bei Chen, Yue Zhang et al.
The task of repository-level code completion is to continue writing the unfinished code based on a broader context of the repository. While for automated code completion tools, it is difficult to utilize the useful information scattered in different files. We propose RepoCoder, a simple, generic, and effective framework to address the challenge. It streamlines the repository-level code completion process by incorporating a similarity-based retriever and a pre-trained code language model in an iterative retrieval-generation pipeline. RepoCoder makes effective utilization of repository-level information for code completion and has the ability to generate code at various levels of granularity. Moreover, we propose a new benchmark RepoEval, which consists of the latest and high-quality real-world repositories covering line, API invocation, and function body completion scenarios. Experimental results indicate that RepoCoder significantly improves the In-File completion baseline by over 10% in all settings and consistently outperforms the vanilla retrieval-augmented code completion approach. Furthermore, we validate the effectiveness of RepoCoder through comprehensive analysis, providing valuable insights for future research. Our source code and benchmark are publicly available: https://github.com/microsoft/CodeT/tree/main/RepoCoder
SEAug 24, 2022
Diverse Title Generation for Stack Overflow Posts with Multiple Sampling Enhanced TransformerFengji Zhang, Jin Liu, Yao Wan et al.
Stack Overflow is one of the most popular programming communities where developers can seek help for their encountered problems. Nevertheless, if inexperienced developers fail to describe their problems clearly, it is hard for them to attract sufficient attention and get the anticipated answers. We propose M$_3$NSCT5, a novel approach to automatically generate multiple post titles from the given code snippets. Developers may use the generated titles to find closely related posts and complete their problem descriptions. M$_3$NSCT5 employs the CodeT5 backbone, which is a pre-trained Transformer model having an excellent language understanding and generation ability. To alleviate the ambiguity issue that the same code snippets could be aligned with different titles under varying contexts, we propose the maximal marginal multiple nucleus sampling strategy to generate multiple high-quality and diverse title candidates at a time for the developers to choose from. We build a large-scale dataset with 890,000 question posts covering eight programming languages to validate the effectiveness of M$_3$NSCT5. The automatic evaluation results on the BLEU and ROUGE metrics demonstrate the superiority of M$_3$NSCT5 over six state-of-the-art baseline models. Moreover, a human evaluation with trustworthy results also demonstrates the great potential of our approach for real-world application.
CVSep 19, 2023
Cross-modal and Cross-domain Knowledge Transfer for Label-free 3D SegmentationJingyu Zhang, Huitong Yang, Dai-Jie Wu et al.
Current state-of-the-art point cloud-based perception methods usually rely on large-scale labeled data, which requires expensive manual annotations. A natural option is to explore the unsupervised methodology for 3D perception tasks. However, such methods often face substantial performance-drop difficulties. Fortunately, we found that there exist amounts of image-based datasets and an alternative can be proposed, i.e., transferring the knowledge in the 2D images to 3D point clouds. Specifically, we propose a novel approach for the challenging cross-modal and cross-domain adaptation task by fully exploring the relationship between images and point clouds and designing effective feature alignment strategies. Without any 3D labels, our method achieves state-of-the-art performance for 3D point cloud semantic segmentation on SemanticKITTI by using the knowledge of KITTI360 and GTA5, compared to existing unsupervised and weakly-supervised baselines.
CVJan 15Code
LaViT: Aligning Latent Visual Thoughts for Multi-modal ReasoningLinquan Wu, Tianxiang Jiang, Yifei Dong et al.
Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o.
SEDec 9, 2025Code
FedLAD: A Modular and Adaptive Testbed for Federated Log Anomaly DetectionYihan Liao, Jacky Keung, Zhenyu Mao et al.
Log-based anomaly detection (LAD) is critical for ensuring the reliability of large-scale distributed systems. However, most existing LAD approaches assume centralized training, which is often impractical due to privacy constraints and the decentralized nature of system logs. While federated learning (FL) offers a promising alternative, there is a lack of dedicated testbeds tailored to the needs of LAD in federated settings. To address this, we present FedLAD, a unified platform for training and evaluating LAD models under FL constraints. FedLAD supports plug-and-play integration of diverse LAD models, benchmark datasets, and aggregation strategies, while offering runtime support for validation logging (self-monitoring), parameter tuning (self-configuration), and adaptive strategy control (self-adaptation). By enabling reproducible and scalable experimentation, FedLAD bridges the gap between FL frameworks and LAD requirements, providing a solid foundation for future research. Project code is publicly available at: https://github.com/AA-cityu/FedLAD.
45.8SEApr 25
Empirical Insights of Test Selection Metrics under Multiple Testing Objectives and Distribution ShiftsJingyu Zhang, Fan Wang, Jacky Keung et al.
Deep learning (DL)-based systems can exhibit unexpected behavior when exposed to out-of-distribution (OOD) scenarios, posing serious risks in safety-critical domains such as malware detection and autonomous driving. This underscores the importance of thoroughly testing such systems before deployment. To this end, researchers have proposed a wide range of test selection metrics designed to effectively select inputs. However, prior evaluations of metrics reveal three key limitations: (1) narrow testing objectives, for example, many studies assess metrics only for fault detection, leaving their effectiveness for performance estimation unclear; (2) limited coverage of OOD scenarios, with natural and label shifts are rarely considered; (3) Biased dataset selection, where most work focuses on image data while other modalities remain underexplored. Consequently, a unified benchmark that examines how these metrics perform under multiple testing objectives, diverse OOD scenarios, and different data modalities is still lacking. This leaves practitioners uncertain about which test selection metrics are most suitable for their specific objectives and contexts. To address this gap, we conduct an extensive empirical study of 15 existing metrics, evaluating them under three testing objectives (fault detection, performance estimation, and retraining guidance), five types of OOD scenarios (corrupted, adversarial, temporal, natural, and label shifts), three data modalities (image, text, and Android packages), and 13 DL models. In total, our study encompasses 1,640 experimental scenarios, offering a comprehensive evaluation and statistical analysis.
CVOct 16, 2024Code
HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding TasksFengji Zhang, Linquan Wu, Huiyu Bai et al.
Understanding and reasoning over diagrams is a fundamental aspect of human intelligence. While Large Multimodal Models (LMMs) have demonstrated impressive capabilities across various tasks, existing benchmarks lack comprehensive evaluation of their diagram interpretation and reasoning abilities, particularly in coding contexts. We present HumanEval-V, a rigorous benchmark of human-annotated coding tasks that spans six task types and evaluates diverse visual reasoning capabilities. Each task features carefully crafted diagrams paired with function signatures and test cases, employing novel code generation tasks to thoroughly assess models' diagram comprehension. Through extensive experiments with 22 LMMs, we find that even top-performing models achieve modest success rates, with Claude 3.5 Sonnet reaching only 36.8% pass@1, highlighting substantial room for improvement. Our analysis reveals that current LMMs struggle with spatial transformations, topological relationships, and dynamic patterns that humans find intuitive. These findings provide valuable insights for advancing LMMs' visual reasoning abilities. We have open-sourced our code and benchmark at https://github.com/HumanEval-V/HumanEval-V-Benchmark.
CLOct 9, 2025Code
A$^2$Search: Ambiguity-Aware Question Answering with Reinforcement LearningFengji Zhang, Xinyao Niu, Chengyang Ying et al. · tsinghua
Recent advances in Large Language Models (LLMs) and Reinforcement Learning (RL) have led to strong performance in open-domain question answering (QA). However, existing models still struggle with questions that admit multiple valid answers. Standard QA benchmarks, which typically assume a single gold answer, overlook this reality and thus produce inappropriate training signals. Existing attempts to handle ambiguity often rely on costly manual annotation, which is difficult to scale to multi-hop datasets such as HotpotQA and MuSiQue. In this paper, we present A$^2$Search, an annotation-free, end-to-end training framework to recognize and handle ambiguity. At its core is an automated pipeline that detects ambiguous questions and gathers alternative answers via trajectory sampling and evidence verification. The model is then optimized with RL using a carefully designed $\mathrm{AnsF1}$ reward, which naturally accommodates multiple answers. Experiments on eight open-domain QA benchmarks demonstrate that A$^2$Search achieves new state-of-the-art performance. With only a single rollout, A$^2$Search-7B yields an average $\mathrm{AnsF1}@1$ score of $48.4\%$ across four multi-hop benchmarks, outperforming all strong baselines, including the substantially larger ReSearch-32B ($46.2\%$). Extensive analyses further show that A$^2$Search resolves ambiguity and generalizes across benchmarks, highlighting that embracing ambiguity is essential for building more reliable QA systems. Our code, data, and model weights can be found at https://github.com/zfj1998/A2Search
34.4SEApr 24
R2Code: A Self-Reflective LLM Framework for Requirements-to-Code TraceabilityYifei Wang, Jacky Keung, Xiaoxue Ma et al.
Accurate requirement-to-code traceability is crucial for software maintenance. However, existing IR- and embedding-based methods are heavily dependent on lexical similarity, often yielding incomplete or inconsistent links across projects and languages and incurring high cost from long-context retrieval and prompting. This paper presents R2Code, an LLM-based semantic traceability framework designed to improve trace link accuracy while reducing inference cost. R2Code integrates three components: 1) a decomposition-enhanced Bidirectional Alignment Network (BAN) that aligns four-layer requirement semantics with corresponding code structures to support cross-level semantic matching; 2) a Self-Reflective Consistency Verification (SRCV) module that conducts explanation-guided consistency checking to calibrate link reliability; and 3) a Dynamic Context-Adaptive Retrieval (DCAR) mechanism that adjusts retrieval granularity and filters contexts using semantic-overlap weighting for efficient context utilization. Experiments on five public datasets spanning multiple domains and two programming languages demonstrate that R2Code consistently outperforms the strongest baselines, achieving an average F1 gain of 7.4%, while reducing token consumption by up to 41.7% through adaptive context control.
33.7HCApr 27
Making Sense of Scams: Understanding Scam Conversations Through Multi-Level AlignmentZhenyu Mao, Jacky Keung, Xiangyu Li et al.
Online scams often unfold gradually through interaction, yet existing detection systems predominantly rely on snapshot-based signals and interruptive warnings, revealing two research gaps in the lack of signals that represent scam risk within conversational dynamics and the underexplored design of non-interruptive interaction. To address these gaps, we introduce multi-level alignment-based hints, informed by the Interactive Alignment Model, as a new detection signal for supporting sensemaking in scam-related conversations. These hints operationalize low-level lexical and syntactic alignments and high-level semantic and situation-model alignments between conversational participants, making conversational dynamics visible to users. We first conduct a preliminary evaluation on real-life scam dialogues, showing that as conversations approach scam attempts, low-level alignment scores remain stable while high-level alignment scores systematically decline, revealing a consistent cross-level pattern indicative of scam progression. Building on this insight, we conduct a user study with thirty participants, indicating that relative to the no-hint baseline, multi-level alignment-based hints increase precision by 0.25, recall by 0.16, and F1 score by 0.21, yielding substantially larger gains than the marginal improvements achieved by keyword-triggered alerts. Statistical analyses reveal that the proposed hints support earlier and more stable confidence formation over time, with ablation results further highlighting the effectiveness of combining alignment hints across levels in achieving these advantages.
74.4SEApr 7
An Empirical Study of Perceptions of General LLMs and Multimodal LLMs on Hugging FaceYujian Liu, Xiao Yu, Jacky Keung et al.
Large language models (LLMs) have rapidly evolved from general-purpose systems to multimodal models capable of processing text, images, and audio. As both general-purpose LLMs (GLLMs) and multimodal LLMs (MLLMs) gain widespread adoption, understanding user perceptions in real-world settings becomes increasingly important. However, existing studies often rely on surveys or platform-specific data (e.g., Reddit or GitHub issues), which either constrain user feedback through predefined questions or overemphasize failure-driven, debugging-oriented discussions, thus failing to capture diverse, experience-driven, and cross-model user perspectives in practice. To address this issue, we conduct an empirical study of user discussions on Hugging Face, a major model hub with diverse models and active communities. We collect and manually annotate 662 discussion threads from 38 representative models (21 GLLMs and 17 MLLMs), and develop a three-level taxonomy to systematically characterize user concerns. Our analysis reveals that LLM access barriers, generation quality, and deployment and invocation complexity are the most prominent concerns, alongside issues such as documentation limitations and resource constraints. Based on these findings, we derive actionable implications for improving LLM ecosystem.
CVJun 3, 2025
FaceSleuth-R: Adaptive Orientation-Aware Attention for Robust Micro-Expression RecognitionLinquan Wu, Tianxiang Jiang, Haoyu Yang et al.
Micro-expression recognition (MER) has achieved impressive accuracy in controlled laboratory settings. However, its real-world applicability faces a significant generalization cliff, severely hindering practical deployment due to poor performance on unseen data and susceptibility to domain shifts. Existing attention mechanisms often overfit to dataset-specific appearance cues or rely on fixed spatial priors, making them fragile in diverse environments. We posit that robust MER requires focusing on quasi-invariant motion orientations inherent to micro-expressions, rather than superficial pixel-level features. To this end, we introduce \textbf{FaceSleuth-R}, a framework centered on our novel \textbf{Single-Orientation Attention (SOA)} module. SOA is a lightweight, differentiable operator that enables the network to learn layer-specific optimal orientations, effectively guiding attention towards these robust motion cues. Through extensive experiments, we demonstrate that SOA consistently discovers a universal near-vertical motion prior across diverse datasets. More critically, FaceSleuth-R showcases superior generalization in rigorous Leave-One-Dataset-Out (LODO) protocols, significantly outperforming baselines and state-of-the-art methods when confronted with domain shifts. Furthermore, our approach establishes \textbf{state-of-the-art results} across several benchmarks. This work highlights adaptive orientation-aware attention as a key paradigm for developing truly generalized and high-performing MER systems.
CLSep 27, 2021
Improving Stack Overflow question title generation with copying enhanced CodeBERT model and bi-modal informationFengji Zhang, Xiao Yu, Jacky Keung et al.
Context: Stack Overflow is very helpful for software developers who are seeking answers to programming problems. Previous studies have shown that a growing number of questions are of low quality and thus obtain less attention from potential answerers. Gao et al. proposed an LSTM-based model (i.e., BiLSTM-CC) to automatically generate question titles from the code snippets to improve the question quality. However, only using the code snippets in the question body cannot provide sufficient information for title generation, and LSTMs cannot capture the long-range dependencies between tokens. Objective: This paper proposes CCBERT, a deep learning based novel model to enhance the performance of question title generation by making full use of the bi-modal information of the entire question body. Method: CCBERT follows the encoder-decoder paradigm and uses CodeBERT to encode the question body into hidden representations, a stacked Transformer decoder to generate predicted tokens, and an additional copy attention layer to refine the output distribution. Both the encoder and decoder perform the multi-head self-attention operation to better capture the long-range dependencies. This paper builds a dataset containing around 200,000 high-quality questions filtered from the data officially published by Stack Overflow to verify the effectiveness of the CCBERT model. Results: CCBERT outperforms all the baseline models on the dataset. Experiments on both code-only and low-resource datasets show the superiority of CCBERT with less performance degradation. The human evaluation also shows the excellent performance of CCBERT concerning both readability and correlation criteria.
SEMay 29, 2021
Investigating the Significance of Bellwether Effect to Improve Software Effort EstimationSolomon Mensah, Jacky Keung, Stephen G. MacDonell et al.
Bellwether effect refers to the existence of exemplary projects (called the Bellwether) within a historical dataset to be used for improved prediction performance. Recent studies have shown an implicit assumption of using recently completed projects (referred to as moving window) for improved prediction accuracy. In this paper, we investigate the Bellwether effect on software effort estimation accuracy using moving windows. The existence of the Bellwether was empirically proven based on six postulations. We apply statistical stratification and Markov chain methodology to select the Bellwether moving window. The resulting Bellwether moving window is used to predict the software effort of a new project. Empirical results show that Bellwether effect exist in chronological datasets with a set of exemplary and recently completed projects representing the Bellwether moving window. Result from this study has shown that the use of Bellwether moving window with the Gaussian weighting function significantly improve the prediction accuracy.
SEMay 16, 2021
Investigating the Significance of the Bellwether Effect to Improve Software Effort Prediction: Further Empirical StudySolomon Mensah, Jacky Keung, Stephen G. MacDonell et al.
Context: In addressing how best to estimate how much effort is required to develop software, a recent study found that using exemplary and recently completed projects [forming Bellwether moving windows (BMW)] in software effort prediction (SEP) models leads to relatively improved accuracy. More studies need to be conducted to determine whether the BMW yields improved accuracy in general, since different sizing and aging parameters of the BMW are known to affect accuracy. Objective: To investigate the existence of exemplary projects (Bellwethers) with defined window size and age parameters, and whether their use in SEP improves prediction accuracy. Method: We empirically investigate the moving window assumption based on the theory that the prediction outcome of a future event depends on the outcomes of prior events. Sampling of Bellwethers was undertaken using three introduced Bellwether methods (SSPM, SysSam, and RandSam). The ergodic Markov chain was used to determine the stationarity of the Bell-wethers. Results: Empirical results show that 1) Bellwethers exist in SEP and 2) the BMW has an approximate size of 50 to 80 exemplary projects that should not be more than 2 years old relative to the new projects to be estimated. Conclusion: The study's results add further weight to the recommended use of Bellwethers for improved prediction accuracy in SEP.
SEMar 21, 2021
A Systematical Study on Application Performance Management Libraries for AppsYutian Tang, Haoyu Wang, Xian Zhan et al.
Being able to automatically detect the performance issues in apps can significantly improve apps' quality as well as having a positive influence on user satisfaction. Application Performance Management (APM) libraries are used to locate the apps' performance bottleneck, monitor their behaviors at runtime, and identify potential security risks. Although app developers have been exploiting application performance management (APM) tools to capture these potential performance issues, most of them do not fully understand the internals of these APM tools and the effect on their apps. To fill this gap, in this paper, we conduct the first systematic study on APMs for apps by scrutinizing 25 widely-used APMs for Android apps and develop a framework named APMHunter for exploring the usage of APMs in Android apps. Using APMHunter, we conduct a large-scale empirical study on 500,000 Android apps to explore the usage patterns of APMs and discover the potential misuses of APMs. We obtain two major findings: 1) some APMs still employ deprecated permissions and approaches, which makes APMs fail to perform as expected; 2) inappropriate use of APMs can cause privacy leaks. Thus, our study suggests that both APM vendors and developers should design and use APMs scrupulously.
SEMar 12, 2021
A Multi-Modal Transformer-based Code Summarization Approach for Smart ContractsZhen Yang, Jacky Keung, Xiao Yu et al.
Code comment has been an important part of computer programs, greatly facilitating the understanding and maintenance of source code. However, high-quality code comments are often unavailable in smart contracts, the increasingly popular programs that run on the blockchain. In this paper, we propose a Multi-Modal Transformer-based (MMTrans) code summarization approach for smart contracts. Specifically, the MMTrans learns the representation of source code from the two heterogeneous modalities of the Abstract Syntax Tree (AST), i.e., Structure-based Traversal (SBT) sequences and graphs. The SBT sequence provides the global semantic information of AST, while the graph convolution focuses on the local details. The MMTrans uses two encoders to extract both global and local semantic information from the two modalities respectively, and then uses a joint decoder to generate code comments. Both the encoders and the decoder employ the multi-head attention structure of the Transformer to enhance the ability to capture the long-range dependencies between code tokens. We build a dataset with over 300K <method, comment> pairs of smart contracts, and evaluate the MMTrans on it. The experimental results demonstrate that the MMTrans outperforms the state-of-the-art baselines in terms of four evaluation metrics by a substantial margin, and can generate higher quality comments.
SEFeb 9, 2013
Effort-Oriented Classification Matrix of Web Service CompositionZheng Li, Liam O'Brien, Jacky Keung et al.
Within the service-oriented computing domain, Web service composition is an effective realization to satisfy the rapidly changing requirements of business. Therefore, the research into Web service composition has unfolded broadly. Since examining all of the related work in this area becomes a mission next to impossible, the classification of composition approaches can be used to facilitate multiple research tasks. However, the current attempts to classify Web service composition do not have clear objectives. Furthermore, the contexts and technologies of composition approaches are confused in the existing classifications. This paper proposes an effort-oriented classification matrix for Web service composition, which distinguishes between the context and technology dimension. The context dimension is aimed at analyzing the environment influence on the effort of Web service composition, while the technology dimension focuses on the technique influence on the effort. Consequently, besides the traditional classification benefits, this matrix can be used to build the basis of cost estimation for Web service composition in future research.
SEFeb 9, 2013
Software Cost Estimation Framework for Service-Oriented Architecture Systems using Divide-and-Conquer ApproachZheng Li, Jacky Keung
Due to the complexity of Service-Oriented Architecture (SOA), cost and effort estimation for SOA-based software development is more difficult than that for traditional software development. Unfortunately, there is a lack of published work about cost and effort estimation for SOA-based software. Existing cost estimation approaches are inadequate to address the complex service-oriented systems. This paper proposes a novel framework based on Divide-and-Conquer (D&C) for cost estimation for building SOA-based software. By dealing with separately development parts, the D&C framework can help organizations simplify and regulate SOA implementation cost estimation. Furthermore, both cost estimation modeling and software sizing work can be satisfied respectively by switching the corresponding metrics within this framework. Given the requirement of developing these metrics, this framework also defines the future research in four different directions according to the separate cost estimation sub-problems.