Beijun Shen

SE
h-index8
14papers
139citations
Novelty51%
AI Score56

14 Papers

94.0SEMay 27Code
Rethinking Code Complexity Through the Lens of Large Language Models

Chen Xie, Xiaodong Gu, Yuling Shi et al.

Code complexity metrics such as cyclomatic complexity have long been used to assess software quality and maintainability. With the rapid advancement of large language models (LLMs) on coding tasks, an important yet underexplored question arises: do traditional complexity metrics meaningfully characterize the coding difficulty that LLMs perceive? In this work, we empirically demonstrate that classical complexity metrics exhibit no consistent correlation with LLM performance, revealing a fundamental mismatch with model-perceived difficulty. To address this gap, we propose LM-CC, a novel code complexity metric tailored for LLMs, grounded in the hypothesis that model-perceived code difficulty is fundamentally driven by semantic nonlinearity. LM-CC quantifies complexity through an entropy-guided semantic compositional hierarchy, capturing the cumulative uncertainty encountered by LLMs during code understanding. Our experimental results demonstrate that LM-CC exhibits strong and consistent partial correlations with LLM performance, while semantics-preserving reductions in LM-CC consistently lead to improved downstream task performance. The source code is available at: https://github.com/xchen121/lm-cc.

AIAug 8, 2023
InfeRE: Step-by-Step Regex Generation via Chain of Inference

Shuai Zhang, Xiaodong Gu, Yuting Chen et al.

Automatically generating regular expressions (abbrev. regexes) from natural language description (NL2RE) has been an emerging research area. Prior studies treat regex as a linear sequence of tokens and generate the final expressions autoregressively in a single pass. They did not take into account the step-by-step internal text-matching processes behind the final results. This significantly hinders the efficacy and interpretability of regex generation by neural language models. In this paper, we propose a new paradigm called InfeRE, which decomposes the generation of regexes into chains of step-by-step inference. To enhance the robustness, we introduce a self-consistency decoding mechanism that ensembles multiple outputs sampled from different models. We evaluate InfeRE on two publicly available datasets, NL-RX-Turk and KB13, and compare the results with state-of-the-art approaches and the popular tree-based generation approach TRANX. Experimental results show that InfeRE substantially outperforms previous baselines, yielding 16.3% and 14.7% improvement in DFA@5 accuracy on two datasets, respectively. Particularly, InfeRE outperforms the popular tree-based generation approach by 18.1% and 11.3% on both datasets, respectively, in terms of DFA@5 accuracy.

AIJan 8
GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts

Wenhao Zeng, Xuteng Zhang, Yuling Shi et al.

Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining when a reasoning step requires the capacity of a large model or the efficiency of a small model. Existing routing strategies either rely on local token probabilities or post-hoc verification, introducing significant inference overhead. In this work, we propose a novel perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the "Aha Moment" phenomenon in LRMs, we show that the entropy of the initial token serves as a strong predictor of step difficulty. Building on this insight, we introduce GlimpRouter, a training-free step-wise collaboration framework. GlimpRouter employs a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold. Experiments on multiple benchmarks demonstrate that our approach significantly reduces inference latency while preserving accuracy. For instance, GlimpRouter attains a substantial 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to a standalone large model on AIME25. These results suggest a simple yet effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than full-step evaluation.

SEJan 1
In Line with Context: Repository-Level Code Generation via Context Inlining

Chao Hu, Wenhao Zeng, Yuling Shi et al.

Repository-level code generation has attracted growing attention in recent years. Unlike function-level code generation, it requires the model to understand the entire repository, reasoning over complex dependencies across functions, classes, and modules. However, existing approaches such as retrieval-augmented generation (RAG) or context-based function selection often fall short: they primarily rely on surface-level similarity and struggle to capture the rich dependencies that govern repository-level semantics. In this paper, we introduce InlineCoder, a novel framework for repository-level code generation. InlineCoder enhances the understanding of repository context by inlining the unfinished function into its call graph, thereby reframing the challenging repository understanding as an easier function-level coding task. Given a function signature, InlineCoder first generates a draft completion, termed an anchor, which approximates downstream dependencies and enables perplexity-based confidence estimation. This anchor drives a bidirectional inlining process: (i) Upstream Inlining, which embeds the anchor into its callers to capture diverse usage scenarios; and (ii) Downstream Retrieval, which integrates the anchor's callees into the prompt to provide precise dependency context. The enriched context, combining draft completion with upstream and downstream perspectives, equips the LLM with a comprehensive repository view.

SEDec 23, 2025
Neuron-Guided Interpretation of Code LLMs: Where, Why, and How?

Zhe Yin, Xiaodong Gu, Beijun Shen

Code language models excel on code intelligence tasks, yet their internal interpretability is underexplored. Existing neuron interpretability techniques from NLP are suboptimal for source code due to programming languages formal, hierarchical, and executable nature. We empirically investigate code LLMs at the neuron level, localizing language-specific neurons (selectively responsive to one language) and concept layers (feed-forward layers encoding language-agnostic code representations). We analyze Llama-3.1-8B and Qwen2.5-Coder-32B on multilingual inputs in C++, Java, Python, Go, and JavaScript, measuring neuron selectivity and layerwise contributions during generation. We find (1) neurons specialized for individual languages alongside a universal subset supporting general-purpose generation; and (2) lower layers mainly encode language-specific syntax, while middle layers capture semantic abstractions shared across languages, emerging as concept layers. We demonstrate utility on three tasks: neuron-guided fine-tuning for code generation, clone detection via concept-layer embeddings, and concept-layer-guided transfer for code summarization, each yielding consistent gains in multilingual settings.

SEMar 19, 2021Code
Locating Faulty Methods with a Mixed RNN and Attention Model

Shouliang Yang, Junming Cao, Hushuang Zeng et al.

IR-based fault localization approaches achieves promising results when locating faulty files by comparing a bug report with source code. Unfortunately, they become less effective to locate faulty methods. We conduct a preliminary study to explore its challenges, and identify three problems: the semantic gap problem, the representation sparseness problem, and the single revision problem. To tackle these problems, we propose MRAM, a mixed RNN and attention model, which combines bug-fixing features and method structured features to explore both implicit and explicit relevance between methods and bug reports for method level fault localization task. The core ideas of our model are: (1) constructing code revision graphs from code, commits and past bug reports, which reveal the latent relations among methods to augment short methods and as well provide all revisions of code and past fixes to train more accurate models; (2) embedding three method structured features (token sequences, API invocation sequences, and comments) jointly with RNN and soft attention to represent source methods and obtain their implicit relevance with bug reports; and (3) integrating multirevision bug-fixing features, which provide the explicit relevance between bug reports and methods, to improve the performance. We have implemented MRAM and conducted a controlled experiment on five open-source projects. Comparing with stateof-the-art approaches, our MRAM improves MRR values by 3.8- 5.1% (3.7-5.4%) when the dataset contains (does not contain) localized bug reports. Our statistics test shows that our improvements are significant

CLSep 18, 2025
SWE-QA: Can Language Models Answer Repository-level Code Questions?

Weihan Peng, Yuling Shi, Yuhang Wang et al.

Understanding and reasoning about entire software repositories is an essential capability for intelligent software engineering tools. While existing benchmarks such as CoSQA and CodeQA have advanced the field, they predominantly focus on small, self-contained code snippets. These setups fail to capture the complexity of real-world repositories, where effective understanding and reasoning often require navigating multiple files, understanding software architecture, and grounding answers in long-range code dependencies. In this paper, we present SWE-QA, a repository-level code question answering (QA) benchmark designed to facilitate research on automated QA systems in realistic code environments. SWE-QA involves 576 high-quality question-answer pairs spanning diverse categories, including intention understanding, cross-file reasoning, and multi-hop dependency analysis. To construct SWE-QA, we first crawled 77,100 GitHub issues from 11 popular repositories. Based on an analysis of naturally occurring developer questions extracted from these issues, we developed a two-level taxonomy of repository-level questions and constructed a set of seed questions for each category. For each category, we manually curated and validated questions and collected their corresponding answers. As a prototype application, we further develop SWE-QA-Agent, an agentic framework in which LLM agents reason and act to find answers automatically. We evaluate six advanced LLMs on SWE-QA under various context augmentation strategies. Experimental results highlight the promise of LLMs, particularly our SWE-QA-Agent framework, in addressing repository-level QA, while also revealing open challenges and pointing to future research directions.

SEOct 15, 2024
Just-In-Time Software Defect Prediction via Bi-modal Change Representation Learning

Yuze Jiang, Beijun Shen, Xiaodong Gu

For predicting software defects at an early stage, researchers have proposed just-in-time defect prediction (JIT-DP) to identify potential defects in code commits. The prevailing approaches train models to represent code changes in history commits and utilize the learned representations to predict the presence of defects in the latest commit. However, existing models merely learn editions in source code, without considering the natural language intentions behind the changes. This limitation hinders their ability to capture deeper semantics. To address this, we introduce a novel bi-modal change pre-training model called BiCC-BERT. BiCC-BERT is pre-trained on a code change corpus to learn bi-modal semantic representations. To incorporate commit messages from the corpus, we design a novel pre-training objective called Replaced Message Identification (RMI), which learns the semantic association between commit messages and code changes. Subsequently, we integrate BiCC-BERT into JIT-DP and propose a new defect prediction approach -- JIT-BiCC. By leveraging the bi-modal representations from BiCC-BERT, JIT-BiCC captures more profound change semantics. We train JIT-BiCC using 27,391 code changes and compare its performance with 8 state-of-the-art JIT-DP approaches. The results demonstrate that JIT-BiCC outperforms all baselines, achieving a 10.8% improvement in F1-score. This highlights its effectiveness in learning the bi-modal semantics for JIT-DP.

CLOct 1, 2025
LongCodeZip: Compress Long Context for Code Language Models

Yuling Shi, Yichun Qian, Hongyu Zhang et al.

Code generation under long contexts is becoming increasingly critical as Large Language Models (LLMs) are required to reason over extensive information in the codebase. While recent advances enable code LLMs to process long inputs, high API costs and generation latency remain substantial bottlenecks. Existing context pruning techniques, such as LLMLingua, achieve promising results for general text but overlook code-specific structures and dependencies, leading to suboptimal performance in programming tasks. In this paper, we propose LongCodeZip, a novel plug-and-play code compression framework designed specifically for code LLMs. LongCodeZip employs a dual-stage strategy: (1) coarse-grained compression, which identifies and ranks function-level chunks using conditional perplexity with respect to the instruction, retaining only the most relevant functions; and (2) fine-grained compression, which segments retained functions into blocks based on perplexity and selects an optimal subset under an adaptive token budget to maximize relevance. Evaluations across multiple tasks, including code completion, summarization, and question answering, show that LongCodeZip consistently outperforms baseline methods, achieving up to a 5.6x compression ratio without degrading task performance. By effectively reducing context size while preserving essential information, LongCodeZip enables LLMs to better scale to real-world, large-scale code scenarios, advancing the efficiency and capability of code intelligence applications.

SEApr 21, 2025
Empowering AI to Generate Better AI Code: Guided Generation of Deep Learning Projects with LLMs

Chen Xie, Mingsheng Jiao, Xiaodong Gu et al.

While large language models (LLMs) have been widely applied to code generation, they struggle with generating entire deep learning projects, which are characterized by complex structures, longer functions, and stronger reliance on domain knowledge than general-purpose code. An open-domain LLM often lacks coherent contextual guidance and domain expertise for specific projects, making it challenging to produce complete code that fully meets user requirements. In this paper, we propose a novel planning-guided code generation method, DLCodeGen, tailored for generating deep learning projects. DLCodeGen predicts a structured solution plan, offering global guidance for LLMs to generate the project. The generated plan is then leveraged to retrieve semantically analogous code samples and subsequently abstract a code template. To effectively integrate these multiple retrieval-augmented techniques, a comparative learning mechanism is designed to generate the final code. We validate the effectiveness of our approach on a dataset we build for deep learning code generation. Experimental results demonstrate that DLCodeGen outperforms other baselines, achieving improvements of 9.7% in CodeBLEU and 3.6% in human evaluation metrics.

CLAug 20, 2025
Transplant Then Regenerate: A New Paradigm for Text Data Augmentation

Guangzhan Wang, Hongyu Zhang, Beijun Shen et al.

Data augmentation is a critical technique in deep learning. Traditional methods like Back-translation typically focus on lexical-level rephrasing, which primarily produces variations with the same semantics. While large language models (LLMs) have enhanced text augmentation by their "knowledge emergence" capability, controlling the style and structure of these outputs remains challenging and requires meticulous prompt engineering. In this paper, we propose LMTransplant, a novel text augmentation paradigm leveraging LLMs. The core idea of LMTransplant is transplant-then-regenerate: incorporating seed text into a context expanded by LLM, and asking the LLM to regenerate a variant based on the expanded context. This strategy allows the model to create more diverse and creative content-level variants by fully leveraging the knowledge embedded in LLMs, while preserving the core attributes of the original text. We evaluate LMTransplant across various text-related tasks, demonstrating its superior performance over existing text augmentation methods. Moreover, LMTransplant demonstrates exceptional scalability as the size of augmented data grows.

CLApr 25, 2025
Anti-adversarial Learning: Desensitizing Prompts for Large Language Models

Xuan Li, Zhe Yin, Xiaodong Gu et al.

With the widespread use of LLMs, preserving privacy in user prompts has become crucial, as prompts risk exposing privacy and sensitive data to the cloud LLMs. Traditional techniques like homomorphic encryption, secure multi-party computation, and federated learning face challenges due to heavy computational costs and user participation requirements, limiting their applicability in LLM scenarios. In this paper, we propose PromptObfus, a novel method for desensitizing LLM prompts. The core idea of PromptObfus is "anti-adversarial" learning, which perturbs privacy words in the prompt to obscure sensitive information while retaining the stability of model predictions. Specifically, PromptObfus frames prompt desensitization as a masked language modeling task, replacing privacy-sensitive terms with a [MASK] token. A desensitization model is trained to generate candidate replacements for each masked position. These candidates are subsequently selected based on gradient feedback from a surrogate model, ensuring minimal disruption to the task output. We demonstrate the effectiveness of our approach on three NLP tasks. Results show that PromptObfus effectively prevents privacy inference from remote LLMs while preserving task performance.

SEJan 1, 2022
Cross-Domain Deep Code Search with Meta Learning

Yitian Chai, Hongyu Zhang, Beijun Shen et al.

Recently, pre-trained programming language models such as CodeBERT have demonstrated substantial gains in code search. Despite showing great performance, they rely on the availability of large amounts of parallel data to fine-tune the semantic mappings between queries and code. This restricts their practicality in domain-specific languages with relatively scarce and expensive data. In this paper, we propose CroCS, a novel approach for domain-specific code search. CroCS employs a transfer learning framework where an initial program representation model is pre-trained on a large corpus of common programming languages (such as Java and Python) and is further adapted to domain-specific languages such as SQL and Solidity. Unlike cross-language CodeBERT, which is directly fine-tuned in the target language, CroCS adapts a few-shot meta-learning algorithm called MAML to learn the good initialization of model parameters, which can be best reused in a domain-specific language. We evaluate the proposed approach on two domain-specific languages, namely, SQL and Solidity, with model transferred from two widely used languages (Python and Java). Experimental results show that CDCS significantly outperforms conventional pre-trained code models that are directly fine-tuned in domain-specific languages, and it is particularly effective for scarce data.

SEJun 5, 2021
An Empirical Study on Tensor Shape Faults in Deep Learning Systems

Dangwei Wu, Beijun Shen, Yuting Chen

Software developers frequently adopt deep learning (DL) libraries to incorporate learning solutions into software systems. However, misuses of these libraries can cause various DL faults. Among them, tensor shape faults are most prevalent. Tensor shape faults occur when restriction conditions of operations are not met, leading to many system crashes. To support efficient detection and fixing of these faults, we conduct an empirical study to obtain a deep insight. We construct SFData, a set of 146 buggy programs with crashing tensor shape faults (i.e., those causing programs to crash). By analyzing the faults in SFData, we categorize them into four types and get some valuable observations.