90.1 | SE | May 6 | Code
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
Kaifeng He, Xiaojun Zhang, Peiliang Cai et al.
Large language models (LLMs) frequently generate defective outputs in code generation tasks, ranging from logical bugs to security vulnerabilities. While these generation failures are often treated as model-level limitations, empirical evidence increasingly traces their root causes to imperfections within the training corpora. Yet, the specific mechanisms linking training data quality issues to generated code quality issues remain largely unmapped. This paper presents a systematic literature review of 114 primary studies to investigate how training data quality issues propagate into code generation. We establish a unified taxonomy that categorizes generated code quality issues across nine dimensions and training data quality issues into code and non-code attributes. Based on this taxonomy, we formalize a causal framework detailing 18 typical propagation mapping mechanisms. Furthermore, we synthesize state-of-the-art detection and mitigation techniques across the data, model, and generation lifecycles. The reviewed literature reveals a clear methodological shift: quality assurance is transitioning from reactive, heuristic-based post-generation filtering toward proactive, data-centric governance and closed-loop repair. Finally, we identify open challenges and outline research directions for developing reliable LLMs for code through integrated data curation and continuous evaluation. Our repository is available at https://github.com/SYSUSELab/From-Data-to-Code.
SY | Dec 28, 2017
Aircraft trajectory control with feedback linearization for general nonlinear system
Sheng Zhang, Fei Liao, Yanqing Chen et al.
The feedback linearization method is further developed for controller design on general nonlinear systems. Using Lyapunov stability theory, the intractable nonlinear implicit algebraic control equations are solved effectively, and asymptotic tracking performance is guaranteed. Moreover, it is proved that the controller may be used in an inverse-free version for set-point control. With this method, a nonlinear aircraft outer-loop trajectory controller is developed. To address the controller's robustness, integral control is incorporated to counteract the adverse effect of modeling errors. Simulation results verify the good performance of the proposed controller.
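For orientation, the classic form of feedback linearization that this work builds on can be sketched on a scalar, control-affine system. This is not the paper's method for general (implicit) nonlinear systems, and it omits the Lyapunov-based solver and the integral term. The plant functions `f` and `g`, the gain `k`, and the Euler integration are all assumptions chosen for illustration.

```python
import math


def f(x: float) -> float:
    """Hypothetical drift term of the plant x' = f(x) + g(x) * u."""
    return -x ** 3


def g(x: float) -> float:
    """Hypothetical input gain; always >= 1, so it can be inverted safely."""
    return 2.0 + math.sin(x)


def control(x: float, r: float, r_dot: float = 0.0, k: float = 2.0) -> float:
    """Cancel the nonlinearity, then impose the linear error dynamics e' = -k * e."""
    v = r_dot - k * (x - r)      # desired (linear) closed-loop behaviour
    return (v - f(x)) / g(x)     # explicit inversion, possible because the plant is affine in u


def simulate(x0: float, r: float, t_end: float = 5.0, dt: float = 0.01, k: float = 2.0) -> float:
    """Forward-Euler simulation of set-point tracking; returns the final state."""
    x = x0
    for _ in range(int(round(t_end / dt))):
        u = control(x, r, k=k)
        x += dt * (f(x) + g(x) * u)
    return x


final_state = simulate(x0=0.0, r=1.0)
print(abs(final_state - 1.0) < 1e-3)
```

Because the cancellation is exact in this toy model, the tracking error decays geometrically. With modeling errors the cancellation is only approximate, which is the situation the abstract's integral control term is meant to handle.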
65.1 | SE | Apr 24
AdaDec: A Uncertainty-Guided Lookahead Decoding Framework for LLM-Based Code Generation
Kaifeng He, Mingwei Liu, Chong Wang et al.
Code generation with large language models (LLMs) is highly sensitive to token selection during decoding, particularly at uncertain decision points that influence program logic. While standard strategies such as greedy decoding treat all tokens uniformly, they overlook code-specific uncertainty patterns, leading to suboptimal performance. This paper presents an empirical study revealing that many generation errors stem from token ranking mistakes at high-uncertainty steps, where the correct token is present but not top-ranked. Motivated by these findings, we propose AdaDec, a lookahead-based uncertainty-guided adaptive decoding framework that integrates a token-level pause-then-rerank mechanism driven by token uncertainty. AdaDec learns model-specific uncertainty thresholds and applies a lookahead-based reranking strategy when uncertainty is high. Experiments on HumanEval+, MBPP+, and DevEval benchmarks show that AdaDec improves Pass@1 accuracy by up to 20.9% in absolute terms over greedy decoding. More importantly, it consistently outperforms both competitive baselines like Beam Search and state-of-the-art adaptive decoding methods such as AdapT, while maintaining high efficiency through selective, uncertainty-triggered pausing. Our results highlight the promise of uncertainty-aware adaptive decoding for improving both the reliability and efficiency of LLM-based code generation.
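The pause-then-rerank idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: Shannon entropy as the uncertainty measure, a fixed (rather than learned) threshold, scoring candidates by the log-probability of a short greedy rollout, and the `toy_model` lookup table are all assumptions made for the example.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Dist = Dict[str, float]
NextDist = Callable[[Sequence[str]], Dist]


def entropy(dist: Dist) -> float:
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def greedy_rollout(next_dist: NextDist, prefix: List[str], steps: int) -> float:
    """Log-probability of the greedy continuation of `prefix` over `steps` tokens."""
    seq, total = list(prefix), 0.0
    for _ in range(steps):
        dist = next_dist(seq)
        tok = max(dist, key=dist.get)
        total += math.log(dist[tok])
        seq.append(tok)
    return total


def adaptive_decode(next_dist: NextDist, prompt: List[str], max_tokens: int,
                    threshold: float, top_k: int = 3, lookahead: int = 2,
                    eos: str = "<eos>") -> Tuple[List[str], int]:
    """Greedy decoding that pauses and reranks only at high-entropy steps."""
    seq, pauses = list(prompt), 0
    for _ in range(max_tokens):
        dist = next_dist(seq)
        if entropy(dist) > threshold:
            # Pause: score each top-k candidate by its own log-prob plus a short lookahead.
            pauses += 1
            candidates = sorted(dist, key=dist.get, reverse=True)[:top_k]
            tok = max(candidates, key=lambda t: math.log(dist[t])
                      + greedy_rollout(next_dist, seq + [t], lookahead))
        else:
            tok = max(dist, key=dist.get)  # confident step: plain greedy choice
        seq.append(tok)
        if tok == eos:
            break
    return seq[len(prompt):], pauses


def toy_model(seq: Sequence[str]) -> Dist:
    """Hypothetical stand-in for an LLM: the distribution depends on the last token only."""
    table = {
        "<s>": {"x": 0.40, "y": 0.35, "z": 0.25},
        "x": {"p": 0.34, "q": 0.33, "r": 0.33},
        "y": {"ok": 0.99, "bad": 0.01},
    }
    return table.get(seq[-1], {"<eos>": 1.0})


reranked, pauses = adaptive_decode(toy_model, ["<s>"], max_tokens=5, threshold=0.5)
greedy, _ = adaptive_decode(toy_model, ["<s>"], max_tokens=5, threshold=float("inf"))
print(reranked, pauses)
print(greedy)
```

In this toy setup, greedy decoding commits to the top-ranked first token even though it leads into a low-confidence continuation, while the lookahead step promotes the second-ranked token. Only one step triggers a pause, which mirrors the selective, uncertainty-triggered cost profile the abstract describes.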