SEOct 10, 2023Code
CodeFuse-13B: A Pretrained Multi-lingual Code Large Language ModelPeng Di, Jianguo Li, Hang Yu et al.
Code Large Language Models (Code LLMs) have gained significant attention in the industry due to their wide applications in the full lifecycle of software engineering. However, the effectiveness of existing models in understanding non-English inputs for multi-lingual code-related tasks is still far from well studied. This paper introduces CodeFuse-13B, an open-sourced pre-trained code LLM. It is specifically designed for code-related tasks with both English and Chinese prompts and supports over 40 programming languages. CodeFuse achieves its effectiveness by utilizing a high quality pre-training dataset that is carefully filtered by program analyzers and optimized during the training process. Extensive experiments are conducted using real-world usage scenarios, the industry-standard benchmark HumanEval-x, and the specially designed CodeFuseEval for Chinese prompts. To assess the effectiveness of CodeFuse, we actively collected valuable human feedback from the AntGroup's software development process where CodeFuse has been successfully deployed. The results demonstrate that CodeFuse-13B achieves a HumanEval pass@1 score of 37.10%, positioning it as one of the top multi-lingual code LLMs with similar parameter sizes. In practical scenarios, such as code generation, code translation, code comments, and testcase generation, CodeFuse performs better than other models when confronted with Chinese prompts.
SEFeb 22, 2024
REPOFUSE: Repository-Level Code Completion with Fused Dual ContextMing Liang, Xiaoheng Xie, Gehao Zhang et al.
The success of language models in code assistance has spurred the proposal of repository-level code completion as a means to enhance prediction accuracy, utilizing the context from the entire codebase. However, this amplified context can inadvertently increase inference latency, potentially undermining the developer experience and deterring tool adoption - a challenge we termed the Context-Latency Conundrum. This paper introduces REPOFUSE, a pioneering solution designed to enhance repository-level code completion without the latency trade-off. REPOFUSE uniquely fuses two types of context: the analogy context, rooted in code analogies, and the rationale context, which encompasses in-depth semantic relationships. We propose a novel rank truncated generation (RTG) technique that efficiently condenses these contexts into prompts with restricted size. This enables REPOFUSE to deliver precise code completions while maintaining inference efficiency. Through testing with the CrossCodeEval suite, REPOFUSE has demonstrated a significant leap over existing models, achieving a 40.90% to 59.75% increase in exact match (EM) accuracy for code completions and a 26.8% enhancement in inference speed. Beyond experimental validation, REPOFUSE has been integrated into the workflow of a large enterprise, where it actively supports various coding tasks.
SEDec 14, 2019
Conquering the Extensional Scalability Problem for Value-Flow Analysis FrameworksQingkai Shi, Rongxin Wu, Gang Fan et al.
With an increasing number of value-flow properties to check, existing static program analysis still tends to have scalability issues when high precision is required. We observe that the key design flaw behind the scalability problem is that the core static analysis engine is oblivious of the mutual synergies among different properties being checked and, thus, inevitably loses many optimization opportunities. Our approach is inter-property-aware and able to capture possible overlaps and inconsistencies among different properties. Thus, before analyzing a program, we can make optimization plans which decide how to reuse the specific analysis results of a property to speed up checking other properties. Such a synergistic interaction among the properties significantly improves the analysis performance. We have evaluated our approach by checking twenty value-flow properties in standard benchmark programs and ten real-world software systems. The results demonstrate that our approach is more than 8x faster than existing ones but consumes only 1/7 memory. Such a substantial improvement in analysis efficiency is not achieved by sacrificing the effectiveness: at the time of writing, 39 bugs found by our approach have been fixed by developers and four of them have been assigned CVE IDs due to their security impact.