Yuanliang Zhang

SE
h-index13
7papers
211citations
Novelty49%
AI Score35

7 Papers

CLSep 28, 2023Code
At Which Training Stage Does Code Data Help LLMs Reasoning?

Yingwei Ma, Yue Liu, Yue Yu et al.

Large Language Models (LLMs) have exhibited remarkable reasoning capabilities and become the foundation of language technologies. Inspired by the great success of code data in training LLMs, we naturally wonder at which training stage introducing code data can really help LLMs reasoning. To this end, this paper systematically explores the impact of code data on LLMs at different stages. Concretely, we introduce the code data at the pre-training stage, instruction-tuning stage, and both of them, respectively. Then, the reasoning capability of LLMs is comprehensively and fairly evaluated via six reasoning tasks in five domains. We critically analyze the experimental results and provide conclusions with insights. First, pre-training LLMs with the mixture of code and text can significantly enhance LLMs' general reasoning capability almost without negative transfer on other tasks. Besides, at the instruction-tuning stage, code data endows LLMs the task-specific reasoning capability. Moreover, the dynamic mixing strategy of code and text data assists LLMs to learn reasoning capability step-by-step during training. These insights deepen the understanding of LLMs regarding reasoning ability for their application, such as scientific question answering, legal support, etc. The source code and model parameters are released at the link:~\url{https://github.com/yingweima2022/CodeLLM}.

CLOct 16, 2023Code
Bridging Code Semantic and LLMs: Semantic Chain-of-Thought Prompting for Code Generation

Yingwei Ma, Yue Yu, Shanshan Li et al.

Large language models (LLMs) have showcased remarkable prowess in code generation. However, automated code generation is still challenging since it requires a high-level semantic mapping between natural language requirements and codes. Most existing LLMs-based approaches for code generation rely on decoder-only causal language models often treate codes merely as plain text tokens, i.e., feeding the requirements as a prompt input, and outputing code as flat sequence of tokens, potentially missing the rich semantic features inherent in source code. To bridge this gap, this paper proposes the "Semantic Chain-of-Thought" approach to intruduce semantic information of code, named SeCoT. Our motivation is that the semantic information of the source code (\eg data flow and control flow) describes more precise program execution behavior, intention and function. By guiding LLM consider and integrate semantic information, we can achieve a more granular understanding and representation of code, enhancing code generation accuracy. Meanwhile, while traditional techniques leveraging such semantic information require complex static or dynamic code analysis to obtain features such as data flow and control flow, SeCoT demonstrates that this process can be fully automated via the intrinsic capabilities of LLMs (i.e., in-context learning), while being generalizable and applicable to challenging domains. While SeCoT can be applied with different LLMs, this paper focuses on the powerful GPT-style models: ChatGPT(close-source model) and WizardCoder(open-source model). The experimental study on three popular DL benchmarks (i.e., HumanEval, HumanEval-ET and MBPP) shows that SeCoT can achieves state-of-the-art performance, greatly improving the potential for large models and code generation.

IRAug 24, 2022
Scenario-Adaptive and Self-Supervised Model for Multi-Scenario Personalized Recommendation

Yuanliang Zhang, Xiaofeng Wang, Jinxin Hu et al.

Multi-scenario recommendation is dedicated to retrieve relevant items for users in multiple scenarios, which is ubiquitous in industrial recommendation systems. These scenarios enjoy portions of overlaps in users and items, while the distribution of different scenarios is different. The key point of multi-scenario modeling is to efficiently maximize the use of whole-scenario information and granularly generate adaptive representations both for users and items among multiple scenarios. we summarize three practical challenges which are not well solved for multi-scenario modeling: (1) Lacking of fine-grained and decoupled information transfer controls among multiple scenarios. (2) Insufficient exploitation of entire space samples. (3) Item's multi-scenario representation disentanglement problem. In this paper, we propose a Scenario-Adaptive and Self-Supervised (SASS) model to solve the three challenges mentioned above. Specifically, we design a Multi-Layer Scenario Adaptive Transfer (ML-SAT) module with scenario-adaptive gate units to select and fuse effective transfer information from whole scenario to individual scenario in a quite fine-grained and decoupled way. To sufficiently exploit the power of entire space samples, a two-stage training process including pre-training and fine-tune is introduced. The pre-training stage is based on a scenario-supervised contrastive learning task with the training samples drawn from labeled and unlabeled data spaces. The model is created symmetrically both in user side and item side, so that we can get distinguishing representations of items in different scenarios. Extensive experimental results on public and industrial datasets demonstrate the superiority of the SASS model over state-of-the-art methods. This model also achieves more than 8.0% improvement on Average Watching Time Per User in online A/B tests.

SEMar 22, 2021Code
ConfInLog: Leveraging Software Logs to Infer Configuration Constraints

Shulin Zhou, Xiaodong Liu, Shanshan Li et al.

Misconfigurations have become the dominant causes of software failures in recent years, drawing tremendous attention for their increasing prevalence and severity. Configuration constraints can preemptively avoid misconfiguration by defining the conditions that configuration options should satisfy. Documentation is the main source of configuration constraints, but it might be incomplete or inconsistent with the source code. In this regard, prior researches have focused on obtaining configuration constraints from software source code through static analysis. However, the difficulty in pointer analysis and context comprehension prevents them from collecting accurate and comprehensive constraints. In this paper, we observed that software logs often contain configuration constraints. We conducted an empirical study and summarized patterns of configuration-related log messages. Guided by the study, we designed and implemented ConfInLog, a static tool to infer configuration constraints from log messages. ConfInLog first selects configuration-related log messages from source code by using the summarized patterns, then infers constraints from log messages based on the summarized natural language patterns. To evaluate the effectiveness of ConfInLog, we applied our tool on seven popular open-source software systems. ConfInLog successfully inferred 22 to 163 constraints, in which 59.5% to 61.6% could not be inferred by the state-of-the-art work. Finally, we submitted 67 documentation patches regarding the constraints inferred by ConfInLog. The constraints in 29 patches have been confirmed by the developers, among which 10 patches have been accepted.

SEFeb 14, 2021Code
An Evolutionary Study of Configuration Design and Implementation in Cloud Systems

Yuanliang Zhang, Haochen He, Owolabi Legunsen et al.

Many techniques were proposed for detecting software misconfigurations in cloud systems and for diagnosing unintended behavior caused by such misconfigurations. Detection and diagnosis are steps in the right direction: misconfigurations cause many costly failures and severe performance issues. But, we argue that continued focus on detection and diagnosis is symptomatic of a more serious problem: configuration design and implementation are not yet first-class software engineering endeavors in cloud systems. Little is known about how and why developers evolve configuration design and implementation, and the challenges that they face in doing so. This paper presents a source-code level study of the evolution of configuration design and implementation in cloud systems. Our goal is to understand the rationale and developer practices for revising initial configuration design/implementation decisions, especially in response to consequences of misconfigurations. To this end, we studied 1178 configuration-related commits from a 2.5 year version-control history of four large-scale, actively-maintained open-source cloud systems (HDFS, HBase, Spark, and Cassandra). We derive new insights into the software configuration engineering process. Our results motivate new techniques for proactively reducing misconfigurations by improving the configuration design and implementation process in cloud systems. We highlight a number of future research directions.

SEDec 11, 2024
Unseen Horizons: Unveiling the Real Capability of LLM Code Generation Beyond the Familiar

Yuanliang Zhang, Yifan Xie, Shanshan Li et al.

Recently, large language models (LLMs) have shown strong potential in code generation tasks. However, there are still gaps before they can be fully applied in actual software development processes. Accurately assessing the code generation capabilities of large language models has become an important basis for evaluating and improving the models. Some existing works have constructed datasets to evaluate the capabilities of these models. However, the current evaluation process may encounter the illusion of "Specialist in Familiarity", primarily due to three gaps: the exposure of target code, case timeliness, and dependency availability. The fundamental reason for these gaps is that the code in current datasets may have been extensively exposed and exercised during the training phase, and due to the continuous training and development of LLM, their timeliness has been severely compromised. The key to solve the problem is to, as much as possible, evaluate the LLMs using code that they have not encountered before. Thus, the fundamental idea in this paper is to draw on the concept of code obfuscation, changing code at different levels while ensuring the functionality and output. To this end, we build a code-obfuscation based benchmark OBFUSEVAL. We first collect 1,354 raw cases from five real-world projects, including function description and code. Then we use three-level strategy (symbol, structure and semantic) to obfuscate descriptions, code and context dependencies. We evaluate four LLMs on OBFU- SEVAL and compared the effectiveness of different obfuscation strategy. We use official test suites of these projects to evaluate the generated code. The results show that after obfuscation, the average decrease ratio of test pass rate can up to 62.5%.

SESep 27, 2021
Clone-based code method usage pattern mining

Zhipeng Xue, Yuanliang Zhang, Rulin Xu

When programmers retrieve a code method and want to reuse it, they need to understand the usage patterns of the retrieved method. However, it is difficult to obtain usage information of the retrieved method since this method may only have a brief comment and few available usage examples. In this paper, we propose an approach, called LUPIN (cLone-based Usage Pattern mIniNg), to mine the usage patterns of these methods, which do not widely appeared in the code repository. The key idea of LUPIN is that the cloned code of the target method may have a similar usage pattern, and we can collect more usage information of the target method from cloned code usage examples. From the amplified usage examples, we mine the usage pattern of the target method by frequent subsequence mining after program slicing and code normalization. Our evaluation shows that LUPIN can mine four categories of usage patterns with an average precision of 0.65.