Xiaojun Zhang

CL
6papers
88citations
Novelty41%
AI Score40

6 Papers

SEMay 6Code
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

Kaifeng He, Xiaojun Zhang, Peiliang Cai et al.

Large language models (LLMs) frequently generate defective outputs in code generation tasks, ranging from logical bugs to security vulnerabilities. While these generation failures are often treated as model-level limitations, empirical evidence increasingly traces their root causes to imperfections within the training corpora. Yet, the specific mechanisms linking training data quality issues to generated code quality issues remain largely unmapped. This paper presents a systematic literature review of 114 primary studies to investigate how training data quality issues propagate into code generation. We establish a unified taxonomy that categorizes generated code quality issues across nine dimensions and training data quality issues into code and non-code attributes. Based on this taxonomy, we formalize a causal framework detailing 18 typical propagation mapping mechanisms. Furthermore, we synthesize state-of-the-art detection and mitigation techniques across the data, model, and generation lifecycles. The reviewed literature reveals a clear methodological shift: quality assurance is transitioning from reactive, heuristic-based post-generation filtering toward proactive, data-centric governance and closed-loop repair. Finally, we identify open challenges and outline research directions for developing reliable LLMs for code through integrated data curation and continuous evaluation. Our repository is available at https://github.com/SYSUSELab/From-Data-to-Code.

CVJun 29, 2023
Alternative Telescopic Displacement: An Efficient Multimodal Alignment Method

Jiahao Qin, Yitao Xu, Zong Lu et al.

In the realm of multimodal data integration, feature alignment plays a pivotal role. This paper introduces an innovative approach to feature alignment that revolutionizes the fusion of multimodal information. Our method employs a novel iterative process of telescopic displacement and expansion of feature representations across different modalities, culminating in a coherent unified representation within a shared feature space. This sophisticated technique demonstrates a remarkable ability to capture and leverage complex crossmodal interactions at the highest levels of abstraction. As a result, we observe significant enhancements in the performance of multimodal learning tasks. Through rigorous comparative analysis, we establish the superiority of our approach over existing multimodal fusion paradigms across a diverse array of applications. Comprehensive empirical evaluations conducted on multifaceted datasets encompassing temporal sequences, visual data, and textual information provide compelling evidence that our method achieves unprecedented benchmarks in the field. This work not only advances the state of the art in multimodal learning but also opens new avenues for exploring the synergies between disparate data modalities in complex analytical scenarios.

CLJan 18, 2024
Gradable ChatGPT Translation Evaluation

Hui Jiao, Bei Peng, Lu Zong et al.

ChatGPT, as a language model based on large-scale pre-training, has exerted a profound influence on the domain of machine translation. In ChatGPT, a "Prompt" refers to a segment of text or instruction employed to steer the model towards generating a specific category of response. The design of the translation prompt emerges as a key aspect that can wield influence over factors such as the style, precision and accuracy of the translation to a certain extent. However, there is a lack of a common standard and methodology on how to design and select a translation prompt. Accordingly, this paper proposes a generic taxonomy, which defines gradable translation prompts in terms of expression type, translation style, POS information and explicit statement, thus facilitating the construction of prompts endowed with distinct attributes tailored for various translation tasks. Specific experiments and cases are selected to validate and illustrate the effectiveness of the method.

CLNov 12, 2016
Semi-automatic Simultaneous Interpreting Quality Evaluation

Xiaojun Zhang

Increasing interpreting needs a more objective and automatic measurement. We hold a basic idea that 'translating means translating meaning' in that we can assessment interpretation quality by comparing the meaning of the interpreting output with the source input. That is, a translation unit of a 'chunk' named Frame which comes from frame semantics and its components named Frame Elements (FEs) which comes from Frame Net are proposed to explore their matching rate between target and source texts. A case study in this paper verifies the usability of semi-automatic graded semantic-scoring measurement for human simultaneous interpreting and shows how to use frame and FE matches to score. Experiments results show that the semantic-scoring metrics have a significantly correlation coefficient with human judgment.

CLMay 22, 2016
Automatic Construction of Discourse Corpora for Dialogue Translation

Longyue Wang, Xiaojun Zhang, Zhaopeng Tu et al.

In this paper, a novel approach is proposed to automatically construct parallel discourse corpus for dialogue machine translation. Firstly, the parallel subtitle data and its corresponding monolingual movie script data are crawled and collected from Internet. Then tags such as speaker and discourse boundary from the script data are projected to its subtitle data via an information retrieval approach in order to map monolingual discourse to bilingual texts. We not only evaluate the mapping results, but also integrate speaker information into the translation. Experiments show our proposed method can achieve 81.79% and 98.64% accuracy on speaker and dialogue boundary annotation, and speaker-based language model adaptation can obtain around 0.5 BLEU points improvement in translation qualities. Finally, we publicly release around 100K parallel discourse data with manual speaker and dialogue boundary annotation.

CLApr 21, 2016
A Novel Approach to Dropped Pronoun Translation

Longyue Wang, Zhaopeng Tu, Xiaojun Zhang et al.

Dropped Pronouns (DP) in which pronouns are frequently dropped in the source language but should be retained in the target language are challenge in machine translation. In response to this problem, we propose a semi-supervised approach to recall possibly missing pronouns in the translation. Firstly, we build training data for DP generation in which the DPs are automatically labelled according to the alignment information from a parallel corpus. Secondly, we build a deep learning-based DP generator for input sentences in decoding when no corresponding references exist. More specifically, the generation is two-phase: (1) DP position detection, which is modeled as a sequential labelling task with recurrent neural networks; and (2) DP prediction, which employs a multilayer perceptron with rich features. Finally, we integrate the above outputs into our translation system to recall missing pronouns by both extracting rules from the DP-labelled training data and translating the DP-generated input sentences. Experimental results show that our approach achieves a significant improvement of 1.58 BLEU points in translation performance with 66% F-score for DP generation accuracy.