CL | AI | Apr 2, 2023

Better Language Models of Code through Self-Improvement

arXiv:2304.01228v2226 citations | h-index: 51
Originality: Incremental advance
AI Analysis

This work addresses data scarcity in fine-tuning code language models. The contribution is incremental, as it builds on existing models such as CodeT5.

The paper tackles limited supervision when fine-tuning pre-trained language models for code. It proposes a data augmentation framework that generates pseudo data from the knowledge acquired during pre-training and fine-tuning. The authors report significant performance improvements on the code summarization and code generation tasks of the CodeXGLUE benchmark.

Pre-trained language models for code (PLMCs) have gained attention in recent research. These models are pre-trained on large-scale datasets using multi-modal objectives. However, fine-tuning them requires extensive supervision and is limited by the size of the dataset provided. We aim to address this issue by proposing a simple data augmentation framework. Our framework utilizes knowledge gained during the pre-training and fine-tuning stages to generate pseudo data, which is then used as training data for the next step. We incorporate this framework into state-of-the-art language models, such as CodeT5, CodeBERT, and UniXcoder. The results show that our framework significantly improves PLMCs' performance on code-related sequence generation tasks, such as code summarization and code generation, on the CodeXGLUE benchmark.
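
The loop described in the abstract (fine-tune, generate pseudo data, train on it in the next step) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `self_improve`, `toy_fine_tune`, and `base_model` are hypothetical, a "model" is reduced to a plain function from source text to target text, and the choice to train the next step on pseudo pairs alone (rather than mixing them with the original labels) is an assumption made for brevity.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]        # (source, target), e.g. (code, summary)
Model = Callable[[str], str]  # maps a source string to a predicted target
FineTune = Callable[[Model, List[Pair]], Model]


def self_improve(model: Model, fine_tune: FineTune,
                 train_pairs: List[Pair]) -> Tuple[Model, List[Pair]]:
    """One round of pseudo-data augmentation (illustrative sketch only)."""
    # Step 1: ordinary fine-tuning on the labelled pairs.
    tuned = fine_tune(model, train_pairs)
    # Step 2: the fine-tuned model predicts targets for its own training
    # inputs; these predictions form the pseudo data.
    pseudo_pairs = [(src, tuned(src)) for src, _ in train_pairs]
    # Step 3: the pseudo data is the training data for the next step.
    # Assumption: pseudo pairs are used on their own here.
    improved = fine_tune(tuned, pseudo_pairs)
    return improved, pseudo_pairs


def toy_fine_tune(model: Model, pairs: List[Pair]) -> Model:
    """Hypothetical stand-in for real training: memorise the pairs and
    fall back to the previous model for unseen inputs."""
    table = dict(pairs)
    return lambda src: table.get(src, model(src))


def base_model(src: str) -> str:
    """Hypothetical 'pre-trained' model that knows nothing yet."""
    return ""


if __name__ == "__main__":
    data = [("def add(a, b): return a + b", "Add two numbers.")]
    improved, pseudo = self_improve(base_model, toy_fine_tune, data)
    print(pseudo)
    print(improved(data[0][0]))
```

In a real setting, `fine_tune` would wrap gradient-based training of a model such as CodeT5, and step 2 would use the model's decoding procedure; how candidate outputs are chosen for the pseudo data is not specified in this summary and is left out of the sketch.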

Code Implementations: 1 repo
