CL · Jan 19, 2024

Structured Code Representations Enable Data-Efficient Adaptation of Code Language Models

Mayank Agarwal, Yikang Shen, Bailin Wang, Yoon Kim, Jie Chen

arXiv:2401.10716v1 · 3.48 citations

Originality: Incremental advance

AI Analysis

This work addresses data-efficient adaptation of code language models. The advance is incremental: it builds on existing pre-training methods and makes no architectural changes.

The paper tackles the problem of adapting pre-trained code language models more efficiently by incorporating program structure, specifically parse trees, and finds that this approach improves over plain-text baselines, especially when training data is limited.

Current language models tailored for code tasks often adopt the pre-training-then-fine-tuning paradigm from natural language processing, modeling source code as plain text. This approach, however, overlooks the unambiguous structures inherent in programming languages. In this work, we explore data-efficient adaptation of pre-trained code models by further pre-training and fine-tuning them with program structures. Specifically, we represent programs as parse trees -- also known as concrete syntax trees (CSTs) -- and adapt pre-trained models on serialized CSTs. Although the models that we adapt have been pre-trained only on the surface form of programs, we find that a small amount of continual pre-training and fine-tuning on CSTs without changing the model architecture yields improvements over the baseline approach across various code tasks. The improvements are found to be particularly significant when there are limited training examples, demonstrating the effectiveness of integrating program structures with plain-text representation even when working with backbone models that have not been pre-trained with structures.
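To make the idea of a "serialized CST" concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the toy arithmetic grammar, the `Node` class, the helper names, and the tag-style serialization format are all assumptions chosen for illustration, and the abstract does not say which parser or format the paper uses. The sketch only shows how a concrete syntax tree keeps every surface token (including parentheses) and can be flattened into a single token sequence that an unchanged sequence model could consume.

```python
import re
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """A CST node: a grammar label plus child nodes or raw token strings."""
    label: str
    children: List[Union["Node", str]] = field(default_factory=list)


# Hypothetical toy grammar (illustration only):
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | NAME | '(' expr ')'
TOKEN_RE = re.compile(r"\s*(\d+|[A-Za-z_]\w*|[()+\-*/])")


def tokenize(src: str) -> List[str]:
    src = src.rstrip()
    tokens, pos = [], 0
    while pos < len(src):
        match = TOKEN_RE.match(src, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens: List[str]):
        self.tokens = tokens
        self.i = 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def take(self) -> str:
        tok = self.peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        self.i += 1
        return tok

    def expr(self) -> Node:
        node = Node("expr", [self.term()])
        while self.peek() in ("+", "-"):
            node.children.append(self.take())
            node.children.append(self.term())
        return node

    def term(self) -> Node:
        node = Node("term", [self.factor()])
        while self.peek() in ("*", "/"):
            node.children.append(self.take())
            node.children.append(self.factor())
        return node

    def factor(self) -> Node:
        tok = self.take()
        if tok == "(":
            inner = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            # A CST keeps the parentheses; an AST would drop them.
            return Node("factor", ["(", inner, ")"])
        if tok in (")", "+", "-", "*", "/"):
            raise SyntaxError(f"unexpected token {tok!r}")
        return Node("factor", [tok])


def parse(src: str) -> Node:
    parser = Parser(tokenize(src))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree


def serialize(node: Union[Node, str]) -> str:
    """Flatten the tree into one string with open/close tags per node."""
    if isinstance(node, str):
        return node
    inner = " ".join(serialize(child) for child in node.children)
    return f"<{node.label}> {inner} </{node.label}>"


def leaves(node: Union[Node, str]) -> List[str]:
    """Recover the surface tokens, showing the CST loses no source text."""
    if isinstance(node, str):
        return [node]
    return [tok for child in node.children for tok in leaves(child)]


if __name__ == "__main__":
    print(serialize(parse("a * (b + 1)")))
```

Because the leaves of the tree are exactly the original tokens in order, the serialized form interleaves the plain-text program with structural markers. That is one plausible reading of how structure can be combined with the surface form without changing the model architecture.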
