SE | LG | Dec 20, 2023

CodeLL: A Lifelong Learning Dataset to Support the Co-Evolution of Data and Language Models of Code

arXiv:2312.12492v1 | 1 citation | h-index: 43 | Has Code | MSR
Originality: Synthesis-oriented
AI Analysis

This dataset addresses a gap for researchers studying lifelong learning in code language models by providing temporal data on code evolution.

The authors introduced CodeLL, a lifelong learning dataset for code changes that captures the entire release history of 71 open-source projects, enabling analysis of 2,483 releases at method and API levels to support research on language models in lifelong fine-tuning settings.

Motivated by recent work on lifelong learning applications for language models (LMs) of code, we introduce CodeLL, a lifelong learning dataset focused on code changes. Our contribution addresses a notable research gap: existing code change datasets lack a long-term temporal dimension, which limits their suitability for lifelong learning scenarios. In contrast, our dataset aims to comprehensively capture code changes across the entire release history of open-source software repositories. In this work, we introduce an initial version of CodeLL, comprising 71 machine-learning-based projects mined from Software Heritage. This dataset enables the extraction and in-depth analysis of code changes spanning 2,483 releases at both the method and API levels. CodeLL enables researchers to study the behaviour of LMs in lifelong fine-tuning settings for learning code changes. Additionally, the dataset can help in studying data distribution shifts within software repositories and the evolution of API usages over time.
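To make the idea of API-level change analysis across releases concrete, here is a minimal sketch. It does not use CodeLL's actual schema or tooling, which this page does not describe. The two release snippets, and the helper names `api_calls` and `api_diff`, are hypothetical illustrations. The sketch parses Python source with the standard `ast` module and reports which dotted call names appear or disappear between two releases.

```python
import ast
from collections import Counter

# Hypothetical toy "releases" of one file; not taken from CodeLL.
RELEASE_V1 = "import numpy as np\nx = np.array([1, 2])\ny = np.asscalar(x[0])\n"
RELEASE_V2 = "import numpy as np\nx = np.array([1, 2])\ny = np.squeeze(x)\n"


def _dotted_name(node):
    """Return a dotted name such as 'np.array' for a call target, or None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None  # target is not a plain name chain (e.g. a subscript)


def api_calls(source: str) -> Counter:
    """Count dotted call names in one source string (parsed, never executed)."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = _dotted_name(node.func)
            if name:
                counts[name] += 1
    return counts


def api_diff(old_source: str, new_source: str) -> dict:
    """List call names added and removed between two releases."""
    old_calls, new_calls = api_calls(old_source), api_calls(new_source)
    return {
        "added": sorted(set(new_calls) - set(old_calls)),
        "removed": sorted(set(old_calls) - set(new_calls)),
    }


if __name__ == "__main__":
    print(api_diff(RELEASE_V1, RELEASE_V2))
```

Applying the same comparison to each pair of consecutive releases would give a simple time series of API additions and removals, which is one rough way to look at the evolution of API usages the abstract mentions.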

Code Implementations: 1 repo