CL | AI | Nov 20, 2022

The Stack: 3 TB of permissively licensed source code

Hugging Face
arXiv:2211.15533v1 | 462 citations | h-index: 39
Originality: Synthesis-oriented
AI Analysis

The paper addresses the shortage of open, permissively licensed code datasets available to AI researchers, which supports more responsible and accessible development of code-generating models. Its contribution is incremental: it provides a new dataset rather than a novel method.

To meet the need for large, permissively licensed datasets for training LLMs on code, the researchers introduce The Stack, a 3.1 TB dataset covering 30 programming languages, and show that models trained only on this permissively licensed data can match previously reported performance on the HumanEval and MBPP benchmarks.

Large Language Models (LLMs) play an ever-increasing role in the field of Artificial Intelligence (AI)--not only for natural language processing but also for code understanding and generation. To stimulate open and responsible research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. We describe how we collect the full dataset, construct a permissively licensed subset, present a data governance plan, discuss limitations, and show promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that (1) near-deduplicating the data significantly boosts performance across all experiments, and (2) it is possible to match previously reported HumanEval and MBPP performance using only permissively licensed data. We make the dataset available at https://hf.co/BigCode, provide a tool called "Am I in The Stack" (https://hf.co/spaces/bigcode/in-the-stack) for developers to search The Stack for copies of their code, and provide a process for code to be removed from the dataset by following the instructions at https://www.bigcode-project.org/docs/about/the-stack/.
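The near-deduplication finding in the abstract can be made concrete with a small sketch. This is not the authors' pipeline: the abstract does not specify how near-duplicates are detected, so the example below assumes a simple Jaccard similarity over token shingles. The function names, the 5-token shingle size, and the 0.7 threshold are all illustrative assumptions. It compares every file against every kept file, which is quadratic; pipelines at dataset scale typically approximate the same similarity with MinHash and locality-sensitive hashing.

```python
import re


def shingles(source, k=5):
    """Return the set of k-token shingles for one source file.

    Tokens are runs of word characters, so whitespace and punctuation
    differences between files are ignored (an illustrative choice).
    """
    tokens = re.findall(r"\w+", source)
    if len(tokens) < k:
        # Short files get a single shingle holding all of their tokens.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_deduplicate(files, threshold=0.7, k=5):
    """Keep the first file of every group of near-duplicates.

    A file is dropped when its shingle set has Jaccard similarity of at
    least `threshold` with a file that was already kept.
    """
    kept = []
    kept_shingles = []
    for text in files:
        current = shingles(text, k)
        if any(jaccard(current, other) >= threshold for other in kept_shingles):
            continue  # near-duplicate of an earlier file
        kept.append(text)
        kept_shingles.append(current)
    return kept
```

For example, two copies of the same function that differ only in whitespace share all of their shingles, so only the first copy is kept, while an unrelated file passes through unchanged.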

