LG · AI · IR · SE · Jun 17, 2024

Long Code Arena: a Set of Benchmarks for Long-Context Code Models

arXiv:2406.11612v1 · 59 citations · Has Code
Originality: Synthesis-oriented
AI Analysis

This paper addresses a gap for researchers and developers in code and natural language processing by offering new benchmarks for long-context code models, though the contribution is incremental because it builds on existing benchmark concepts.

The authors tackled the shortage of benchmarks for code processing with long contexts by introducing Long Code Arena, a suite of six benchmarks for project-wide code tasks, providing datasets, evaluation suites, and baseline solutions.

Nowadays, the fields of code and natural language processing are evolving rapidly. In particular, models are becoming better at processing long context windows: supported context sizes have increased by orders of magnitude over the last few years. However, there is a shortage of benchmarks for code processing that go beyond a single file of context, and the most popular ones are limited to a single method. With this work, we aim to close this gap by introducing Long Code Arena, a suite of six benchmarks for code processing tasks that require project-wide context. These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization. For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions based on popular LLMs to showcase the usage of the dataset and to simplify adoption by other researchers. We publish the benchmark page on HuggingFace Spaces with the leaderboard, links to HuggingFace Hub for all the datasets, and a link to the GitHub repository with baselines: https://huggingface.co/spaces/JetBrains-Research/long-code-arena.

Code Implementations: 1 repo
