SEAI · Dec 19, 2024

CodeRepoQA: A Large-scale Benchmark for Software Engineering Question Answering

arXiv:2412.14764v1 · 10 citations · h-index: 8 · Has Code
Originality: Synthesis-oriented
AI Analysis

CodeRepoQA gives researchers and practitioners in software engineering a new benchmark for evaluating AI models. The contribution is incremental, however: it builds on existing QA datasets, differing mainly in its focus on repository-level scenarios.

The authors address the lack of a large-scale benchmark for repository-level question answering in software engineering by introducing CodeRepoQA, a dataset of 585,687 entries across five programming languages. They find that large language models still have limitations in this domain and that the models perform better with medium-length contexts.

In this work, we introduce CodeRepoQA, a large-scale benchmark specifically designed for evaluating repository-level question-answering capabilities in the field of software engineering. CodeRepoQA encompasses five programming languages and covers a wide range of scenarios, enabling comprehensive evaluation of language models. To construct this dataset, we crawl data from 30 well-known repositories on GitHub, the largest platform for hosting and collaborating on code, and carefully filter raw data. In total, CodeRepoQA is a multi-turn question-answering benchmark with 585,687 entries, covering a diverse array of software engineering scenarios, with an average of 6.62 dialogue turns per entry. We evaluate ten popular large language models on our dataset and provide in-depth analysis. We find that LLMs still have limitations in question-answering capabilities in the field of software engineering, and medium-length contexts are more conducive to LLMs' performance. The entire benchmark is publicly available at https://github.com/kinesiatricssxilm14/CodeRepoQA.

