Turning the Tide: Repository-based Code Reflection
This work addresses a gap in code reflection benchmarks for repository contexts, which is incremental but important for developers and researchers focusing on real-world code maintenance.
The paper tackles the problem of evaluating and improving code large language models (LLMs) for modifying code in multi-file repositories, introducing LiveRepoReflection, a benchmark with 1,888 test cases across 6 languages, and RepoReflectionCoder, a model trained on a new instruction-tuning dataset, with a leaderboard evaluating over 40 LLMs.
Code large language models (LLMs) enhance programming by understanding and generating code across languages, offering intelligent feedback, bug detection, and code updates through reflection, improving development efficiency and accessibility. While benchmarks (e.g. HumanEval/LiveCodeBench) evaluate code generation and real-world relevance, previous works ignore the scenario of modifying code in repositories. Considering challenges remaining in improving reflection capabilities and avoiding data contamination in dynamic benchmarks, we introduce LiveRepoReflection, a challenging benchmark for evaluating code understanding and generation in multi-file repository contexts, featuring 1,888 rigorously filtered test cases across $6$ programming languages to ensure diversity, correctness, and high difficulty. Further, we create RepoReflection-Instruct, a large-scale, quality-filtered instruction-tuning dataset derived from diverse sources, used to train RepoReflectionCoder through a two-turn dialogue process involving code generation and error-driven repair. The leaderboard evaluates over 40 LLMs to reflect the model performance of repository-based code reflection.