CODEMENV: Benchmarking Large Language Models on Code Migration
This work addresses the need for better evaluation of LLMs on code migration, a task that matters to software engineers. Its contribution is incremental, however: it introduces a benchmark rather than a new method.
The authors address the understudied question of how well large language models (LLMs) perform code migration by introducing CODEMENV, a benchmark of 922 examples across 19 Python and Java packages. Across the seven LLMs evaluated, the average pass@1 rate is 26.50%, with GPT-4O scoring highest at 43.84%.
Large language models (LLMs) have shown remarkable capabilities across various software engineering tasks; however, their effectiveness in code migration, adapting code to run in different environments, remains insufficiently studied. In this work, we introduce CODEMENV: Code Migration Across Environment, a new benchmark specifically designed to assess LLMs' abilities in code migration scenarios. CODEMENV consists of 922 examples spanning 19 Python and Java packages, and covers three core tasks: (1) identifying functions incompatible with specific versions, (2) detecting changes in function definitions, and (3) adapting code to target environments. Experimental evaluation with seven LLMs on CODEMENV yields an average pass@1 rate of 26.50%, with GPT-4O achieving the highest score at 43.84%. Key findings include: (i) LLMs tend to be more proficient with newer function versions, which aids in migrating legacy code, and (ii) LLMs sometimes exhibit logical inconsistencies by identifying function changes irrelevant to the intended migration environment. The datasets are available at https://github.com/xdshen-ai/Benchmark-of-Code-Migration.
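The abstract reports pass@1 without stating how it is computed. As a minimal sketch, assuming the widely used unbiased pass@k estimator (n samples per problem, c of which pass the tests), the metric could be calculated as below. The function names `pass_at_k` and `mean_pass_at_k` and the `(n, c)` input format are illustrative assumptions, not taken from the CODEMENV code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: number that pass all tests,
    k: attempt budget. Assumed formula: 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    if not results:
        raise ValueError("results must not be empty")
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical outcomes for four problems, one sample each.
    toy_results = [(1, 1), (1, 0), (1, 0), (1, 0)]
    print(f"pass@1 = {mean_pass_at_k(toy_results, k=1):.2%}")
```

With a single sample per problem, pass@1 reduces to the fraction of problems whose one generated solution passes, which is the simplest reading of the reported percentages.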