CL AI SE | Feb 23, 2025

CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale

Chenlong Wang, Zhaoyang Chu, Zhengxiang Cheng, Xuyi Yang, Kaiyue Qiu, Yao Wan, Zhou Zhao, Xuanhua Shi, Dongping Chen

arXiv:2502.16645v2 | 14.710 citations | h-index: 14 | Has Code | ICML

Originality: Incremental advance

AI Analysis

The paper addresses a critical issue for software engineering practitioners who use LLMs: it exposes a significant limitation of current models in tracking API changes and provides a benchmark against which future improvements can be measured. The contribution itself is incremental.

The paper tackles the problem of large language models (LLMs) struggling to adapt to evolving code knowledge, particularly third-party library API updates. It introduces CODESYNCBENCH, a benchmark with 3,300 test cases covering 220 APIs from six Python libraries. Experiments on 14 state-of-the-art LLMs show that they fail to handle dynamic code evolution effectively.

Large Language Models (LLMs) have exhibited exceptional performance in software engineering yet face challenges in adapting to continually evolving code knowledge, particularly regarding the frequent updates of third-party library APIs. This limitation, stemming from static pre-training datasets, often results in non-executable code or implementations with suboptimal safety and efficiency. To this end, this paper introduces CODESYNC, a data engine for identifying outdated code patterns and collecting real-time code knowledge updates from Python third-party libraries. Building upon CODESYNC, we develop CODESYNCBENCH, a comprehensive benchmark for assessing LLMs' ability to stay synchronized with code evolution, which covers real-world updates for 220 APIs from six Python libraries. Our benchmark offers 3,300 test cases across three evaluation tasks and an update-aware instruction tuning dataset consisting of 2,200 training samples. Extensive experiments on 14 state-of-the-art LLMs reveal that they struggle with dynamic code evolution, even with the support of advanced knowledge updating methods (e.g., DPO, ORPO, and SimPO). We believe that our benchmark can offer a strong foundation for the development of more effective methods for real-time code knowledge updating in the future. The experimental code and dataset are publicly available at: https://github.com/Lucky-voyage/Code-Sync.
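To make the failure mode in the abstract concrete, the following minimal sketch shows how an API update can turn a previously valid call into non-executable code, and how such a change might be detected by comparing signatures. It is an illustration only, not the CODESYNC pipeline: the functions `resize_v1` and `resize_v2` and the helpers `diff_signatures` and `call_is_valid` are hypothetical names invented for this example.

```python
import inspect

# Hypothetical "legacy" and "updated" versions of one library function.
# These stand in for a real third-party API before and after an update.
def resize_v1(image, size, interp="bilinear"):
    return ("v1", size, interp)

def resize_v2(image, size, *, interpolation="bilinear", antialias=True):
    return ("v2", size, interpolation, antialias)

def diff_signatures(old, new):
    """Return (removed, added) parameter names between two signatures."""
    old_params = inspect.signature(old).parameters
    new_params = inspect.signature(new).parameters
    removed = sorted(set(old_params) - set(new_params))
    added = sorted(set(new_params) - set(old_params))
    return removed, added

def call_is_valid(func, *args, **kwargs):
    """Check whether a call binds to func's signature without executing it."""
    try:
        inspect.signature(func).bind(*args, **kwargs)
        return True
    except TypeError:
        return False

removed, added = diff_signatures(resize_v1, resize_v2)

# A call written against the legacy API no longer binds to the updated one.
legacy_ok = call_is_valid(resize_v2, "img", (64, 64), interp="nearest")
updated_ok = call_is_valid(resize_v2, "img", (64, 64), interpolation="nearest")

print(removed, added, legacy_ok, updated_ok)
```

In this toy case the renamed keyword makes the legacy invocation fail at call time, which is the kind of outdated pattern a model trained on static data may keep producing.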
