CL | Feb 19, 2025

From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions

Nathanaël Carraz Rakotonirina, Mohammed Hamdy, Jon Ander Campos, Lucas Weber, Alberto Testoni, Marzieh Fadaee, Sandro Pezzelle, Marco Del Tredici

arXiv:2502.13791v2 | 10.96 citations | h-index: 22 | Has Code | ACL

Originality: Incremental advance

AI Analysis

This work addresses a fundamental limitation for users who rely on LLMs for collaborative coding tasks: it highlights an incremental but critical gap in the models' ability to sustain long-term interactions.

The study asks whether LLMs can collaborate effectively over long-term interactions, testing their ability to track and execute coding instructions across multiple sessions. It finds that the performance of even state-of-the-art models such as GPT-4o deteriorates, which the authors' analysis attributes to failures in retrieving and integrating information over long instruction chains.

Large Language Models (LLMs) are increasingly used in working environments for a wide range of tasks, excelling at solving individual problems in isolation. However, are they also able to effectively collaborate over long-term interactions? To investigate this, we introduce MemoryCode, a synthetic multi-session dataset designed to test LLMs' ability to track and execute simple coding instructions amid irrelevant information, simulating a realistic setting. While all the models we tested handle isolated instructions well, even the performance of state-of-the-art models like GPT-4o deteriorates when instructions are spread across sessions. Our analysis suggests this is due to their failure to retrieve and integrate information over long instruction chains. Our results highlight a fundamental limitation of current LLMs, restricting their ability to collaborate effectively in long interactions.
