CL · Feb 19, 2025

From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions

arXiv:2502.13791v2 · 5 citations · h-index: 22 · ACL
Originality: Incremental advance
AI Analysis

This work addresses a fundamental limitation for users who rely on LLMs for collaborative coding tasks. It highlights an incremental but critical gap in the models' long-term interaction capabilities.

The study asks whether LLMs can collaborate effectively over long-term interactions by testing their ability to track and execute coding instructions across multiple sessions. It finds that even state-of-the-art models like GPT-4o deteriorate in performance, which the authors' analysis attributes to failures in retrieving and integrating information over long instruction chains.

Large Language Models (LLMs) are increasingly used in working environments for a wide range of tasks, excelling at solving individual problems in isolation. However, are they also able to effectively collaborate over long-term interactions? To investigate this, we introduce MemoryCode, a synthetic multi-session dataset designed to test LLMs' ability to track and execute simple coding instructions amid irrelevant information, simulating a realistic setting. While all the models we tested handle isolated instructions well, even the performance of state-of-the-art models like GPT-4o deteriorates when instructions are spread across sessions. Our analysis suggests this is due to their failure to retrieve and integrate information over long instruction chains. Our results highlight a fundamental limitation of current LLMs, restricting their ability to collaborate effectively in long interactions.

Code Implementations: 1 repo
