CL, AI, Sep 30, 2024

Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models

arXiv:2409.20222v2, 19 citations, h-index: 4, Has Code
Originality: Highly original
AI Analysis

This benchmark helps researchers and developers evaluate the long-term memory, continual learning, and information integration capabilities of large language models. It highlights challenges in more natural, interleaved interactions that current benchmarks miss.

This paper introduces a dynamic benchmarking system for conversational agents that evaluates their performance through a single, simulated, and lengthy user-agent interaction involving multiple interleaved tasks. The study found that while LLMs perform well on single-task interactions, they struggle significantly when tasks are interleaved, and short-context LLMs with a Long-Term Memory (LTM) system can perform as well as or better than those with larger contexts.

We introduce a dynamic benchmarking system for conversational agents that evaluates their performance through a single, simulated, and lengthy user$\leftrightarrow$agent interaction. The interaction is a conversation between the user and agent, where multiple tasks are introduced and then undertaken concurrently. We context switch regularly to interleave the tasks, which constructs a realistic testing scenario in which we assess the Long-Term Memory, Continual Learning, and Information Integration capabilities of the agents. Results from both proprietary and open-source Large-Language Models show that LLMs in general perform well on single-task interactions, but they struggle on the same tasks when they are interleaved. Notably, short-context LLMs supplemented with an LTM system perform as well as or better than those with larger contexts. Our benchmark suggests that there are other challenges for LLMs responding to more natural interactions that contemporary benchmarks have heretofore not been able to capture.
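The interleaving described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Task` class, the `interleave` function, and the random switching policy are all hypothetical, chosen only to show how several task scripts might be merged into one long conversation while each task keeps its own internal message order.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Task:
    """Hypothetical task script: a name plus its user turns in delivery order."""
    name: str
    messages: List[str]


def interleave(tasks: List[Task], seed: int = 0) -> List[Tuple[str, str]]:
    """Merge all tasks into one conversation schedule.

    At every step a random unfinished task is picked (a context switch),
    and its next undelivered message is appended. Each task's own message
    order is preserved. The random policy is an assumption for illustration.
    """
    rng = random.Random(seed)
    cursors = {task.name: 0 for task in tasks}
    pending = [task for task in tasks if task.messages]
    schedule: List[Tuple[str, str]] = []

    while pending:
        task = rng.choice(pending)
        position = cursors[task.name]
        schedule.append((task.name, task.messages[position]))
        cursors[task.name] = position + 1
        # Drop the task once all of its messages have been delivered.
        if cursors[task.name] == len(task.messages):
            pending = [t for t in pending if t is not task]

    return schedule


if __name__ == "__main__":
    demo = [
        Task("shopping_list", ["Add milk.", "Add eggs.", "What is on my list?"]),
        Task("trivia", ["The capital of Peru is Lima.", "What is the capital of Peru?"]),
    ]
    for name, message in interleave(demo, seed=42):
        print(f"[{name}] {message}")
```

In a single-task run the question follows its supporting information directly. In the interleaved schedule, turns from other tasks sit between them, which is the condition under which the abstract reports that LLMs struggle.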

Code Implementations: 1 repo