CL · May 29, 2025

ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions

Georgia Tech
arXiv:2505.23662v1 · 4 citations · h-index: 13 · Has Code · EMNLP
Originality: Incremental advance
AI Analysis

This paper addresses the need for better evaluation of tool use in LLMs for real-world applications, though the advance is incremental because it builds on existing benchmarks.

The authors address the problem of evaluating tool-augmented language models in realistic long-term interactions by introducing ToolHaystack, a benchmark that combines multiple tasks and noise within continuous conversations. Evaluating 14 state-of-the-art models, they find that the models often struggle significantly, revealing critical gaps in long-term robustness.

Large language models (LLMs) have demonstrated strong capabilities in using external tools to address user inquiries. However, most existing evaluations assume tool use in short contexts, offering limited insight into model behavior during realistic long-term interactions. To fill this gap, we introduce ToolHaystack, a benchmark for testing tool use capabilities in long-term interactions. Each test instance in ToolHaystack includes multiple task execution contexts and realistic noise within a continuous conversation, enabling assessment of how well models maintain context and handle various disruptions. By applying this benchmark to 14 state-of-the-art LLMs, we find that while current models perform well in standard multi-turn settings, they often struggle significantly in ToolHaystack, highlighting critical gaps in their long-term robustness not revealed by previous tool benchmarks.

Code Implementations: 1 repo
