CL, HC | Dec 21, 2024

HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Device Scenarios

arXiv:2412.16516v2 | 3 citations | h-index: 15 | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the problem of evaluating LLM robustness for mobile assistant applications, though it appears incremental because it builds on existing benchmarking approaches.

The paper tackles the challenge of evaluating LLMs in multi-turn human-agent interactions by introducing HammerBench, a benchmark framework for assessing function-calling capabilities in real-world mobile scenarios. Its experiments show that parameter name errors are a significant source of failure across different interaction scenarios.

Evaluating the performance of LLMs in multi-turn human-agent interactions presents significant challenges, particularly due to the complexity and variability of user behavior. In this paper, we introduce HammerBench, a novel benchmark framework for assessing LLMs' function-calling capabilities in real-world, multi-turn dialogues. HammerBench simulates diverse mobile assistant use cases, incorporating imperfect instructions, dynamic question-answer trajectories, intent and argument shifts, and the indirect use of external information through pronouns. To construct this benchmark, we curate a comprehensive dataset derived from popular mobile app functionalities and anonymized user logs, complemented by a cost-effective data generation pipeline leveraging open-source models. HammerBench is further augmented with fine-grained interaction snapshots and metrics, enabling detailed evaluation of function-calling performance across individual conversational turns. We demonstrate the effectiveness of HammerBench by evaluating several leading LLMs and uncovering key performance trends. Our experiments reveal that different types of parameter name errors are a significant source of failure across different interaction scenarios, highlighting critical areas for further improvement in LLM robustness for mobile assistant applications.
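To make the idea of turn-level evaluation and parameter name errors concrete, the following is a minimal Python sketch. It is not HammerBench's actual metric or code: the call format (a dict with `name` and `arguments`), the `CallDiff` class, and the `diff_call` and `param_name_error_rate` helpers are all hypothetical names chosen for illustration. It assumes each conversational turn yields one predicted call and one reference call.

```python
from dataclasses import dataclass, field


@dataclass
class CallDiff:
    """Hypothetical per-turn comparison of a predicted call with a reference call."""
    function_match: bool
    missing_params: set = field(default_factory=set)  # in reference, absent from prediction
    extra_params: set = field(default_factory=set)    # in prediction, absent from reference
    wrong_values: set = field(default_factory=set)    # same name, different value


def diff_call(predicted: dict, reference: dict) -> CallDiff:
    """Compare two calls of the assumed form {"name": str, "arguments": dict}."""
    pred_args = predicted.get("arguments", {})
    ref_args = reference.get("arguments", {})
    pred_names, ref_names = set(pred_args), set(ref_args)
    return CallDiff(
        function_match=predicted.get("name") == reference.get("name"),
        missing_params=ref_names - pred_names,
        extra_params=pred_names - ref_names,
        wrong_values={k for k in pred_names & ref_names if pred_args[k] != ref_args[k]},
    )


def param_name_error_rate(turns) -> float:
    """Fraction of turns with at least one missing or extra parameter name."""
    diffs = [diff_call(pred, ref) for pred, ref in turns]
    if not diffs:
        return 0.0
    bad = sum(1 for d in diffs if d.missing_params or d.extra_params)
    return bad / len(diffs)


# Illustrative data only; not taken from the benchmark.
turns = [
    (
        {"name": "set_alarm", "arguments": {"time": "07:00"}},
        {"name": "set_alarm", "arguments": {"time": "07:00"}},
    ),
    (
        {"name": "set_alarm", "arguments": {"time": "07:00", "label": "gym"}},
        {"name": "set_alarm", "arguments": {"time": "07:00", "repeat": "daily"}},
    ),
]
print(param_name_error_rate(turns))
```

Separating name-level mismatches (missing or extra parameters) from value-level mismatches is one plausible way to isolate the failure type the abstract highlights; the paper's own metrics may be defined differently.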

Code Implementations: 1 repo