CL | AI | Nov 21, 2024

FunctionChat-Bench: Comprehensive Evaluation of Language Models' Generative Capabilities in Korean Tool-use Dialogs

arXiv:2411.14054v1 | 6 citations | h-index: 6
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for better evaluation of language models in tool-use dialogs, particularly for Korean, but is incremental as it builds on existing function-calling benchmarks.

The study tackled the problem of evaluating language models' generative capabilities in Korean tool-use dialogs by introducing FunctionChat-Bench, a benchmark with 700 items, and found that high accuracy in single-turn tool calls does not ensure superior performance in multi-turn environments.

This study investigates language models' generative capabilities in tool-use dialogs. We categorize the models' outputs in tool-use dialogs into four distinct types: Tool Call, Answer Completion, Slot Question, and Relevance Detection, which serve as aspects for evaluation. We introduce FunctionChat-Bench, comprising 700 evaluation items and automated assessment programs. Using this benchmark, we evaluate several language models that support function calling. Our findings indicate that while language models may exhibit high accuracy in single-turn Tool Call scenarios, this does not necessarily translate to superior generative performance in multi-turn environments. We argue that the capabilities required for function calling extend beyond generating tool call messages; a model must also effectively generate conversational messages that engage the user.
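The four output types named in the abstract can be read as a decision over the state of the dialog. The following is a minimal sketch of that reading, not the benchmark's actual schema or evaluation code: the `TurnState` fields, the `expected_output_type` function, and the order of the checks are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class OutputType(Enum):
    """The four output types listed in the abstract."""
    TOOL_CALL = "Tool Call"
    ANSWER_COMPLETION = "Answer Completion"
    SLOT_QUESTION = "Slot Question"
    RELEVANCE_DETECTION = "Relevance Detection"


@dataclass
class TurnState:
    """Hypothetical summary of one dialog turn (not the benchmark's format)."""
    matching_tool: Optional[str] = None  # tool that fits the request, if any
    missing_required_args: List[str] = field(default_factory=list)
    has_tool_result: bool = False  # a tool result is already in the history


def expected_output_type(state: TurnState) -> OutputType:
    """Assumed mapping from dialog state to the kind of message expected."""
    if state.has_tool_result:
        # The tool has run; the model should turn its result into a reply.
        return OutputType.ANSWER_COMPLETION
    if state.matching_tool is None:
        # No available tool fits; the model should recognise that.
        return OutputType.RELEVANCE_DETECTION
    if state.missing_required_args:
        # A tool fits but required arguments are missing; ask the user.
        return OutputType.SLOT_QUESTION
    # A tool fits and all required arguments are known.
    return OutputType.TOOL_CALL


if __name__ == "__main__":
    state = TurnState(matching_tool="get_weather", missing_required_args=["city"])
    print(expected_output_type(state).value)
```

Under this reading, only one of the four branches ends in a tool call message; the other three require conversational text, which is consistent with the abstract's claim that tool call accuracy alone does not capture multi-turn performance.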

Code Implementations: 1 repo
