AI, Jan 8

Arabic Prompts with English Tools: A Benchmark

arXiv:2601.05101v1, h-index: 1
Originality: Synthesis-oriented
AI Analysis

The paper addresses a gap for Arabic-speaking users by providing the first standardized benchmark for Arabic tool-calling. The contribution is incremental, however, because it extends existing English-focused frameworks to a new language.

The paper tackles the lack of benchmarks for evaluating tool-calling capabilities of Large Language Models in Arabic, revealing that tool-calling accuracy drops by 5-10% when users interact in Arabic compared to English.

Large Language Models (LLMs) are now integral to numerous industries, increasingly serving as the core reasoning engine for autonomous agents that perform complex tasks through tool-use. While the development of Arabic-native LLMs is accelerating, the benchmarks for evaluating their capabilities lag behind, with most existing frameworks focusing on English. A critical and overlooked area is tool-calling, where the performance of models prompted in non-English languages like Arabic is poorly understood, especially since these models are often pretrained on predominantly English data. This paper addresses this critical gap by introducing the first dedicated benchmark for evaluating the tool-calling and agentic capabilities of LLMs in the Arabic language. Our work provides a standardized framework to measure the functional accuracy and robustness of models in Arabic agentic workflows. Our findings reveal a huge performance gap: when users interact in Arabic, tool-calling accuracy drops by an average of 5-10%, regardless of whether the tool descriptions themselves are in Arabic or English. By shedding light on these critical challenges, this benchmark aims to foster the development of more reliable and linguistically equitable AI agents for Arabic-speaking users.
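
To make the headline metric concrete, the following is a minimal sketch of how a tool-calling accuracy gap between two prompt languages could be computed. It is not the paper's evaluation code: the exact-match scoring rule, the `ToolCall` and `make_call` helpers, the tool names, and the toy predictions are all assumptions invented for illustration, and the numbers it produces are not results from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical representation of one tool call (not from the paper)."""
    name: str
    arguments: tuple  # sorted (key, value) pairs, so calls compare by value


def make_call(name, **arguments):
    # Sort arguments by key so that argument order does not affect equality.
    return ToolCall(name, tuple(sorted(arguments.items())))


def tool_call_accuracy(predicted, expected):
    """Fraction of examples whose predicted call exactly matches the expected one.

    Exact match on tool name and arguments is an assumed scoring rule.
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy on an empty set")
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


# Toy gold labels: the tool call each user request should trigger.
expected = [
    make_call("get_weather", city="Riyadh"),
    make_call("convert_currency", amount=100, source="SAR", target="USD"),
    make_call("search_flights", origin="CAI", destination="DXB"),
    make_call("get_time", timezone="Asia/Dubai"),
]

# Toy model outputs for the same requests asked in English and in Arabic.
english_predictions = list(expected)
arabic_predictions = list(expected)
# Simulated failure: the model copies the Arabic city name instead of the code.
arabic_predictions[2] = make_call("search_flights", origin="القاهرة", destination="DXB")

acc_en = tool_call_accuracy(english_predictions, expected)
acc_ar = tool_call_accuracy(arabic_predictions, expected)
gap_points = (acc_en - acc_ar) * 100  # gap in percentage points

print(f"English: {acc_en:.2%}, Arabic: {acc_ar:.2%}, gap: {gap_points:.1f} points")
```

The sketch reports the gap in percentage points. The abstract does not say whether its 5-10% figure is absolute or relative, so that choice is also an assumption.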

