CL, AI, Feb 16, 2024

ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages

arXiv:2402.10753v2, 70 citations, h-index: 40, Has Code, ACL
AI Analysis

This work addresses safety risks for users and developers deploying LLMs in real-world tool applications, but it is incremental, since it focuses on identifying these issues rather than solving them.

The paper tackles safety issues in large language models (LLMs) during tool learning. It identifies six safety scenarios across the input, execution, and output stages, and its experiments on 11 models show that even GPT-4 is susceptible to these challenges.

Tool learning is widely acknowledged as a foundational approach for deploying large language models (LLMs) in real-world scenarios. While current research primarily emphasizes leveraging tools to augment LLMs, it frequently neglects emerging safety considerations tied to their application. To fill this gap, we present *ToolSword*, a comprehensive framework dedicated to meticulously investigating safety issues linked to LLMs in tool learning. Specifically, ToolSword delineates six safety scenarios for LLMs in tool learning, encompassing **malicious queries** and **jailbreak attacks** in the input stage, **noisy misdirection** and **risky cues** in the execution stage, and **harmful feedback** and **error conflicts** in the output stage. Experiments conducted on 11 open-source and closed-source LLMs reveal enduring safety challenges in tool learning, such as handling harmful queries, employing risky tools, and delivering detrimental feedback, which even GPT-4 is susceptible to. Moreover, we conduct further studies with the aim of fostering research on tool learning safety. The data is released at https://github.com/Junjie-Ye/ToolSword.
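The three-stage taxonomy in the abstract can be pictured as a small lookup table. The following is a minimal illustrative sketch, not code from the ToolSword repository: the scenario names come from the abstract, while `Stage`, `SCENARIOS`, and `stage_of` are hypothetical names introduced here.

```python
from enum import Enum


class Stage(Enum):
    """The three tool-learning stages named in the abstract."""
    INPUT = "input"
    EXECUTION = "execution"
    OUTPUT = "output"


# Scenario names are taken from the abstract; the mapping itself is
# an illustrative assumption, not the authors' data format.
SCENARIOS = {
    Stage.INPUT: ("malicious queries", "jailbreak attacks"),
    Stage.EXECUTION: ("noisy misdirection", "risky cues"),
    Stage.OUTPUT: ("harmful feedback", "error conflicts"),
}


def stage_of(scenario: str) -> Stage:
    """Return the stage a scenario belongs to (hypothetical helper)."""
    for stage, names in SCENARIOS.items():
        if scenario in names:
            return stage
    raise KeyError(f"unknown scenario: {scenario}")
```

For example, `stage_of("risky cues")` returns `Stage.EXECUTION`, reflecting that risky cues arise while the model is selecting and calling tools.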

Code Implementations: 1 repo
