LGMar 3

SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety

arXiv:2603.02635v1h-index: 5Has Code
Originality Incremental advance
AI Analysis

This addresses safety issues in multimodal AI systems, particularly for applications requiring reliable content moderation, but it is incremental as it builds on existing alignment techniques with a novel structured approach.

The paper tackles the susceptibility of vision-language models to multimodal jailbreaks and over-refusal by introducing SaFeR-ToolKit, a method that formalizes safety decision-making as a checkable protocol using virtual tool calling, resulting in significant improvements in safety, helpfulness, and reasoning rigor scores (e.g., from 29.39 to 84.40 on a 3B model).

Vision-language models remain susceptible to multimodal jailbreaks and over-refusal because safety hinges on both visual evidence and user intent, while many alignment pipelines supervise only the final response. To address this, we present SaFeR-ToolKit, which formalizes safety decision-making as a checkable protocol. Concretely, a planner specifies a persona, a Perception $\to$ Reasoning $\to$ Decision tool set, and a constrained transition graph, while a responder outputs a typed key-value tool trace before the final answer. To make the protocol reliably followed in practice, we train a single policy with a three-stage curriculum (SFT $\to$ DPO $\to$ GRPO), where GRPO directly supervises tool usage beyond answer-level feedback. Our contributions are two-fold: I. Dataset. The first tool-based safety reasoning dataset, comprising 31,654 examples (SFT 6k, DPO 18.6k, GRPO 6k) plus 1k held-out evaluation. II. Experiments. On Qwen2.5-VL, SaFeR-ToolKit significantly improves Safety/Helpfulness/Reasoning Rigor on 3B (29.39/45.04/4.98 $\to$ 84.40/71.13/78.87) and 7B (53.21/52.92/19.26 $\to$ 86.34/80.79/85.34), while preserving general capabilities (3B: 58.67 $\to$ 59.21; 7B: 66.39 $\to$ 66.81). Codes are available at https://github.com/Duebassx/SaFeR_ToolKit.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes