From Data to Behavior: Predicting Unintended Model Behaviors Before Training

arXiv:2602.04735v1 · 1 citation · h-index: 37
AI Analysis

This work addresses the cost and inefficiency of post hoc bias evaluation, a problem for AI safety researchers. The contribution is incremental, since it builds on existing data-analysis methods.

The paper tackles the problem of predicting unintended biases and safety risks in large language models before fine-tuning. It introduces Data2Behavior as a new task and proposes Manipulating Data Features (MDF), which achieves reliable prediction while using only about 20% of the GPU resources required for fine-tuning.

Large Language Models (LLMs) can acquire unintended biases from seemingly benign training data even without explicit cues or malicious content. Existing methods struggle to detect such risks before fine-tuning, making post hoc evaluation costly and inefficient. To address this challenge, we introduce Data2Behavior, a new task for predicting unintended model behaviors prior to training. We also propose Manipulating Data Features (MDF), a lightweight approach that summarizes candidate data through their mean representations and injects them into the forward pass of a base model, allowing latent statistical signals in the data to shape model activations and reveal potential biases and safety risks without updating any parameters. MDF achieves reliable prediction while consuming only about 20% of the GPU resources required for fine-tuning. Experiments on Qwen3-14B, Qwen2.5-32B-Instruct, and Gemma-3-12b-it confirm that MDF can anticipate unintended behaviors and provide insight into pre-training vulnerabilities.
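The mechanism described in the abstract (summarize candidate data by its mean representation, inject that summary into a frozen model's forward pass, and observe how the outputs move) can be illustrated with a toy example. This is only a minimal sketch under stated assumptions, not the paper's implementation: the two-layer network, the hand-set weights `W1` and `W2`, the choice to inject at the hidden layer, the scaling factor `alpha`, and all function names are hypothetical stand-ins chosen for illustration.

```python
import math

def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    # Numerically stable softmax over a list of logits.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def hidden(W1, x):
    # Hidden representation of one input under the frozen toy model.
    return [math.tanh(v) for v in matvec(W1, x)]

def mean_representation(W1, data):
    # Summarize the candidate dataset as the mean of its hidden representations.
    reps = [hidden(W1, x) for x in data]
    return [sum(col) / len(reps) for col in zip(*reps)]

def forward(W1, W2, x, inject=None, alpha=0.0):
    # Forward pass; optionally add alpha * inject to the hidden activations.
    # No weights are modified at any point.
    h = hidden(W1, x)
    if inject is not None:
        h = [hi + alpha * mi for hi, mi in zip(h, inject)]
    return softmax(matvec(W2, h))

# Hypothetical frozen weights (assumption: a 2-2-2 toy network).
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, -1.0], [-1.0, 1.0]]

# Hypothetical candidate data that leans toward the first feature.
candidate_data = [[0.9, 0.1], [0.8, -0.2], [1.1, 0.0]]
probe = [0.0, 0.0]  # neutral probe input

mean_rep = mean_representation(W1, candidate_data)
baseline = forward(W1, W2, probe)
steered = forward(W1, W2, probe, inject=mean_rep, alpha=1.0)

# A positive shift suggests the data would push the model toward output 0.
shift = steered[0] - baseline[0]
print("baseline:", baseline)
print("steered: ", steered)
print("shift toward output 0:", shift)
```

In this toy setting the probe is neutral, so any change in the output distribution comes entirely from the injected data summary. With a real LLM, the analogous step would read hidden states from a chosen layer and compare behavior on evaluation prompts, and those details would have to follow the paper rather than this sketch.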

