From Data to Behavior: Predicting Unintended Model Behaviors Before Training
This work addresses a pain point for AI safety researchers: post hoc evaluation of model biases is costly and inefficient. The contribution is incremental, however, as it builds on existing methods for data analysis.
The paper tackles the problem of predicting unintended biases and safety risks in large language models before fine-tuning. It introduces Data2Behavior as a new task and proposes Manipulating Data Features (MDF), which achieves reliable prediction while using only about 20% of the GPU resources required for fine-tuning.
Large Language Models (LLMs) can acquire unintended biases from seemingly benign training data even without explicit cues or malicious content. Existing methods struggle to detect such risks before fine-tuning, making post hoc evaluation costly and inefficient. To address this challenge, we introduce Data2Behavior, a new task for predicting unintended model behaviors prior to training. We also propose Manipulating Data Features (MDF), a lightweight approach that summarizes candidate data through their mean representations and injects them into the forward pass of a base model, allowing latent statistical signals in the data to shape model activations and reveal potential biases and safety risks without updating any parameters. MDF achieves reliable prediction while consuming only about 20% of the GPU resources required for fine-tuning. Experiments on Qwen3-14B, Qwen2.5-32B-Instruct, and Gemma-3-12b-it confirm that MDF can anticipate unintended behaviors and provide insight into pre-training vulnerabilities.
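The core idea of MDF, as described in the abstract, can be illustrated with a toy sketch: summarize candidate data by its mean hidden representation, then inject that summary into the forward pass of a frozen model and compare the output with an uninjected baseline. Everything below is an assumption made for illustration and is not the authors' implementation: the tiny residual tanh network stands in for an LLM, and the function names, the additive injection rule, the single injection layer `inject_at`, and the scaling factor `alpha` are all hypothetical.

```python
import math
import random


def make_toy_model(dim, n_layers, seed=0):
    """Build a toy stand-in for a base model: one dim x dim weight matrix per layer."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    return [
        [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]
        for _ in range(n_layers)
    ]


def layer_forward(weights, h):
    """Residual layer: h + tanh(W h). Reads the weights, never modifies them."""
    return [
        h_i + math.tanh(sum(w * x for w, x in zip(row, h)))
        for h_i, row in zip(h, weights)
    ]


def forward(model, x, inject_at=None, steer=None, alpha=1.0):
    """Run the model; optionally add alpha * steer to the hidden state after one layer."""
    h = list(x)
    for idx, weights in enumerate(model):
        h = layer_forward(weights, h)
        if steer is not None and idx == inject_at:
            # Assumed injection rule: simple additive shift of the hidden state.
            h = [h_i + alpha * s_i for h_i, s_i in zip(h, steer)]
    return h


def mean_representation(model, dataset, layer):
    """Average the hidden states at `layer` over all candidate examples."""
    reps = []
    for x in dataset:
        h = list(x)
        for weights in model[: layer + 1]:
            h = layer_forward(weights, h)
        reps.append(h)
    return [sum(column) / len(reps) for column in zip(*reps)]


# Toy setup: 4-dimensional inputs, 3 layers, 8 random "candidate training examples".
model = make_toy_model(dim=4, n_layers=3)
data_rng = random.Random(1)
candidate_data = [[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
probe = [0.5, -0.2, 0.1, 0.0]  # stands in for an evaluation prompt

mean_rep = mean_representation(model, candidate_data, layer=1)
baseline = forward(model, probe)
steered = forward(model, probe, inject_at=1, steer=mean_rep, alpha=1.0)

# The shift between steered and baseline outputs is the signal one would inspect.
shift = [s - b for s, b in zip(steered, baseline)]
print("output shift:", shift)
```

No weights are updated anywhere in this sketch, which mirrors the abstract's claim that MDF reveals potential behaviors without changing parameters. With a real model, the analogous step would read and modify hidden states at a chosen transformer layer and then score the generated text for the bias or safety behavior of interest; how the layer and scaling are chosen is not stated in the abstract.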