Steer-to-Detect: Probing Hidden Representations for Detection of LLM-Generated Texts
For practitioners who need robust detection of LLM-generated text, S2D improves discriminative power over raw hidden representations, though it is an incremental advance over existing representation-based detectors.
The authors propose Steer-to-Detect (S2D), a two-stage framework that learns a steering vector to improve the separability of hidden representations in a frozen LLM, then detects machine-generated text with a hypothesis test on the steered representations. S2D achieves strong and consistent performance across out-of-distribution and adversarial settings, and comes with finite-sample, high-probability guarantees on Type I and Type II error rates.
The rapid advancement of large language models (LLMs) has made machine-generated text increasingly difficult to distinguish from human-written text. While recent studies explore leveraging internal representations of language models to uncover deeper detection signals, these raw features often exhibit substantial overlap between classes, limiting their discriminative power. To address this challenge, we propose Steer-to-Detect (S2D), a two-stage framework for detecting LLM-generated text. In the first stage, S2D learns a steering vector that is injected into the hidden states of a frozen observer LLM, producing representations with improved class separability. In the second stage, detection is performed via a hypothesis testing procedure based on the steered representations. We establish finite-sample, high-probability guarantees for Type I and Type II errors, providing a theoretical characterization of the procedure. Empirically, S2D achieves strong and consistent performance across a range of settings, including out-of-distribution scenarios and adversarial perturbations.
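The two stages described above can be illustrated with a small toy sketch. This is not the authors' implementation: the steering vector here is a random placeholder rather than a learned one, the "frozen observer" is a single hypothetical tanh layer, the readout score is an arbitrary projection, and the threshold is a plain empirical quantile of scores on human-written calibration data, which is only one simple way to set up a test at level alpha. All function and variable names are assumptions introduced for illustration.

```python
import numpy as np

def inject_steering(hidden, steering_vector, strength=1.0):
    """Add a steering vector to each hidden state; hidden has shape (n, d)."""
    return hidden + strength * steering_vector

def frozen_layer(hidden, weights):
    """Hypothetical stand-in for the frozen observer layers after the injection point."""
    return np.tanh(hidden @ weights)

def calibrate_threshold(human_scores, alpha=0.05):
    """Empirical (1 - alpha) quantile of scores from known human-written text."""
    return float(np.quantile(human_scores, 1.0 - alpha))

def detect(scores, threshold):
    """Reject the null hypothesis 'human-written' when the score exceeds the threshold."""
    return np.asarray(scores) > threshold

# Toy data: random vectors stand in for hidden states of real text.
rng = np.random.default_rng(0)
d = 8
weights = rng.normal(size=(d, d))      # frozen, never updated
readout = rng.normal(size=d)           # arbitrary scoring direction (assumption)
steering_vector = rng.normal(size=d)   # placeholder; S2D would learn this in stage one

human_hidden = rng.normal(size=(200, d))         # calibration set of human-written text
test_hidden = rng.normal(loc=0.5, size=(50, d))  # texts of unknown origin

def steered_scores(hidden):
    """Stage one: steer the hidden states, pass them through the frozen layer, then score."""
    steered = inject_steering(hidden, steering_vector)
    return frozen_layer(steered, weights) @ readout

# Stage two: calibrate on human-written scores, then test new texts.
threshold = calibrate_threshold(steered_scores(human_hidden), alpha=0.05)
flags = detect(steered_scores(test_hidden), threshold)
print(f"threshold={threshold:.3f}, flagged {int(flags.sum())} of {flags.size}")
```

In this sketch the Type I error is controlled only empirically on the calibration set; the finite-sample guarantees stated in the abstract depend on the paper's own testing procedure, which is not reproduced here.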