CRJun 2
Covert Influence Between Language ModelsAvidan Shah, Jay Chooi, Jinghua Ou et al.
As language models increasingly consume one another's outputs, covert influence -- a phenomenon where a sender's payload (the behavioral disposition it is conditioned to propagate) transfers to a receiver through carriers undetectable by humans -- becomes a growing risk. We characterize this risk across three interfaces: supervised fine-tuning, on-policy distillation, and in-context learning, and find that they vary in the scale of influence achievable without leaving behind human-visible traces. Using inference-time per-sample attribution scores, we study covert influence across all three interfaces with the ability to select carriers that amplify training-time influence, unlocking payload transfers that prior work could not achieve. We further provide evidence that covert influence with natural-language carriers is a distinct phenomenon from prior studies using number carriers, as the latter is more resistant to human detection and less portable across model families. Together, these results suggest that the risk surface for covert influence is broader than previously recognized, and we study pointwise attribution scoring methods as a tool to investigate and mitigate it.
LGMay 27
Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency TrainingAvidan Shah, Jannik Brinkmann, Rico Angell
As LLMs gain stronger reasoning capabilities, their extended chain-of-thought introduces new degrees of complexity for defending against adversarial jailbreaks and prompt injection. We study consistency training, a family of fine-tuning objectives that enforce identical behavior on clean prompts and adversarial rewrites, and evaluate its two main variants, output-level (BCT) and activation-level (ACT), across five reasoning models. We formulate both methods as a prompt injection defense and find ACT to be competitive with other training-based defenses while requiring only self-supervised pairs of clean and wrapped prompts. Our experiments also generalize both techniques within the jailbreak setting, demonstrating that ACT remains more robust to adaptive attacks. We also provide mechanistic evidence that ACT's defense against jailbreaks is encoded as a roughly linear shift in activation space at the assistant-turn boundary. After ACT training, we can recover a single steering direction that controls refusal on reasoning models with minimal effect on benign inputs. We find that ACT remains robust even when the model's chain-of-thought is replaced with a compliant trace from the undefended base model, pivoting to refuse prefilled jailbreaks. Together, these results suggest that supervising internal representations is a surprisingly effective and interpretable approach to various forms of safety training in reasoning models.
LGMay 20
On-Policy Consistency Training Improves LLM Safety with Minimal Capability DegradationAndy Han, Kristina Fujimoto, Avidan Shah et al.
Aligned models can misbehave in several ways: they are often sycophantic, fall victim to jailbreaks, or fail to include appropriate safety warnings. Consistency training is a promising new alignment paradigm to mitigate such failures by training invariants into the model using contrastive input pairs. Existing consistency training procedures generate the supervision signal once, offline, and use supervised fine-tuning (SFT) to update the model. Unfortunately, the resulting models tend to merely memorize the surface forms of the training distribution and thus generalize poorly and regress in their capabilities. We introduce On-Policy Consistency Training (OPCT), a new consistency training approach where the objective is computed over the model's own responses to prompts, supervised by itself conditioned on corresponding contrastive prompts. We evaluate OPCT on three safety axes: sycophancy, jailbreaking, and safety awareness. Across three model families, OPCT outperforms its SFT counterpart on all safety desiderata. It nearly halves the sycophancy rate relative to baseline (8.1% vs. 15.4%, compared to 11.2% for SFT). Under an adaptive per-target attacker, OPCT holds jailbreak defense success near 99% on held-out jailbreak behaviors, whereas SFT achieves 87% on average. On safety awareness, OPCT outperforms SFT in two out of three models, and matches it on the other. OPCT also largely avoids the capability regressions that SFT induces, such as a 28-point drop on MATH-500. Our results suggest that consistency training is best implemented as OPCT rather than as SFT, especially when generalization beyond the training distribution is desired.
LGMay 23, 2024
Efficient Mitigation of Bus Bunching through Setter-Based Curriculum LearningAvidan Shah, Danny Tran, Yuhan Tang
Curriculum learning has been growing in the domain of reinforcement learning as a method of improving training efficiency for various tasks. It involves modifying the difficulty (lessons) of the environment as the agent learns, in order to encourage more optimal agent behavior and higher reward states. However, most curriculum learning methods currently involve discrete transitions of the curriculum or predefined steps by the programmer or using automatic curriculum learning on only a small subset training such as only on an adversary. In this paper, we propose a novel approach to curriculum learning that uses a Setter Model to automatically generate an action space, adversary strength, initialization, and bunching strength. Transportation and traffic optimization is a well known area of study, especially for reinforcement learning based solutions. We specifically look at the bus bunching problem for the context of this study. The main idea of the problem is to minimize the delays caused by inefficient bus timings for passengers arriving and departing from a system of buses. While the heavy exploration in the area makes innovation and improvement with regards to performance marginal, it simultaneously provides an effective baseline for developing new generalized techniques. Our group is particularly interested in examining curriculum learning and its effect on training efficiency and overall performance. We decide to try a lesser known approach to curriculum learning, in which the curriculum is not fixed or discretely thresholded. Our method for automated curriculum learning involves a curriculum that is dynamically chosen and learned by an adversary network made to increase the difficulty of the agent's training, and defined by multiple forms of input. Our results are shown in the following sections of this paper.