CL · AI · Mar 21

Document-tuning for robust alignment to animals

arXiv:2604.13076 · 23.9

Predicted impact: top 35% in CL · last 90 days · Originality: Incremental advance

AI Analysis

For AI alignment researchers, this work shows that document-based value interventions can be effective but are fragile and require preservation strategies.

This paper investigates using synthetic documents to finetune language models for alignment to animal compassion, achieving 77% on the Animal Harm Benchmark versus 40% for instruction-tuning, but the effect degrades after subsequent instruction-tuning.

We investigate the robustness of value alignment via finetuning with synthetic documents, using animal compassion as a value that is both important in its own right and orthogonal to existing alignment efforts. To evaluate compassionate reasoning, we develop the Animal Harm Benchmark (AHB), a 26-question evaluation spanning 13 ethical dimensions, publicly available as a dataset and Inspect evaluation. On the AHB, training with 3000 documents achieves 77% compared to 40% for instruction-tuning approaches, with generalization to human compassion and no degradation in standard safety benchmarks or capabilities. However, subsequent unrelated instruction-tuning degrades the intervention, with the advantage disappearing after 5000 samples. Our exploratory results suggest document-based value interventions may require explicit preservation strategies to remain effective through typical training pipelines.
