CLAIMar 21

Document-tuning for robust alignment to animals

arXiv:2604.13076 · 23.9
Predicted impact: top 35% in CL · last 90 days
Originality: Incremental advance
AI Analysis

For AI alignment researchers, this work shows that document-based value interventions can be effective but are fragile, and may require explicit preservation strategies to survive later training.

This paper finetunes language models on synthetic documents to instill animal compassion. The approach scores 77% on the Animal Harm Benchmark versus 40% for instruction-tuning, but the advantage degrades under subsequent unrelated instruction-tuning and disappears after 5000 samples.

We investigate the robustness of value alignment via finetuning with synthetic documents, using animal compassion as a value that is both important in its own right and orthogonal to existing alignment efforts. To evaluate compassionate reasoning, we develop and release the Animal Harm Benchmark (AHB), a 26-question evaluation spanning 13 ethical dimensions, publicly available as both a dataset and an Inspect evaluation. On the AHB, training with 3000 documents achieves 77% compared to 40% for instruction-tuning approaches, with generalization to human compassion and no degradation in standard safety benchmarks or capabilities. However, subsequent unrelated instruction-tuning degrades the intervention, with the advantage disappearing after 5000 samples. Our exploratory results suggest document-based value interventions may require explicit preservation strategies to remain effective through typical training pipelines.

Code Implementations: 1 repo
