The Poison of Alignment
This paper addresses a critical issue for LLM developers and researchers: alignment can inadvertently degrade reasoning capabilities. The contribution is incremental, as it builds on existing alignment methods.
The paper investigates how aligned answers in instruction-tuning datasets harm model performance, showing that models fine-tuned on such data score 4-33% lower on reasoning benchmarks than models tuned without alignment.
From the perspective of content safety, alignment has been shown to limit harmful content generation by large language models (LLMs). This intentional method of training models not to respond to certain user inputs seems to be present in many modern open-source instruction-tuning datasets, such as OpenAssistant or Guanaco. We offer a novel insight into how the presence of alignment in a supervised fine-tuning dataset affects an instruction-tuned model's performance. Specifically, we observe that alignment acts as if it were poisoning the instruction dataset. Experimentally, we demonstrate that aligned answers significantly worsen the performance of the resulting fine-tuned model on various reasoning benchmarks such as BIG-Bench Hard (BBH), Massive Multitask Language Understanding (MMLU), HumanEval, and Discrete Reasoning Over Paragraphs (DROP); the model performs 4-33% worse than its counterpart tuned without alignment.
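To make the idea of a dataset "tuned without alignment" more concrete, the following is a minimal sketch of how aligned (refusal-style) answers might be separated from an instruction-tuning dataset before fine-tuning. It is an illustration under stated assumptions, not the authors' procedure: the abstract does not say how aligned answers are identified, so the keyword patterns, the `instruction`/`response` field names, and the helper names `is_aligned_answer` and `split_dataset` are all hypothetical.

```python
import re

# Hypothetical refusal markers. The source does not specify how aligned
# answers are detected; these phrases are assumptions for illustration.
REFUSAL_PATTERNS = [
    r"\bas an ai language model\b",
    r"\bi cannot (?:help|assist|provide)\b",
    r"\bi'm sorry, but\b",
    r"\bi am sorry, but\b",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), flags=re.IGNORECASE)


def is_aligned_answer(response: str) -> bool:
    """Return True if the response contains any assumed refusal marker."""
    return _REFUSAL_RE.search(response) is not None


def split_dataset(examples):
    """Split examples into (kept, removed) by the refusal heuristic.

    Each example is assumed to be a dict with 'instruction' and
    'response' keys (hypothetical field names).
    """
    kept, removed = [], []
    for example in examples:
        if is_aligned_answer(example["response"]):
            removed.append(example)
        else:
            kept.append(example)
    return kept, removed


# Toy data, invented purely to exercise the functions above.
examples = [
    {"instruction": "Add 2 and 3.", "response": "2 + 3 = 5."},
    {"instruction": "Do X.", "response": "I'm sorry, but I cannot help with that."},
    {"instruction": "Explain Y.", "response": "As an AI language model, I do not have opinions."},
    {"instruction": "Name a prime.", "response": "7 is a prime number."},
]

kept, removed = split_dataset(examples)
```

A keyword heuristic like this will miss some aligned answers and may flag harmless ones, so any real comparison between an aligned and a cleaned dataset would need a more careful detection step and manual spot checks.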