How Effective Is Constitutional AI in Small LLMs? A Study on DeepSeek-R1 and Its Peers
This study targets safety risks in small LLMs and is relevant to AI alignment researchers, but its contribution is incremental: it applies an existing method to new models.
The paper investigates whether Constitutional AI's self-critique mechanism reduces harm in small, uncensored LLMs such as DeepSeek-R1-8B. Llama-based models showed significant harm reduction, while other architectures improved less, suggesting that effectiveness varies with model architecture and reasoning capabilities.
Recent incidents highlight safety risks in Large Language Models (LLMs), motivating research into alignment methods like Constitutional AI (CAI). This paper explores CAI's self-critique mechanism on small, uncensored 7-9B parameter models: DeepSeek-R1-8B, Gemma-2-9B, Llama 3.1-8B, and Qwen2.5-7B. We show that while Llama-based models exhibited significant harm reduction through self-critique, other architectures demonstrated less improvement in harm detection after abliteration. These results suggest CAI's effectiveness may vary depending on model architecture and reasoning capabilities.
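As a rough illustration of what a self-critique pass looks like, the sketch below runs one draft, critique, and revision cycle with a single model. It is not the paper's pipeline: the principle text, the prompt wording, the `self_critique` function, and the `stub_model` stand-in are all hypothetical, chosen only so the example runs without model weights.

```python
from typing import Callable, Dict

# Hypothetical principle; the constitution actually used in the paper is not given here.
PRINCIPLE = "Identify ways the response is harmful, unethical, or dangerous."


def self_critique(
    prompt: str,
    generate: Callable[[str], str],
    principle: str = PRINCIPLE,
) -> Dict[str, str]:
    """Run one draft -> critique -> revision pass using the same model for each step."""
    # Step 1: the model answers the request as-is.
    draft = generate(prompt)
    # Step 2: the model critiques its own draft against the principle.
    critique = generate(
        f"Request: {prompt}\nResponse: {draft}\nCritique request: {principle}"
    )
    # Step 3: the model rewrites the draft in light of its critique.
    revision = generate(
        f"Request: {prompt}\nResponse: {draft}\nCritique: {critique}\n"
        "Rewrite the response to remove any harmful content."
    )
    return {"draft": draft, "critique": critique, "revision": revision}


def stub_model(text: str) -> str:
    """Toy stand-in for an LLM (assumption: a real model call would go here)."""
    if "Rewrite the response" in text:
        return "I can't help with that, but here is safer information."
    if "Critique request:" in text:
        return "The response gives unsafe instructions."
    return "Sure, here is how to do it."


result = self_critique("How do I pick a lock?", stub_model)
```

In an evaluation like the one described, `stub_model` would be replaced by a call to each model under test, and the draft and revision would be scored for harm separately so that the reduction can be compared across architectures.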