LG · AI · CY · Feb 1, 2025

How Effective Is Constitutional AI in Small LLMs? A Study on DeepSeek-R1 and Its Peers

arXiv:2503.17365v2 · 3 citations · h-index: 9
Originality: Synthesis-oriented
AI Analysis

This study addresses safety risks in small LLMs, a topic of interest to AI alignment researchers, but its contribution is incremental: it applies an existing method to new models.

The paper investigates how effectively Constitutional AI's self-critique mechanism reduces harm in small, uncensored LLMs such as DeepSeek-R1-8B. Llama-based models showed significant harm reduction, while other architectures improved less, which suggests that effectiveness varies with model architecture and reasoning capabilities.

Recent incidents highlight safety risks in Large Language Models (LLMs), motivating research into alignment methods like Constitutional AI (CAI). This paper explores CAI's self-critique mechanism on small, uncensored 7-9B parameter models: DeepSeek-R1-8B, Gemma-2-9B, Llama 3.1-8B, and Qwen2.5-7B. We show that while Llama-based models exhibited significant harm reduction through self-critique, other architectures demonstrated less improvement in harm detection after abliteration. These results suggest CAI's effectiveness may vary depending on model architecture and reasoning capabilities.
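
For readers unfamiliar with the self-critique mechanism mentioned in the abstract, the following is a minimal sketch of one critique-and-revise pass in the general style of Constitutional AI. It is not the paper's implementation: the `generate` callable, the `PRINCIPLE` text, and the prompt wording are all hypothetical placeholders, and the paper's actual constitution and prompts are not given here.

```python
from typing import Callable, Dict

# Hypothetical principle text (an assumption, not taken from the paper).
PRINCIPLE = "Identify ways the response is harmful, unethical, or dangerous."


def self_critique(
    prompt: str,
    generate: Callable[[str], str],
    principle: str = PRINCIPLE,
) -> Dict[str, str]:
    """Run one critique-and-revise pass.

    `generate` is any function that maps a text prompt to a model
    completion; it stands in for a call to a small LLM.
    """
    # Step 1: the model answers the user prompt.
    initial = generate(prompt)

    # Step 2: the same model critiques its own answer against the principle.
    critique = generate(
        f"Prompt: {prompt}\nResponse: {initial}\n"
        f"Critique request: {principle}"
    )

    # Step 3: the model rewrites its answer in light of the critique.
    revision = generate(
        f"Prompt: {prompt}\nResponse: {initial}\nCritique: {critique}\n"
        "Revision request: Rewrite the response to remove the problems identified."
    )

    return {"initial": initial, "critique": critique, "revision": revision}
```

In this framing, harm reduction would be assessed by comparing the `initial` and `revision` outputs. How that comparison is scored is left open here, since the summary above does not specify it.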

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
