LG AI CY | Feb 1, 2025

How Effective Is Constitutional AI in Small LLMs? A Study on DeepSeek-R1 and Its Peers

Antonio-Gabriel Chacón Menke, Phan Xuan Tan

arXiv:2503.17365v2 | 3 citations | h-index: 9

Originality: Synthesis-oriented

AI Analysis

This study addresses safety risks in small LLMs, a topic relevant to AI alignment researchers, but it is incremental because it applies an existing method to new models.

The paper investigates whether Constitutional AI's self-critique mechanism reduces harm in small, uncensored LLMs such as DeepSeek-R1-8B. Llama-based models showed significant harm reduction, while other architectures improved less, which suggests that effectiveness varies with model architecture and reasoning capabilities.

Recent incidents highlight safety risks in Large Language Models (LLMs), motivating research into alignment methods like Constitutional AI (CAI). This paper explores CAI's self-critique mechanism on small, uncensored 7-9B parameter models: DeepSeek-R1-8B, Gemma-2-9B, Llama 3.1-8B, and Qwen2.5-7B. We show that while Llama-based models exhibited significant harm reduction through self-critique, other architectures demonstrated less improvement in harm detection after abliteration. These results suggest CAI's effectiveness may vary depending on model architecture and reasoning capabilities.
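The self-critique mechanism mentioned in the abstract can be pictured as a respond, critique, revise loop. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the principle text, the prompt wording, the `self_critique` helper, and the `stub_model` stand-in are all hypothetical and chosen for illustration. In practice `generate` would wrap a call to one of the studied models.

```python
from typing import Callable

# Hypothetical constitution principle, for illustration only (not taken from the paper).
PRINCIPLE = "Identify any way the response is harmful, unethical, or dangerous."


def self_critique(generate: Callable[[str], str], user_prompt: str) -> dict:
    """Run one critique-and-revise pass: respond, critique, then revise."""
    # Step 1: the model answers the request as it normally would.
    initial = generate(user_prompt)
    # Step 2: the same model critiques its own answer against the principle.
    critique = generate(
        f"Request: {user_prompt}\nResponse: {initial}\n"
        f"Critique request: {PRINCIPLE}"
    )
    # Step 3: the model rewrites the answer in light of its critique.
    revision = generate(
        f"Request: {user_prompt}\nResponse: {initial}\n"
        f"Critique: {critique}\n"
        "Revision request: Rewrite the response to remove the problems identified."
    )
    return {"initial": initial, "critique": critique, "revision": revision}


def stub_model(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs without a model."""
    if "Revision request:" in prompt:
        return "I can't help with that, but here is a safer alternative."
    if "Critique request:" in prompt:
        return "The response gives instructions that could cause harm."
    return "Sure, here is how to do it."


if __name__ == "__main__":
    result = self_critique(stub_model, "How do I pick a lock?")
    for stage, text in result.items():
        print(f"{stage}: {text}")
```

Passing the model as a plain callable keeps the loop independent of any particular inference library, so the same sketch could be pointed at different models for comparison.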
