Ensemble Debates with Local Large Language Models for AI Alignment
This work addresses AI alignment for high-stakes decisions by providing a reproducible method that uses local models, though its contribution is incremental because it builds on existing ensemble techniques.
The paper studies whether debates among ensembles of local open-source models can improve the alignment of large language models with human values. It finds that ensembles outperform single-model baselines on a 7-point rubric (overall scores of 3.48 vs. 3.13), including gains in reasoning depth (+19.4%) and argument quality (+34.1%).
As large language models (LLMs) take on greater roles in high-stakes decisions, alignment with human values is essential. Reliance on proprietary APIs limits reproducibility and broad participation. We study whether local open-source ensemble debates can improve alignment-oriented reasoning. Across 150 debates spanning 15 scenarios and five ensemble configurations, ensembles outperform single-model baselines on a 7-point rubric (overall: 3.48 vs. 3.13), with the largest gains in reasoning depth (+19.4%) and argument quality (+34.1%). Improvements are strongest for truthfulness (+1.25 points) and human enhancement (+0.80). We release code, prompts, and a debate data set as an accessible and reproducible foundation for ensemble-based alignment evaluation.
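The abstract reports the overall scores in absolute terms and the per-dimension gains as percentages. As a minimal sketch for putting the two on the same footing, the snippet below computes the relative gain implied by the overall scores. The helper name `relative_gain` is hypothetical and not taken from the released code, and it assumes the reported percentages are relative changes over the single-model baseline, which the abstract does not state explicitly.

```python
def relative_gain(ensemble: float, baseline: float) -> float:
    """Return the percentage change of the ensemble score over the baseline score."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (ensemble - baseline) / baseline * 100.0


# Overall rubric scores as reported in the abstract (7-point scale).
ensemble_overall = 3.48
baseline_overall = 3.13

overall_gain = relative_gain(ensemble_overall, baseline_overall)
print(f"Overall relative gain: {overall_gain:.1f}%")  # prints "Overall relative gain: 11.2%"
```

Under that assumption, the overall gain of about 11.2% is smaller than the gains reported for reasoning depth and argument quality, consistent with the abstract describing those two dimensions as the largest gains.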