CL · AI · Nov 25, 2024

Enhancing Answer Reliability Through Inter-Model Consensus of Large Language Models

arXiv:2411.16797v2 · 6 citations · h-index: 3 · AI
Originality: Incremental advance
AI Analysis

This work aims to improve answer reliability in AI-driven collaborative reasoning systems. The advance is incremental, since it applies existing consensus methods to LLMs.

The study addresses unreliable answers from large language models (LLMs) on complex statistical questions that lack ground truth, using inter-model consensus as a proxy for reliability. It finds that questions generated by Claude and GPT-4 drew higher agreement among the answering models, while questions generated by Gemini and LLaMA showed greater variability and lower reliability.

We propose a collaborative framework in which multiple large language models -- including GPT-4-0125-preview, Meta-LLaMA-3-70B-Instruct, Claude-3-Opus, and Gemini-1.5-Flash -- generate and answer complex, PhD-level statistical questions when definitive ground truth is unavailable. Our study examines how inter-model consensus both improves response reliability and indicates the quality of the generated questions. Employing chi-square tests, Fleiss' Kappa, and confidence interval analysis, we quantify consensus rates and inter-rater agreement to assess both response precision and question quality. Key results indicate that Claude and GPT-4 produce well-structured, less ambiguous questions with higher inter-rater agreement, as shown by narrower confidence intervals and greater alignment with question-generating models. In contrast, Gemini and LLaMA exhibit greater variability and lower reliability in question formulation. These findings demonstrate that collaborative interactions among large language models enhance response reliability and provide valuable insights for optimizing AI-driven collaborative reasoning systems.
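To make the agreement statistics named in the abstract concrete, here is a minimal sketch of how Fleiss' Kappa and a consensus rate with a confidence interval could be computed over a set of multiple-choice answers. It is not the authors' code. The function names `fleiss_kappa` and `consensus_rate`, the input layout (one list of answer labels per question, one label per model), the strict-majority definition of consensus, and the choice of a Wilson interval are all assumptions made for illustration; the abstract does not say which interval the paper uses.

```python
from collections import Counter
from math import sqrt


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items.

    Each item is a list of category labels, one per rater (here: one per
    model). Every item must have the same number of raters (at least 2).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of raters")

    counts = [Counter(item) for item in ratings]
    categories = {label for item in ratings for label in item}

    # Per-item agreement: share of rater pairs that gave the same label.
    p_items = [
        (sum(c * c for c in cnt.values()) - n_raters)
        / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall proportion of each category.
    total = n_items * n_raters
    p_e = sum(
        (sum(cnt[cat] for cnt in counts) / total) ** 2 for cat in categories
    )
    if p_e == 1.0:
        # Only one category was ever used; kappa is undefined in this case.
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


def consensus_rate(ratings, z=1.96):
    """Share of items with a strict-majority answer, plus a Wilson interval.

    The strict-majority rule and the Wilson interval are illustrative
    assumptions, not taken from the paper.
    """
    n = len(ratings)
    hits = sum(
        1
        for item in ratings
        if Counter(item).most_common(1)[0][1] > len(item) / 2
    )
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, (centre - half, centre + half)


# Hypothetical answers from four models to four multiple-choice questions.
example = [
    ["A", "A", "A", "A"],
    ["B", "B", "B", "A"],
    ["C", "C", "A", "B"],
    ["D", "D", "D", "D"],
]
kappa = fleiss_kappa(example)
rate, (low, high) = consensus_rate(example)
print(f"kappa={kappa:.3f} consensus={rate:.2f} CI=({low:.2f}, {high:.2f})")
```

With so few questions the interval is wide. Read in the abstract's terms, a narrower interval around the consensus rate for questions from a given generating model would indicate less ambiguous questions.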

