CL, AI, LG | Feb 6, 2025

MultiQ&A: An Analysis in Measuring Robustness via Automated Crowdsourcing of Question Perturbations and Answers

arXiv:2502.03711v1 | 2 citations | h-index: 5
Originality: Incremental advance
AI Analysis

The paper addresses a barrier to institutional adoption of LLMs by providing a framework for measuring the confidence and consistency of generated answers. The advance is incremental because it builds on existing evaluation methods.

The paper tackles LLM hallucination by proposing MultiQ&A, a systematic approach for evaluating the robustness and consistency of LLM-generated answers. Its experiments examined 1.9 million question perturbations and 2.3 million answers, and the results indicate that ensembled LLMs, such as gpt-3.5-turbo, remain relatively robust and consistent under perturbations.

One critical challenge in the institutional adoption journey of Large Language Models (LLMs) stems from their propensity to hallucinate in generated responses. To address this, we propose MultiQ&A, a systematic approach for evaluating the robustness and consistency of LLM-generated answers. We demonstrate MultiQ&A's ability to crowdsource question perturbations and their respective answers through independent LLM agents at scale. Our experiments culminated in the examination of 1.9 million question perturbations and 2.3 million answers. Furthermore, MultiQ&A shows that ensembled LLMs, such as gpt-3.5-turbo, remain relatively robust and consistent under perturbations. MultiQ&A provides clarity in the response generation space, offering an effective method for inspecting disagreements and variability. Therefore, our system offers a potential framework for institutional LLM adoption with the ability to measure confidence, consistency, and the quantification of hallucinations.
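
The idea of inspecting agreement across answers to perturbed versions of one question can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `consistency`, the lowercase/strip normalization, and the majority-vote agreement score are all assumptions chosen for illustration, and the sample answers are made up.

```python
from collections import Counter


def consistency(answers):
    """Return (majority_answer, agreement_rate) for answers to perturbed questions.

    Hypothetical helper: normalization and scoring are illustrative assumptions,
    not the metric defined in the MultiQ&A paper.
    """
    if not answers:
        raise ValueError("answers must be non-empty")
    # Normalize so trivially different surface forms count as the same answer.
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    # most_common(1) yields the answer with the most votes and its vote count.
    majority, votes = counts.most_common(1)[0]
    return majority, votes / len(normalized)


# Made-up answers from independent agents, one per perturbed question.
sample_answers = ["Paris", "paris ", "Lyon", "Paris"]
majority, rate = consistency(sample_answers)
print(majority, rate)  # prints: paris 0.75
```

An agreement rate near 1.0 would suggest the answers are stable under perturbation, while a low rate flags a question worth inspecting for disagreement.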
