CL | HC | LG | Jan 25, 2024

K-QA: A Real-World Medical Q&A Benchmark

arXiv:2401.14493v1 | 38 citations | BioNLP
Originality: Incremental advance
AI Analysis

This work addresses the need for medically accurate NLP applications in healthcare. It is rated incremental because it builds on existing evaluation methods and contributes a new dataset.

The authors address the accuracy of large language models in clinical settings by constructing K-QA, a real-world medical Q&A dataset of 1,212 patient questions, and by introducing two NLI-based evaluation metrics. They find that in-context learning improves comprehensiveness and that augmented retrieval reduces hallucinations.

Ensuring the accuracy of responses provided by large language models (LLMs) is crucial, particularly in clinical settings where incorrect information may directly impact patient health. To address this challenge, we construct K-QA, a dataset containing 1,212 patient questions originating from real-world conversations held on K Health (an AI-driven clinical platform). We employ a panel of in-house physicians to answer and manually decompose a subset of K-QA into self-contained statements. Additionally, we formulate two NLI-based evaluation metrics approximating recall and precision: (1) comprehensiveness, measuring the percentage of essential clinical information in the generated answer, and (2) hallucination rate, measuring the number of statements from the physician-curated response contradicted by the LLM answer. Finally, we use K-QA along with these metrics to evaluate several state-of-the-art models, as well as the effect of in-context learning and medically-oriented augmented retrieval schemes developed by the authors. Our findings indicate that in-context learning improves the comprehensiveness of the models, and augmented retrieval is effective in reducing hallucinations. We make K-QA available to the community to spur research into medically accurate NLP applications.
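The two metrics described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `score_answer` function, the `NliFn` interface, the three label names, and the lookup-table `toy_nli` stub are all assumptions made for the example, and the statements are made-up toy data rather than entries from K-QA. In practice `nli` would wrap a real NLI model or an LLM prompted to act as one.

```python
from typing import Callable, Dict, List

# Labels a hypothetical NLI model is assumed to return for (premise, hypothesis).
ENTAILMENT = "entailment"
NEUTRAL = "neutral"
CONTRADICTION = "contradiction"

# Assumed interface: nli(premise, hypothesis) -> one of the labels above.
NliFn = Callable[[str, str], str]


def score_answer(
    answer: str,
    essential: List[str],
    all_statements: List[str],
    nli: NliFn,
) -> Dict[str, float]:
    """Score one generated answer against physician-written statements.

    comprehensiveness: fraction of essential statements entailed by the answer
                       (a recall-like quantity).
    hallucination_count: number of physician statements the answer contradicts
                         (a precision-like quantity).
    """
    entailed = sum(1 for s in essential if nli(answer, s) == ENTAILMENT)
    comprehensiveness = entailed / len(essential) if essential else 0.0
    contradicted = sum(1 for s in all_statements if nli(answer, s) == CONTRADICTION)
    return {
        "comprehensiveness": comprehensiveness,
        "hallucination_count": contradicted,
    }


# Toy stand-in for an NLI model: a fixed lookup keyed by the hypothesis.
# Illustrative data only, not taken from the dataset.
TOY_LABELS = {
    "Ibuprofen can relieve the pain.": ENTAILMENT,
    "Ibuprofen should be taken with food.": CONTRADICTION,
    "See a doctor if the pain persists.": NEUTRAL,
}


def toy_nli(premise: str, hypothesis: str) -> str:
    return TOY_LABELS.get(hypothesis, NEUTRAL)


answer = "Ibuprofen can help, and it is fine to take it on an empty stomach."
essential = [
    "Ibuprofen can relieve the pain.",
    "Ibuprofen should be taken with food.",
]
all_statements = list(TOY_LABELS)

result = score_answer(answer, essential, all_statements, toy_nli)
print(result)
```

In this toy case one of the two essential statements is entailed and one physician statement is contradicted, so the answer covers half of the essential content and contains one hallucination. Averaging these per-question scores over a dataset is one plausible way to compare models, though the aggregation used in the paper is not specified here.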

Code Implementations: 1 repo
