CL, AI, LG | Jul 22, 2024

RadioRAG: Online Retrieval-augmented Generation for Radiology Question Answering

arXiv:2407.15621v3 | 25 citations | h-index: 43
Originality: Incremental advance
AI Analysis

The paper addresses the problem of inaccurate medical information from LLMs, which matters to radiologists and other healthcare professionals. The advance is incremental: it applies existing RAG methods to a specific domain.

The paper tackles the problem of outdated or inaccurate information generated by large language models (LLMs) in radiology question answering by developing RadioRAG, an online retrieval-augmented generation framework that retrieves data from authoritative radiologic sources in real time. The results show that RadioRAG improved diagnostic accuracy across most LLMs, with relative accuracy increases of up to 54%, and that it matched or exceeded non-RAG models and the human radiologist in some subspecialties.
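The "relative accuracy increase" quoted above is easy to confuse with an absolute gain in percentage points. The sketch below shows the usual definition, (accuracy with RAG minus accuracy without) divided by accuracy without. The function name and the example numbers are hypothetical and are not taken from the paper.

```python
def relative_increase(acc_without: float, acc_with: float) -> float:
    """Relative change in accuracy of the RAG run over the non-RAG baseline."""
    if acc_without <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (acc_with - acc_without) / acc_without


# Hypothetical accuracies chosen for illustration only (not the paper's results).
baseline, with_rag = 0.50, 0.77
absolute = with_rag - baseline
relative = relative_increase(baseline, with_rag)
print(f"absolute gain: {absolute:+.2f}, relative gain: {relative:+.0%}")
```

Under these made-up numbers, a 27-point absolute gain corresponds to a 54% relative gain, so the same headline figure can describe quite different absolute improvements depending on the baseline.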

Large language models (LLMs) often generate outdated or inaccurate information based on static training datasets. Retrieval-augmented generation (RAG) mitigates this by integrating outside data sources. While previous RAG systems used pre-assembled, fixed databases with limited flexibility, we have developed Radiology RAG (RadioRAG), an end-to-end framework that retrieves data from authoritative radiologic online sources in real time. We evaluate the diagnostic accuracy of various LLMs when answering radiology-specific questions with and without access to additional online information via RAG. Using 80 questions from the RSNA Case Collection across radiologic subspecialties and 24 additional expert-curated questions with reference standard answers, LLMs (GPT-3.5-turbo, GPT-4, Mistral-7B, Mixtral-8x7B, and Llama3 [8B and 70B]) were prompted with and without RadioRAG in a zero-shot inference scenario. RadioRAG retrieved context-specific information from Radiopaedia in real time. Accuracy was investigated. Statistical analyses were performed using bootstrapping. The results were further compared with human performance. RadioRAG improved diagnostic accuracy across most LLMs, with relative accuracy increases ranging up to 54% for different LLMs. It matched or exceeded non-RAG models and the human radiologist in question answering across radiologic subspecialties, particularly in breast imaging and emergency radiology. However, the degree of improvement varied among models; GPT-3.5-turbo and Mixtral-8x7B-instruct-v0.1 saw notable gains, while Mistral-7B-instruct-v0.2 showed no improvement, highlighting variability in RadioRAG's effectiveness. LLMs benefit when provided access to domain-specific data beyond their training data. RadioRAG shows potential to improve LLM accuracy and factuality in radiology question answering by integrating real-time domain-specific data.
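The abstract states that statistical analyses used bootstrapping but does not give the procedure. As a minimal sketch, assuming each question is scored correct (1) or incorrect (0) for the same model with and without RAG, a paired percentile bootstrap over questions gives a confidence interval for the accuracy difference. The function name, parameters, and toy data below are hypothetical; only the question count (80 + 24 = 104) comes from the abstract.

```python
import random


def paired_bootstrap_ci(correct_base, correct_rag, n_resamples=10_000,
                        alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(RAG) - accuracy(baseline).

    Both inputs are 0/1 lists aligned by question, so each resample draws
    questions (not individual answers) with replacement.
    """
    if len(correct_base) != len(correct_rag):
        raise ValueError("inputs must cover the same questions")
    n = len(correct_base)
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        acc_base = sum(correct_base[i] for i in idx) / n
        acc_rag = sum(correct_rag[i] for i in idx) / n
        diffs.append(acc_rag - acc_base)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    observed = (sum(correct_rag) - sum(correct_base)) / n
    return observed, lower, upper


# Toy per-question outcomes for 104 questions; invented for illustration only.
base = [1] * 50 + [0] * 54
rag = [1] * 70 + [0] * 34
observed, lower, upper = paired_bootstrap_ci(base, rag, n_resamples=2000)
print(f"accuracy difference: {observed:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

Resampling whole questions keeps the with-RAG and without-RAG answers paired, which matters because both conditions are evaluated on the same question set.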

Code Implementations: 1 repo
