Evaluating Search Engines and Large Language Models for Answering Health Questions
This work addresses the problem of evaluating information retrieval tools for health-related queries and offers insights for both users and developers, although its contribution is incremental, since it compares existing technologies.
This study compared search engines, large language models, and retrieval-augmented methods in answering 150 health questions, finding that LLMs achieved about 80% accuracy, outperforming SEs at 50-70%, with RAG improving smaller LLMs by up to 30%.
Search engines (SEs) have traditionally been the primary tools for information seeking, but Large Language Models (LLMs) are emerging as powerful alternatives, particularly for question-answering tasks. This study compares the performance of four popular SEs, seven LLMs, and retrieval-augmented generation (RAG) variants in answering 150 health-related questions from the TREC Health Misinformation (HM) Track. Results reveal that SEs correctly answer between 50% and 70% of questions, often hindered by the many retrieved results that do not address the health question. LLMs deliver higher accuracy, correctly answering about 80% of questions, though their performance is sensitive to input prompts. RAG methods significantly enhance smaller LLMs' effectiveness, improving accuracy by up to 30% by integrating retrieval evidence.
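To make the evaluation setup concrete, the following is a minimal sketch of how a retrieval-augmented prompt might be assembled and how accuracy over a question set might be scored. It is not the study's implementation: the prompt template, the function names `build_rag_prompt` and `accuracy`, and the assumption that each question has a binary yes/no ground-truth label are all illustrative choices.

```python
from typing import Sequence


def build_rag_prompt(question: str, snippets: Sequence[str]) -> str:
    """Prepend retrieved evidence to a health question.

    The template is hypothetical; the study's actual prompts may differ.
    """
    evidence = "\n".join(f"[{i}] {s}" for i, s in enumerate(snippets, start=1))
    return (
        "Use the evidence below to answer the question with 'yes' or 'no'.\n"
        f"Evidence:\n{evidence}\n"
        f"Question: {question}\n"
        "Answer:"
    )


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels.

    Assumes binary labels such as 'yes'/'no'; comparison ignores case
    and surrounding whitespace.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty question set")
    correct = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, gold)
    )
    return correct / len(gold)
```

Under this scoring, a system that answers 120 of 150 questions correctly would obtain an accuracy of 0.8, which corresponds to the roughly 80% figure reported for LLMs.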