XAIQA: Explainer-Based Data Augmentation for Extractive Question Answering
This work addresses the need for scalable, high-quality data for building extractive QA systems that let physicians and researchers query medical records. The contribution is incremental, since it builds on existing explainer and transformer methods.
The paper tackles the problem of generating synthetic QA pairs for extractive QA over medical records, a task that typically requires expert annotation. It introduces XAIQA, a method that uses classification model explainers to create QA pairs from electronic health records. In an expert evaluation with two physicians, XAIQA identified 2.2 times more semantic matches and 3.8 times more clinical abbreviations than two popular sentence transformer approaches. In an ML evaluation, adding its QA pairs improved GPT-4's performance as an extractive QA model, including on difficult questions.
Extractive question answering (QA) systems can enable physicians and researchers to query medical records, a foundational capability for designing clinical studies and understanding patient medical history. However, building these systems typically requires expert-annotated QA pairs. Large language models (LLMs), which can perform extractive QA, depend on high-quality data in their prompts, specialized for the application domain. We introduce a novel approach, XAIQA, for generating synthetic QA pairs at scale from data naturally available in electronic health records. Our method uses the idea of a classification model explainer to generate questions and answers about medical concepts corresponding to medical codes. In an expert evaluation with two physicians, our method identifies $2.2\times$ more semantic matches and $3.8\times$ more clinical abbreviations than two popular approaches that use sentence transformers to create QA pairs. In an ML evaluation, adding our QA pairs improves the performance of GPT-4 as an extractive QA model, including on difficult questions. In both the expert and ML evaluations, we examine trade-offs between our method and sentence transformers for QA pair generation depending on question difficulty.
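The general idea of turning a code classifier and an explainer into QA pairs can be illustrated with a minimal Python sketch. It is not the XAIQA implementation: the toy log-odds classifier, the sentence-occlusion explainer, the question template, and the example notes are all assumptions chosen to keep the example self-contained. A document labeled with a medical code is passed to the explainer, the sentence that contributes most to the classifier's score becomes the answer, and the code description becomes the question.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train_log_odds(docs, labels):
    """Toy stand-in for a medical code classifier: smoothed per-word log-odds."""
    pos, neg = Counter(), Counter()
    for doc, label in zip(docs, labels):
        (pos if label else neg).update(set(tokenize(doc)))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return {
        w: math.log((pos[w] + 1) / (n_pos + 2)) - math.log((neg[w] + 1) / (n_neg + 2))
        for w in set(pos) | set(neg)
    }


def score(weights, text):
    """Classifier score for a text: sum of weights of its distinct words."""
    return sum(weights.get(w, 0.0) for w in set(tokenize(text)))


def split_sentences(doc):
    """Naive sentence splitter (assumption: sentences end in . ! or ?)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]


def explain_by_occlusion(weights, doc):
    """Hypothetical explainer: return the sentence whose removal lowers the score most."""
    sentences = split_sentences(doc)
    full = score(weights, doc)
    drops = []
    for i, sentence in enumerate(sentences):
        rest = " ".join(sentences[:i] + sentences[i + 1:])
        drops.append((full - score(weights, rest), sentence))
    return max(drops, key=lambda d: d[0])[1]


def make_qa_pair(weights, doc, code_description):
    """Build one synthetic extractive QA pair (question template is an assumption)."""
    return {
        "question": f"Does the patient have {code_description}?",
        "context": doc,
        "answer": explain_by_occlusion(weights, doc),
    }


# Invented toy notes; label 1 means the code for heart failure was assigned.
train_docs = [
    "Pt with hx of CHF. Presents with dyspnea and leg edema.",
    "CHF exacerbation noted. Started on diuretics.",
    "Pt presents with ankle sprain. No cardiac complaints.",
    "Routine visit. Pt reports mild headache.",
]
train_labels = [1, 1, 0, 0]

weights = train_log_odds(train_docs, train_labels)
note = "Pt seen for follow up. Known CHF with edema on exam. Reports mild headache."
qa = make_qa_pair(weights, note, "congestive heart failure")
print(qa["question"])
print(qa["answer"])
```

In this toy run the selected answer sentence contains the abbreviation "CHF" and none of the words in the question's "congestive heart failure", which is the kind of pair that matching on surface or embedding similarity alone might miss. A real pipeline would replace the log-odds model and the occlusion step with a trained classifier and a proper explainer.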