To Adapt or to Annotate: Challenges and Interventions for Domain Adaptation in Open-Domain Question Answering
This work addresses the robustness of ODQA models in real-world applications across different domains, with incremental contributions in evaluation and interventions.
The paper tackles domain adaptation in open-domain question answering by evaluating end-to-end model performance under realistic domain shifts. It finds that models fail to generalize and that high retrieval scores do not ensure good answer accuracy, and it proposes intervention methods that improve answer F1 score by up to 24 points.
Recent advances in open-domain question answering (ODQA) have demonstrated impressive accuracy on standard Wikipedia-style benchmarks. However, it is less clear how robust these models are and how well they perform when applied to real-world applications in drastically different domains. While there has been some work investigating how well ODQA models perform when tested for out-of-domain (OOD) generalization, these studies have been conducted only under conservative shifts in data distribution and typically focus on a single component (i.e., retrieval) rather than an end-to-end system. In response, we propose a more realistic and challenging domain shift evaluation setting and, through extensive experiments, study end-to-end model performance. We find that not only do models fail to generalize, but high retrieval scores often still yield poor answer prediction accuracy. We then categorize different types of shifts and propose techniques that, when presented with a new dataset, predict whether intervention methods are likely to be successful. Finally, using insights from this analysis, we propose and evaluate several intervention methods which improve end-to-end answer F1 score by up to 24 points.
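The abstract reports gains in end-to-end answer F1 but does not define the metric here. As a rough illustration only, the sketch below computes a token-overlap F1 between a predicted answer and a gold answer, in the style commonly used for extractive QA benchmarks. The normalization steps (lowercasing, removing punctuation and English articles) and the function names `normalize` and `answer_f1` are assumptions for this example, not the paper's stated evaluation code.

```python
from collections import Counter
import re
import string


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    These normalization choices are an assumption for illustration.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def answer_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this sketch, scores lie in the range 0 to 1. If F1 is reported on a 0 to 100 scale, as is common, a gain of 24 points would correspond to a difference of 0.24 in the averaged value returned here.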