An Exploration of Data Augmentation and Sampling Techniques for Domain-Agnostic Question Answering
This work addresses the challenge of building question answering systems that remain robust across diverse domains. Its contribution is incremental, since it builds on existing methods such as XLNet and established sampling techniques.
The paper tackles the problem of building a domain-agnostic question answering model for the MRQA 2019 Shared Task. It finds that a simple negative sampling technique, combined with per-domain sampling, achieved the second-best Exact Match and F1 scores in the competition.
To produce a domain-agnostic question answering model for the Machine Reading Question Answering (MRQA) 2019 Shared Task, we investigate the relative benefits of large pre-trained language models, various data sampling strategies, and query and context paraphrases generated by back-translation. We find a simple negative sampling technique to be particularly effective, even though it is typically used for datasets that include unanswerable questions, such as SQuAD 2.0. When applied in conjunction with per-domain sampling, our XLNet (Yang et al., 2019)-based submission achieved the second-best Exact Match and F1 in the MRQA leaderboard competition.
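As a rough illustration of the two sampling ideas named above, the Python sketch below assumes one plausible reading that the abstract itself does not spell out: long contexts are split into overlapping token windows, windows that do not contain the answer are treated as negative (no-answer) examples and kept with some probability, and the training mix draws up to the same number of examples from each domain. All function names, parameters (window_size, stride, keep_prob, per_domain), and the toy data are hypothetical and are not taken from the submission.

```python
import random


def window_spans(num_tokens, window_size, stride):
    """Return (start, end) token offsets of overlapping windows over a context."""
    spans = []
    start = 0
    while True:
        end = min(start + window_size, num_tokens)
        spans.append((start, end))
        if end == num_tokens:
            break
        start += stride
    return spans


def label_windows(num_tokens, answer_start, answer_end, window_size, stride):
    """Mark a window positive if it fully contains the answer span
    [answer_start, answer_end); every other window is a negative."""
    examples = []
    for start, end in window_spans(num_tokens, window_size, stride):
        has_answer = start <= answer_start and answer_end <= end
        examples.append({"span": (start, end), "has_answer": has_answer})
    return examples


def sample_negatives(examples, keep_prob, rng):
    """Keep every positive window; keep each negative with probability keep_prob.
    (Assumed form of negative sampling, not the authors' exact recipe.)"""
    return [ex for ex in examples if ex["has_answer"] or rng.random() < keep_prob]


def per_domain_sample(examples_by_domain, per_domain, rng):
    """Draw up to per_domain examples from each domain so that no single
    large domain dominates the training mix."""
    mixed = []
    for domain in sorted(examples_by_domain):
        pool = examples_by_domain[domain]
        k = min(per_domain, len(pool))
        mixed.extend((domain, ex) for ex in rng.sample(pool, k))
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    rng = random.Random(13)
    # Toy context of 10 tokens whose answer occupies token 5.
    windows = label_windows(10, 5, 6, window_size=4, stride=2)
    kept = sample_negatives(windows, keep_prob=0.5, rng=rng)
    # Two hypothetical domains of very different size.
    data = {"domain_a": kept * 5, "domain_b": kept}
    for domain, ex in per_domain_sample(data, per_domain=2, rng=rng):
        print(domain, ex)
```

Setting keep_prob to 0 in this sketch recovers training on answer-bearing windows only, while 1 keeps every negative window; the value that works best in practice would have to be tuned per dataset.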