Addressing Issues of Cross-Linguality in Open-Retrieval Question Answering Systems For Emergent Domains
This work addresses the scarcity of multilingual resources for reliable question answering in emergent domains, though the contribution is incremental because it builds on existing methods.
The paper tackles cross-lingual open-retrieval question answering in low-resource emergent domains such as COVID-19, showing that a deep semantic retriever trained on automatically generated English-to-all data significantly outperforms a BM25 baseline.
Open-retrieval question answering systems are generally trained and tested on large datasets in well-established domains. However, low-resource settings such as new and emerging domains would especially benefit from reliable question answering systems. Furthermore, multilingual and cross-lingual resources in emergent domains are scarce, so few or no such systems exist. In this paper, we demonstrate a cross-lingual open-retrieval question answering system for the emergent domain of COVID-19. Our system uses a corpus of scientific articles to ensure that retrieved documents are reliable. To address the scarcity of cross-lingual training data in emergent domains, we present a method that uses automatic translation, alignment, and filtering to produce English-to-all datasets. We show that a deep semantic retriever benefits greatly from training on our English-to-all data and significantly outperforms a BM25 baseline in the cross-lingual setting. We illustrate the capabilities of our system with examples and release all code necessary to train and deploy such a system.
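To make the BM25 baseline concrete, the following is a minimal sketch of Okapi BM25 scoring in plain Python. It is not the implementation used in the paper. The whitespace tokenizer, the parameter values k1 = 1.5 and b = 0.75, the smoothed IDF variant, and the two toy English passages are all assumptions chosen for illustration. The sketch shows why a purely lexical retriever tends to struggle in the cross-lingual setting: a query in another language shares few or no tokens with English documents.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real systems use richer analyzers.
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 index over a small in-memory list of documents."""

    def __init__(self, docs, k1=1.5, b=0.75):
        # k1 and b are common default values, assumed here for illustration.
        self.k1 = k1
        self.b = b
        self.docs = [tokenize(d) for d in docs]
        self.n = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.n
        self.tf = [Counter(d) for d in self.docs]
        # Document frequency: number of documents containing each term.
        df = Counter()
        for d in self.docs:
            df.update(set(d))
        # Smoothed IDF variant that is always non-negative.
        self.idf = {
            t: math.log(1 + (self.n - f + 0.5) / (f + 0.5)) for t, f in df.items()
        }

    def score(self, query, i):
        """BM25 score of document i for the query; 0.0 if no query term matches."""
        total = 0.0
        dl = len(self.docs[i])
        for t in tokenize(query):
            f = self.tf[i].get(t, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * f * (self.k1 + 1) / norm
        return total

    def rank(self, query):
        """Document indices sorted by descending score."""
        return sorted(range(self.n), key=lambda i: self.score(query, i), reverse=True)


# Hypothetical toy corpus, not taken from the paper's data.
DOCS = [
    "masks reduce the transmission of the virus",
    "vaccines induce antibodies against the spike protein",
]
index = BM25(DOCS)

english_query = "do masks reduce transmission"
german_query = "reduzieren masken die übertragung"

print(index.rank(english_query)[0])   # index of the top document for the English query
print(index.score(german_query, 0))   # no shared tokens with the English passage
```

In this toy case the English query matches the first passage on three tokens, while the German query with the same meaning matches nothing and scores zero against every passage. A retriever that compares learned representations rather than surface tokens is not bound by this lexical overlap, which is the motivation for training one on English-to-all data.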