CL, AI, IR, LG | Dec 18, 2021

The Web Is Your Oyster - Knowledge-Intensive NLP against a Very Large Web Corpus

arXiv:2112.09924v2 | 78 citations
Originality: Incremental advance
AI Analysis

This work addresses the problem of scaling knowledge-intensive NLP from Wikipedia to web-scale data, which matters to both researchers and practitioners. The advance is incremental because it builds on existing retrieval methods.

The authors evaluate knowledge-intensive NLP tasks in an open-domain setting, using a large web corpus (Sphere) in place of Wikipedia as the knowledge source. Retrieval from Sphere enables a state-of-the-art system to match or outperform Wikipedia-based models on several tasks. Dense indices, however, do not yet outperform the sparse BM25 baseline on Sphere.

In order to address increasing demands of real-world applications, the research for knowledge-intensive NLP (KI-NLP) should advance by capturing the challenges of a truly open-domain environment: web-scale knowledge, lack of structure, inconsistent quality and noise. To this end, we propose a new setup for evaluating existing knowledge intensive tasks in which we generalize the background corpus to a universal web snapshot. We investigate a slate of NLP tasks which rely on knowledge - either factual or common sense, and ask systems to use a subset of CCNet - the Sphere corpus - as a knowledge source. In contrast to Wikipedia, otherwise a common background corpus in KI-NLP, Sphere is orders of magnitude larger and better reflects the full diversity of knowledge on the web. Despite potential gaps in coverage, challenges of scale, lack of structure and lower quality, we find that retrieval from Sphere enables a state of the art system to match and even outperform Wikipedia-based models on several tasks. We also observe that while a dense index can outperform a sparse BM25 baseline on Wikipedia, on Sphere this is not yet possible. To facilitate further research and minimise the community's reliance on proprietary, black-box search engines, we share our indices, evaluation metrics and infrastructure.

Code Implementations: 2 repos
