Looking Beyond Sentence-Level Natural Language Inference for Downstream Tasks
This paper addresses the limited generalizability of NLI models, a concern for NLP practitioners, though the contribution is incremental because it builds on existing datasets and tasks.
The paper tackles the problem that Natural Language Inference (NLI) models fail to generalize to downstream NLP tasks such as question answering (QA) and text summarization. It shows that creating long-premise NLI datasets from existing QA data leads to competitive QA results and state-of-the-art performance in checking the factual correctness of summaries.
In recent years, the Natural Language Inference (NLI) task has garnered significant attention, with new datasets and models achieving near human-level performance on it. However, the full promise of NLI, particularly that it learns knowledge that should generalize to other downstream NLP tasks, has not been realized. In this paper, we study this unfulfilled promise through the lens of two downstream tasks: question answering (QA) and text summarization. We conjecture that a key difference between the NLI datasets and these downstream tasks concerns the length of the premise, and that creating new long-premise NLI datasets out of existing QA datasets is a promising avenue for training a truly generalizable NLI model. We validate our conjecture by showing competitive results on the task of QA and obtaining the best reported results on the task of Checking Factual Correctness of Summaries.
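To make the idea of building long-premise NLI data from QA data concrete, here is a minimal sketch in Python. It assumes a multiple-choice QA item (passage, question, answer options, index of the correct option) and is not the authors' conversion procedure: the names `NLIExample` and `qa_to_nli`, the two-way label set, and the naive "question + option" hypothesis template are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NLIExample:
    premise: str     # the full QA passage, used as a long premise
    hypothesis: str  # a statement built from the question and one answer option
    label: str       # "entailment" or "not_entailment" (assumed label set)


def qa_to_nli(passage: str, question: str, options: List[str],
              correct_index: int) -> List[NLIExample]:
    """Turn one multiple-choice QA item into one NLI example per answer option."""
    examples = []
    for i, option in enumerate(options):
        # Hypothetical template: plain concatenation stands in for a real
        # question-to-statement rewriting step.
        hypothesis = f"{question.strip()} {option.strip()}"
        label = "entailment" if i == correct_index else "not_entailment"
        examples.append(NLIExample(premise=passage, hypothesis=hypothesis, label=label))
    return examples


if __name__ == "__main__":
    passage = "Mara left Madrid in 2010 and has lived in Lisbon ever since."
    for ex in qa_to_nli(passage, "Where does Mara live now?", ["Lisbon", "Madrid"], 0):
        print(ex.label, "|", ex.hypothesis)
```

Under these assumptions, the premise is the whole passage rather than a single sentence, which is the property the conjecture above singles out. A realistic pipeline would likely rewrite each question and option pair into a fluent declarative hypothesis instead of concatenating them.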