News Headline Grouping as a Challenging NLU Task
This work presents news headline grouping as a challenging new benchmark for NLU research, with a substantial performance gap between humans and models. Its contribution is incremental in that it introduces one specific task and dataset.
The paper introduces the HeadLine Grouping (HLG) task and a dataset (HLGD) of 20,056 headline pairs. Human annotators achieve around 0.9 F-1 on HLGD, whereas state-of-the-art Transformer models reach only 0.75 F-1. The paper also proposes an unsupervised Headline Generator Swap model that comes within 3 F-1 points of the best supervised model, and it finds that high-performing models are not consistent in their predictions.
Recent progress in Natural Language Understanding (NLU) has seen the latest models surpass human performance on many standard tasks. These impressive results have led the community to introspect on dataset limitations and to iterate on more nuanced challenges. In this paper, we introduce the task of HeadLine Grouping (HLG) and a corresponding dataset (HLGD) consisting of 20,056 pairs of news headlines, each labeled with a binary judgement as to whether the pair belongs to the same group. On HLGD, human annotators achieve high performance of around 0.9 F-1, while current state-of-the-art Transformer models only reach 0.75 F-1, opening the path for further improvements. We further propose a novel unsupervised Headline Generator Swap model for the task of HeadLine Grouping that achieves within 3 F-1 points of the best supervised model. Finally, we analyze high-performing models with consistency tests and find that models are not consistent in their predictions, revealing modeling limits of current architectures.