CL · Sep 23, 2022

News Category Dataset

arXiv:2209.11429v3 · 126 citations · h-index: 8

Originality: Synthesis-oriented

AI Analysis

This work provides a large-scale, high-quality dataset that NLP researchers and practitioners can use to study the syntax and semantics of authentic news, motivated by the proliferation of fake news.

The authors address the need for authentic news data by presenting the News Category Dataset, which contains around 210k HuffPost headlines from 2012 to 2022 along with metadata that supports a range of NLP tasks.

People rely on news to know what is happening around the world and to inform their daily lives. With fake news proliferating, a large-scale, high-quality source of authentic news articles with published category information is valuable for learning the natural-language syntax and semantics of authentic news. As part of this work, we present a News Category Dataset that contains around 210k news headlines from 2012 to 2022, obtained from HuffPost, along with useful metadata to enable various NLP tasks. In this paper, we also derive some novel insights from the dataset and describe various existing and potential applications of it.
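As a rough illustration of how headline records with category metadata might be explored, the sketch below counts headlines per category and per year. It rests on assumptions that the text above does not state: that records are stored one JSON object per line, and that the fields are named `headline`, `category`, and `date` (with `date` formatted as `YYYY-MM-DD`). The sample rows are hypothetical placeholders, not entries from the dataset.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def parse_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line, assuming JSON Lines input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def category_counts(records: Iterable[dict]) -> Counter:
    """Count headlines per category (assumed field name: 'category')."""
    return Counter(record["category"] for record in records)


def counts_by_year(records: Iterable[dict]) -> Counter:
    """Count headlines per year (assumed field 'date' as 'YYYY-MM-DD')."""
    return Counter(record["date"][:4] for record in records)


# Hypothetical rows for illustration only; not taken from the dataset.
SAMPLE = [
    '{"headline": "Example headline A", "category": "POLITICS", "date": "2012-05-01"}',
    '{"headline": "Example headline B", "category": "POLITICS", "date": "2022-03-14"}',
    '{"headline": "Example headline C", "category": "TRAVEL", "date": "2022-09-02"}',
]

if __name__ == "__main__":
    records = list(parse_records(SAMPLE))
    print(category_counts(records).most_common())
    print(sorted(counts_by_year(records).items()))
```

To run the same counts on a real file, pass an open file handle to `parse_records` in place of `SAMPLE`, after checking that the actual field names match the ones assumed here.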
