No News is Good News: A Critique of the One Billion Word Benchmark
This critique identifies problems with the One Billion Word Benchmark, a widely used NLP benchmark, that matter to researchers building language models and evaluation datasets.
The authors show that models trained on Common Crawl web data partitioned by year perform worse on the One Billion Word Benchmark as the training data becomes more recent, which they attribute to distributional shift. They also find that the benchmark contains harmful text and outdated references to current events.
The One Billion Word Benchmark is a dataset derived from the WMT 2011 News Crawl, commonly used to measure language modeling ability in natural language processing. We train models solely on Common Crawl web scrapes partitioned by year, and demonstrate that they perform worse on this task over time due to distributional shift. Analysis of this corpus reveals that it contains several examples of harmful text, as well as outdated references to current events. We suggest that the temporal nature of news and its distribution shift over time make it poorly suited for measuring language modeling ability, and discuss potential impact and considerations for researchers building language models and evaluation datasets.
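
The year-partitioned comparison described in the abstract can be illustrated with a toy sketch. This is not the authors' setup: it assumes an add-one smoothed unigram model in place of a neural language model, and the sentences, the years 2013 and 2019, and the names `train_unigram`, `perplexity`, `train_by_year`, and `ppl_by_year` are all hypothetical. It only shows the shape of the experiment: one fixed, dated test set scored by models trained on data from different years.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab):
    """Fit an add-one smoothed unigram model over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {word: (counts[word] + 1) / total for word in vocab}


def perplexity(model, tokens):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(math.log(model[tok]) for tok in tokens)
    return math.exp(nll / len(tokens))


# Hypothetical stand-in for a fixed news test set from 2011.
test_tokens = "the senate passed the budget bill".split()

# Hypothetical stand-ins for web-scrape partitions from two different years.
train_by_year = {
    2013: "the senate passed the budget bill on friday".split(),
    2019: "the app update passed the review on friday".split(),
}

# Share one vocabulary so every model assigns nonzero probability to every token.
vocab = set(test_tokens)
for tokens in train_by_year.values():
    vocab.update(tokens)

# Score the same test set with a model trained on each year's partition.
ppl_by_year = {
    year: perplexity(train_unigram(tokens, vocab), test_tokens)
    for year, tokens in sorted(train_by_year.items())
}

for year, ppl in ppl_by_year.items():
    print(year, round(ppl, 2))
```

In this toy data the later partition shares fewer words with the test set, so its perplexity is higher. The real experiment differs in model, scale, and data, and the sketch says nothing about the size of the effect reported by the authors.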