CL, Dec 29, 2020

WikiTableT: A Large-Scale Data-to-Text Dataset for Generating Wikipedia Article Sections

arXiv:2012.14919v2720 citations
AI Analysis

This dataset addresses the lack of large-scale, long-form benchmarks for data-to-text generation, a gap for researchers working on natural language generation.

This paper introduces WikiTableT, a large-scale dataset for data-to-text generation, which pairs millions of Wikipedia sections with their corresponding tabular data and metadata. Benchmarking various strategies, the authors found that while the best approaches generate fluent and high-quality text, they struggle with coherence and factuality.

Datasets for data-to-text generation typically focus either on multi-domain, single-sentence generation or on single-domain, long-form generation. In this work, we cast generating Wikipedia sections as a data-to-text generation task and create a large-scale dataset, WikiTableT, that pairs Wikipedia sections with their corresponding tabular data and various metadata. WikiTableT contains millions of instances, covering a broad range of topics, as well as a variety of flavors of generation tasks with different levels of flexibility. We benchmark several training and decoding strategies on WikiTableT. Our qualitative analysis shows that the best approaches can generate fluent and high-quality texts, but they struggle with coherence and factuality, showing the potential for our dataset to inspire future work on long-form generation.
