nvBench: A Large-Scale Synthesized Dataset for Cross-Domain Natural Language to Visualization Task
nvBench addresses the data bottleneck for researchers and developers working on cross-domain NL2VIS systems, though the contribution is incremental in that the dataset is synthesized from existing (NL, SQL) benchmarks.
The authors address the lack of large-scale benchmarks for natural language to visualization (NL2VIS) tasks by creating nvBench, a dataset of 25,750 (NL, VIS) pairs from 750 tables across 105 domains. Its quality was validated by experts and crowd workers, and the authors use it to train deep learning models for the task.
NL2VIS, which translates natural language (NL) queries into corresponding visualizations (VIS), has attracted growing attention from both commercial visualization vendors and academic researchers. In the last few years, advanced deep learning-based models have achieved human-like abilities in many natural language processing (NLP) tasks, which suggests that deep learning-based techniques are a good choice for advancing the field of NL2VIS. However, a major obstacle is the lack of benchmarks with many (NL, VIS) pairs. We present nvBench, the first large-scale NL2VIS benchmark, containing 25,750 (NL, VIS) pairs from 750 tables over 105 domains, synthesized from (NL, SQL) benchmarks to support the cross-domain NL2VIS task. The quality of nvBench has been extensively validated by 23 experts and 300+ crowd workers. Deep learning-based models trained using nvBench demonstrate that nvBench can advance the field of NL2VIS.
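To make the idea of an (NL, VIS) pair derived from an (NL, SQL) pair concrete, the following is a minimal sketch under stated assumptions. It is not the authors' synthesis procedure or the nvBench file format: the `NLVisPair` record, its field names, the `synthesize_pair` helper, and the example query are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NLVisPair:
    """One hypothetical (NL, VIS) record; field names are illustrative only."""
    nl: str          # natural language request for a visualization
    chart_type: str  # e.g. "bar" or "line"
    sql: str         # data query that produces the values to plot
    table: str       # source table name
    domain: str      # domain the table belongs to


def synthesize_pair(nl: str, sql: str, table: str, domain: str,
                    chart_type: str) -> NLVisPair:
    """Toy synthesis: attach a chart type to an (NL, SQL) pair.

    The NL question is rephrased as a charting request; the SQL is reused
    unchanged as the data part of the visualization.
    """
    vis_nl = f"Draw a {chart_type} chart: {nl[0].lower()}{nl[1:]}"
    return NLVisPair(nl=vis_nl, chart_type=chart_type, sql=sql,
                     table=table, domain=domain)


# Hypothetical (NL, SQL) input, not taken from any real benchmark.
pair = synthesize_pair(
    nl="How many students are in each department?",
    sql="SELECT dept, COUNT(*) FROM student GROUP BY dept",
    table="student",
    domain="college",
    chart_type="bar",
)
print(pair.nl)
```

The sketch keeps the SQL query as the data-selection half of the visualization and adds only the chart type, which is one plausible reading of "synthesized from (NL, SQL) benchmarks"; the actual pipeline may differ substantially.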