Beyond Traditional Benchmarks: Analyzing Behaviors of Open LLMs on Data-to-Text Generation
This work highlights a critical limitation of open LLMs in data-to-text generation, which matters for applications that require accurate information extraction. Its contribution is incremental, however, because it analyzes the errors rather than solving the problem.
The study analyzes open large language models (LLMs) on data-to-text generation and finds that, although they produce fluent text in zero-shot settings, more than 80% of their outputs contain at least one semantic error, according to both human annotators and a GPT-4-based metric.
We analyze the behaviors of open large language models (LLMs) on the task of data-to-text (D2T) generation, i.e., generating coherent and relevant text from structured data. To avoid the issue of LLM training data contamination with standard benchmarks, we design Quintd, a tool for collecting novel structured data records from public APIs. We find that open LLMs (Llama 2, Mistral, and Zephyr) can generate fluent and coherent texts in zero-shot settings from data in common formats collected with Quintd. However, we show that the semantic accuracy of the outputs is a major issue: both according to human annotators and our reference-free metric based on GPT-4, more than 80% of the outputs of open LLMs contain at least one semantic error. We publicly release the code, data, and model outputs.
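The headline figure is an output-level statistic: an output counts as erroneous if it contains at least one annotated semantic error, however many errors it has in total. The sketch below illustrates that counting rule only. The function name `share_with_errors`, the annotation layout, and the error labels are assumptions made for illustration and are not taken from the released code; the toy data is invented and does not reproduce any reported result.

```python
from typing import Dict, List


def share_with_errors(annotations: Dict[str, List[str]]) -> float:
    """Return the fraction of outputs that have at least one annotated error.

    `annotations` maps an output id to the list of error labels found in
    that output (an empty list means no error was annotated).
    """
    if not annotations:
        raise ValueError("no outputs to score")
    # An output is flagged once, regardless of how many errors it contains.
    flagged = sum(1 for errors in annotations.values() if errors)
    return flagged / len(annotations)


# Invented toy annotations; the labels are hypothetical placeholders.
toy = {
    "out-1": ["wrong_value"],
    "out-2": [],
    "out-3": ["unsupported_claim", "wrong_value"],
    "out-4": ["wrong_value"],
    "out-5": [],
}

print(f"{share_with_errors(toy):.0%} of toy outputs contain an error")
```

Counting per output rather than per error makes the statistic strict: a text with a single wrong value is treated the same as one with several, which is one reason the proportion can be high even when most individual statements are correct.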