CLSep 27, 2021

TURINGBENCH: A Benchmark Environment for Turing Test in the Age of Neural Text Generation

arXiv:2109.13296v1676 citations
Originality Synthesis-oriented
AI Analysis

This addresses the need for standardized evaluation in fake news detection and similar applications, though it is incremental as it builds on existing models and tasks.

The authors tackled the lack of a systematic benchmark for distinguishing machine-generated from human-written texts by introducing TuringBench, a benchmark environment with datasets and tasks, and found that FAIR_wmt20 and GPT-3 produce the most human-like texts with the lowest F1 scores in detection tests.

Recent progress in generative language models has enabled machines to generate astonishingly realistic texts. While there are many legitimate applications of such models, there is also a rising need to distinguish machine-generated texts from human-written ones (e.g., fake news detection). However, to our best knowledge, there is currently no benchmark environment with datasets and tasks to systematically study the so-called "Turing Test" problem for neural text generation methods. In this work, we present the TuringBench benchmark environment, which is comprised of (1) a dataset with 200K human- or machine-generated samples across 20 labels {Human, GPT-1, GPT-2_small, GPT-2_medium, GPT-2_large, GPT-2_xl, GPT-2_PyTorch, GPT-3, GROVER_base, GROVER_large, GROVER_mega, CTRL, XLM, XLNET_base, XLNET_large, FAIR_wmt19, FAIR_wmt20, TRANSFORMER_XL, PPLM_distil, PPLM_gpt2}, (2) two benchmark tasks -- i.e., Turing Test (TT) and Authorship Attribution (AA), and (3) a website with leaderboards. Our preliminary experimental results using TuringBench show that FAIR_wmt20 and GPT-3 are the current winners, among all language models tested, in generating the most human-like indistinguishable texts with the lowest F1 score by five state-of-the-art TT detection models. The TuringBench is available at: https://turingbench.ist.psu.edu/

Code Implementations3 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes