CL · AI · LG · Jan 28, 2022

Describing Differences between Text Distributions with Natural Language

arXiv:2201.12323v2 · 64 citations
AI Analysis

The system gives researchers and practitioners a tool for quickly analyzing differences in text data, such as distribution shifts or dataset shortcuts. The contribution is incremental, since it builds on existing language models.

The paper tackles the problem of automatically summarizing the differences between two text distributions by learning natural language hypotheses. On a benchmark of 54 binary classification tasks, its best system generates a description similar to human annotation 76% of the time.

How do two distributions of texts differ? Humans are slow at answering this, since discovering patterns might require tediously reading through hundreds of samples. We propose to automatically summarize the differences by "learning a natural language hypothesis": given two distributions $D_{0}$ and $D_{1}$, we search for a description that is more often true for $D_{1}$, e.g., "is military-related." To tackle this problem, we fine-tune GPT-3 to propose descriptions with the prompt: "[samples of $D_{0}$] + [samples of $D_{1}$] + the difference between them is_____." We then re-rank the descriptions by checking how often they hold on a larger set of samples with a learned verifier. On a benchmark of 54 real-world binary classification tasks, while GPT-3 Curie (13B) only generates a description similar to human annotation 7% of the time, the performance reaches 61% with fine-tuning and re-ranking, and our best system using GPT-3 Davinci (175B) reaches 76%. We apply our system to describe distribution shifts, debug dataset shortcuts, summarize unknown tasks, and label text clusters, and present analyses based on automatically generated descriptions.
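The propose-then-re-rank idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`build_prompt`, `hold_rate`, `rerank`), the scoring rule (fraction of $D_{1}$ samples where a description holds minus the fraction of $D_{0}$ samples), and the keyword-based `toy_verifier` are all assumptions made for illustration. In the paper the proposer is a fine-tuned GPT-3 and the verifier is learned; here both are replaced by hand-written stand-ins so the example runs on its own.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical verifier signature: (description, sample) -> does it hold?
Verifier = Callable[[str, str], bool]


def build_prompt(d0_samples: Sequence[str], d1_samples: Sequence[str]) -> str:
    """Assemble a proposer prompt in the shape quoted in the abstract.

    The exact formatting of the sample lists is an assumption.
    """
    d0_block = "\n".join(d0_samples)
    d1_block = "\n".join(d1_samples)
    return f"{d0_block}\n\n{d1_block}\n\nthe difference between them is"


def hold_rate(description: str, samples: Sequence[str], verifier: Verifier) -> float:
    """Fraction of samples for which the verifier says the description holds."""
    return sum(verifier(description, s) for s in samples) / len(samples)


def rerank(
    descriptions: Sequence[str],
    d0: Sequence[str],
    d1: Sequence[str],
    verifier: Verifier,
) -> List[Tuple[float, str]]:
    """Sort candidates by how much more often they hold on D1 than on D0."""
    scored = [
        (hold_rate(d, d1, verifier) - hold_rate(d, d0, verifier), d)
        for d in descriptions
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy stand-in for the learned verifier: a keyword lookup (assumption).
KEYWORDS = {
    "is military-related": {"army", "troops", "navy"},
    "mentions food": {"pizza", "soup"},
}


def toy_verifier(description: str, sample: str) -> bool:
    return bool(KEYWORDS[description] & set(sample.lower().split()))


d0 = ["I had pizza for lunch", "The soup was cold", "The navy band played"]
d1 = [
    "The army moved north",
    "Troops arrived at dawn",
    "The navy launched a ship",
    "We ate pizza",
]

ranked = rerank(list(KEYWORDS), d0, d1, toy_verifier)
for score, description in ranked:
    print(f"{score:+.2f}  {description}")
```

With these toy inputs, "is military-related" ranks first because it holds on three of the four $D_{1}$ samples but only one of the three $D_{0}$ samples. Swapping `toy_verifier` for a model-backed callable with the same signature would leave `rerank` unchanged.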

Code Implementations: 1 repo