CL, Mar 28, 2023

Synthetically generated text for supervised text analysis

MIT
arXiv:2303.16028v1, 16 citations, h-index: 8
Originality: Synthesis-oriented
AI Analysis

The paper offers a partial solution for political scientists facing data scarcity and privacy concerns, though its contribution is incremental.

The paper tackles the challenges of supervised text models in political science, such as high labeling costs and data privacy issues, by proposing synthetic text generation with large language models, demonstrated through applications like generating tweets about Ukraine and news articles for event detection.

Supervised text models are a valuable tool for political scientists but present several obstacles to their use, including the expense of hand-labeling documents, the difficulty of retrieving rare relevant documents for annotation, and copyright and privacy concerns involved in sharing annotated documents. This article proposes a partial solution to these three issues, in the form of controlled generation of synthetic text with large language models. I provide a conceptual overview of text generation, guidance on when researchers should prefer different techniques for generating synthetic text, a discussion of ethics, and a simple technique for improving the quality of synthetic text. I demonstrate the usefulness of synthetic text with three applications: generating synthetic tweets describing the fighting in Ukraine, synthetic news articles describing specified political events for training an event detection system, and a multilingual corpus of populist manifesto statements for training a sentence-level populism classifier.

Code Implementations: 1 repo
