CLJun 17, 2024

COMMUNITY-CROSS-INSTRUCT: Unsupervised Instruction Generation for Aligning Large Language Models to Online Communities

arXiv:2406.12074v327 citations
Originality Incremental advance
AI Analysis

This enables cost-effective and automated surveying of diverse online communities, addressing the slow, costly, and biased nature of traditional social science surveys.

The paper tackles the problem of aligning large language models to online communities to create digital twins for surveying, introducing Community-Cross-Instruct, an unsupervised framework that automatically generates instruction-output pairs from community discussions to finetune and evaluate models, demonstrating accurate representation of political and diet communities on Reddit.

Social scientists use surveys to probe the opinions and beliefs of populations, but these methods are slow, costly, and prone to biases. Recent advances in large language models (LLMs) enable the creating of computational representations or "digital twins" of populations that generate human-like responses mimicking the population's language, styles, and attitudes. We introduce Community-Cross-Instruct, an unsupervised framework for aligning LLMs to online communities to elicit their beliefs. Given a corpus of a community's online discussions, Community-Cross-Instruct automatically generates instruction-output pairs by an advanced LLM to (1) finetune a foundational LLM to faithfully represent that community, and (2) evaluate the alignment of the finetuned model to the community. We demonstrate the method's utility in accurately representing political and diet communities on Reddit. Unlike prior methods requiring human-authored instructions, Community-Cross-Instruct generates instructions in a fully unsupervised manner, enhancing scalability and generalization across domains. This work enables cost-effective and automated surveying of diverse online communities.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes