RedditPersona: A Modular Framework for Community-Conditioned LLM Adaptation from Reddit

Amirhossein Ghaffari, Ali Goodarzi, Huong Nguyen, Simo Hosio, Lauri Lovén, Ekaterina Gilman

arXiv:2606.06027 · 14.6 · Has Code

Predicted impact: top 60% in AI · last 90 days · Originality: Incremental advance

AI Analysis

For researchers studying community-specific language model adaptation, this framework enables reproducible comparisons of different community definitions and evaluation metrics.

RedditPersona standardizes community-conditioned LLM adaptation by providing a modular framework for data collection, user profiling, grouping strategies, and evaluation. Applied to 112 subreddits, it reveals a consistent trade-off between behavioral identifiability and distributional similarity across five grouping strategies.

Community-conditioned language model adaptation requires choices about data collection, community definition, and evaluation that are currently made independently in each study, making it hard to compare assumptions or reuse artifacts. We present RedditPersona, a modular framework that standardizes these choices: it collects Reddit posts and comments, profiles active users, partitions them under five grouping strategies (subreddit-based, graph-structural, semantic, hybrid, and interaction-based), trains a parameter-efficient adapter per strategy via QLoRA, and evaluates them under a shared metric suite spanning fluency, fidelity, distributional alignment, and community identifiability. Applied to 112 subreddits in the urban well-being domain (301,429 user profiles, 16M+ comments), we find that adapters' behavioral identifiability tracks each strategy's intrinsic agreement with the subreddit baseline, and that a consistent trade-off between identifiability and distributional similarity to real text holds across all five strategies. The code and configuration files are available at: https://github.com/Ahghaffari/redditpersona.
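The abstract relates each adapter's identifiability to how closely its grouping strategy agrees with the subreddit baseline, but does not say how that agreement is measured. As a rough illustration only, the sketch below scores agreement between two user partitions with the plain Rand index (the share of user pairs that both partitions treat the same way). The metric choice, the function name `rand_index`, and the toy user and group labels are all assumptions made for this example, not details taken from RedditPersona.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Return the fraction of user pairs on which two partitions agree.

    Each argument maps a user id to a group label. A pair counts as an
    agreement when both partitions put the two users together, or both
    keep them apart. Only users present in both mappings are compared.
    """
    users = sorted(set(labels_a) & set(labels_b))
    if len(users) < 2:
        raise ValueError("need at least two shared users")

    agree = 0
    total = 0
    for u, v in combinations(users, 2):
        same_a = labels_a[u] == labels_a[v]
        same_b = labels_b[u] == labels_b[v]
        if same_a == same_b:
            agree += 1
        total += 1
    return agree / total


# Hypothetical toy data: four users under a subreddit-based baseline and
# under an alternative (e.g. semantic) grouping. Labels are made up.
subreddit_groups = {"u1": "sub_a", "u2": "sub_a", "u3": "sub_b", "u4": "sub_b"}
semantic_groups = {"u1": 0, "u2": 0, "u3": 1, "u4": 0}

agreement = rand_index(subreddit_groups, semantic_groups)
print(f"agreement with subreddit baseline: {agreement:.2f}")
```

A score of 1.0 means the alternative grouping reproduces the baseline partition exactly. With many users, a chance-corrected measure such as the adjusted Rand index would likely be a better fit, since the plain index does not correct for agreement expected by chance.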
