PersonaTrace: Synthesizing Realistic Digital Footprints with LLM Agents
This work addresses a data scarcity problem for researchers and developers in behavior studies and personalized applications, though the contribution is incremental: it builds on existing synthetic data methods.
The paper tackles the scarcity of diverse digital footprint data by proposing a method that synthesizes realistic digital footprints using LLM agents. The resulting dataset is more diverse and realistic than baselines, and models fine-tuned on it outperform those trained on other synthetic datasets on real-world out-of-distribution tasks.
Digital footprints (records of individuals' interactions with digital systems) are essential for studying behavior, developing personalized applications, and training machine learning models. However, research in this area is often hindered by the scarcity of diverse and accessible data. To address this limitation, we propose a novel method for synthesizing realistic digital footprints using large language model (LLM) agents. Starting from a structured user profile, our approach generates diverse and plausible sequences of user events, ultimately producing corresponding digital artifacts such as emails, messages, calendar entries, and reminders. Intrinsic evaluation results demonstrate that the generated dataset is more diverse and realistic than existing baselines. Moreover, models fine-tuned on our synthetic data outperform those trained on other synthetic datasets when evaluated on real-world out-of-distribution tasks.
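The profile-to-events-to-artifacts flow described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: every name here (`UserProfile`, `Event`, `Artifact`, `generate_events`, `generate_artifacts`, `synthesize`) is hypothetical, the profile fields are assumed for illustration, and the LLM agent is replaced by a deterministic stub (`stub_llm`) so the example runs without any model or API.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical data model; the field choices are assumptions for illustration.
@dataclass
class UserProfile:
    name: str
    occupation: str
    interests: List[str]

@dataclass
class Event:
    day: int
    description: str

@dataclass
class Artifact:
    kind: str      # one of ARTIFACT_KINDS
    event: Event   # the event this artifact was derived from
    text: str

# Artifact types named in the abstract.
ARTIFACT_KINDS = ("email", "message", "calendar", "reminder")

# Any callable mapping a prompt string to generated text.
LLM = Callable[[str], str]

def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for an LLM agent call (assumption, not a real API)."""
    return f"[generated from: {prompt}]"

def generate_events(profile: UserProfile, n_days: int, llm: LLM) -> List[Event]:
    """Stage 1: expand a structured profile into a sequence of user events."""
    events = []
    for day in range(n_days):
        # Cycle through interests so consecutive days differ.
        interest = profile.interests[day % len(profile.interests)]
        prompt = (f"Day {day}: plausible event for {profile.name}, "
                  f"a {profile.occupation} interested in {interest}")
        events.append(Event(day=day, description=llm(prompt)))
    return events

def generate_artifacts(events: List[Event], llm: LLM) -> List[Artifact]:
    """Stage 2: turn each event into one digital artifact."""
    artifacts = []
    for i, event in enumerate(events):
        # Round-robin over artifact kinds; a real system would choose per event.
        kind = ARTIFACT_KINDS[i % len(ARTIFACT_KINDS)]
        text = llm(f"Write a {kind} for: {event.description}")
        artifacts.append(Artifact(kind=kind, event=event, text=text))
    return artifacts

def synthesize(profile: UserProfile, n_days: int, llm: LLM = stub_llm) -> List[Artifact]:
    """Run both stages: profile -> events -> artifacts."""
    return generate_artifacts(generate_events(profile, n_days, llm), llm)

if __name__ == "__main__":
    profile = UserProfile("Ada", "engineer", ["chess", "hiking"])
    for artifact in synthesize(profile, n_days=3):
        print(artifact.kind, "|", artifact.text)
```

Keeping the event stage separate from the artifact stage means that an artifact always traces back to the event that produced it, which is one plausible way to keep the emails, messages, and calendar entries of a single synthetic user mutually consistent.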