CL · CR · LG · Jul 24, 2025

Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs

arXiv:2507.18055v1 · 1 citation · h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses privacy and diversity challenges in synthetic data generation for text-based applications, representing an incremental improvement over existing methods.

The paper addresses limited diversity and privacy risks in synthetic review data generated by LLMs. It proposes metrics to evaluate both aspects and a prompt-based method that improves diversity while preserving privacy, though the abstract does not report specific numerical gains.

The increasing use of synthetic data generated by Large Language Models (LLMs) presents both opportunities and challenges in data-driven applications. While synthetic data provides a cost-effective, scalable alternative to real-world data to facilitate model training, its diversity and privacy risks remain underexplored. Focusing on text-based synthetic data, we propose a comprehensive set of metrics to quantitatively assess the diversity (i.e., linguistic expression, sentiment, and user perspective), and privacy (i.e., re-identification risk and stylistic outliers) of synthetic datasets generated by several state-of-the-art LLMs. Experiment results reveal significant limitations in LLMs' capabilities in generating diverse and privacy-preserving synthetic data. Guided by the evaluation results, a prompt-based approach is proposed to enhance the diversity of synthetic reviews while preserving reviewer privacy.
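To make the idea of quantifying linguistic diversity concrete, here is a minimal sketch of distinct-n, a commonly used lexical diversity ratio (unique n-grams divided by total n-grams). This is an illustrative assumption, not necessarily one of the metrics the paper proposes. The function name `distinct_n`, the whitespace tokenization, and the sample reviews are all hypothetical.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across a list of texts.

    Values near 1.0 suggest varied wording; values near 0.0 suggest
    heavy repetition. Tokenization is naive whitespace splitting,
    chosen only to keep the sketch self-contained.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


# Hypothetical synthetic reviews, invented for illustration only.
repetitive = ["great product fast shipping", "great product fast shipping"]
varied = ["great product fast shipping", "the battery died after two days"]

print(distinct_n(repetitive))  # 0.5: each bigram appears twice
print(distinct_n(varied))      # 1.0: every bigram is unique
```

A surface-level ratio like this captures only wording. The sentiment and user-perspective dimensions named in the abstract would need separate measures.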
