CL | May 23, 2025

Measuring Lexical Diversity of Synthetic Data Generated through Fine-Grained Persona Prompting

Gauri Kambhatla, Chantal Shaib, Venkata Govindarajan

arXiv:2505.17390v2 | 2 citations | h-index: 4 | EMNLP

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of evaluating synthetic data diversity for LLM training, but it is incremental as it focuses on measurement rather than proposing new methods.

The paper measured the lexical diversity of synthetic data generated using fine-grained persona prompting for LLMs, finding that synthetic prompts are less diverse than human-written ones and that fine-grained persona details yield minimal diversity gains compared to simpler methods like length cutoffs.

Fine-grained personas have recently been used for generating 'diverse' synthetic data for pre-training and supervised fine-tuning of Large Language Models (LLMs). In this work, we measure the diversity of persona-driven synthetically generated prompts and responses with a suite of lexical diversity and redundancy metrics. First, we find that synthetic prompts/instructions are significantly less diverse than human-written ones. Next, we sample responses from LLMs of different sizes with fine-grained and coarse persona descriptions to investigate how much fine-grained detail in persona descriptions contributes to generated text diversity. Our results indicate that persona prompting produces higher lexical diversity than prompting without personas, particularly in larger models. In contrast, adding fine-grained persona details yields minimal gains in diversity compared to simply specifying a length cutoff in the prompt.
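
To make "lexical diversity and redundancy metrics" concrete, here is a minimal sketch of two commonly used corpus-level measures: distinct-n (the share of n-grams that are unique) and a compression ratio (raw size over compressed size, where a higher value suggests more redundancy). These are illustrative stand-ins and not necessarily the metrics or implementation used in the paper; the whitespace tokenizer, the choice of zlib, and the function names `distinct_n` and `compression_ratio` are all assumptions.

```python
import zlib


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def distinct_n(texts, n=2):
    """Fraction of n-grams in the corpus that are unique (higher = more diverse)."""
    ngrams = []
    for text in texts:
        tokens = tokenize(text)
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def compression_ratio(texts):
    """Raw byte length divided by zlib-compressed length (higher = more redundant)."""
    raw = "\n".join(texts).encode("utf-8")
    return len(raw) / len(zlib.compress(raw)) if raw else 0.0


if __name__ == "__main__":
    # Hypothetical toy corpora, not data from the paper.
    repetitive = ["write a story about a brave knight"] * 20
    varied = [
        "explain how tides are caused by the moon",
        "draft a polite email declining a meeting",
        "summarize the plot of a detective novel",
    ]
    print("distinct-2:", distinct_n(repetitive), distinct_n(varied))
    print("compression ratio:", compression_ratio(repetitive), compression_ratio(varied))
```

Computing both scores on a set of persona-prompted responses and on a set of human-written texts of similar size would give a rough version of the comparison described above; because such scores are sensitive to text length, the comparison is only meaningful when lengths are controlled.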
