LG Oct 21, 2024

SoftSRV: Learn to Generate Targeted Synthetic Data

Giulia DeSalvo, Jean-François Kagy, Lazaros Karydas, Afshin Rostamizadeh, Sanjiv Kumar

arXiv:2410.16534v3 10.43 citations h-index: 32

Originality: Highly original

AI Analysis

SoftSRV addresses the labor-intensive, domain-specific limitations of prompt engineering for synthetic data generation, offering a more general and practical way to improve model performance in domains such as coding, math, and reasoning.

The paper tackles the problem of generating targeted synthetic fine-tuning data for task-specific model improvement. It introduces SoftSRV, a framework that uses data-driven loss minimization to steer a frozen large language model. The authors report that the resulting data yields significantly better fine-tuned performance and matches the target distribution more closely than data produced by prompt engineering approaches.

We present a novel framework, SoftSRV, that is used to generate targeted synthetic fine-tuning data for improving task-specific model performance. Given a sample from a target distribution, our proposed framework uses a data-driven loss minimization approach to steer a frozen large language model (LLM) to generate synthetic sequences that are similar to those from the target distribution. SoftSRV provides a practical improvement over common prompt engineering approaches that rely on human-engineered prompt-templates, which can be idiosyncratic, labor-intensive to craft, and may need to be specialized per domain. We empirically evaluate our method against standard baselines guiding a large LLM to generate synthetic data to fine-tune a smaller language model on three different domains (coding, math, reasoning). We perform these evaluations without any particular specialization of the framework to each domain, emphasizing the generality of our approach. We find that SoftSRV improves upon typical prompt engineering approaches, generating targeted data that leads to fine-tuned models with significantly better task-specific performance. In addition, SoftSRV-generated data better matches the target distribution according to the MAUVE similarity metric.
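
The abstract does not spell out how the frozen LLM is steered, so the following is only a toy sketch of the general idea: keep the model weights fixed and fit a small set of input-side parameters by minimizing a loss on samples from the target distribution. Everything here is an illustrative assumption rather than the paper's method. The "model" is a fixed linear map `W` followed by a softmax, the steering parameters are a single trainable vector `prompt` (a soft prompt), and `targets` is a made-up sample of token ids.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def frozen_logits(W, prompt):
    # Stand-in for a frozen model: logits are a fixed linear map of the prompt.
    return [sum(w * p for w, p in zip(row, prompt)) for row in W]


def nll(W, prompt, targets):
    # Mean negative log-likelihood of the target tokens under the steered model.
    probs = softmax(frozen_logits(W, prompt))
    return -sum(math.log(probs[t]) for t in targets) / len(targets)


def train_soft_prompt(W, targets, dim, steps=1000, lr=0.02):
    # Gradient descent on the prompt vector only; W is never modified.
    vocab = len(W)
    freq = [0.0] * vocab
    for t in targets:
        freq[t] += 1.0 / len(targets)
    prompt = [0.0] * dim
    for _ in range(steps):
        probs = softmax(frozen_logits(W, prompt))
        # Gradient of the mean NLL w.r.t. the prompt: W^T (probs - freq).
        grad = [
            sum(W[v][j] * (probs[v] - freq[v]) for v in range(vocab))
            for j in range(dim)
        ]
        prompt = [p - lr * g for p, g in zip(prompt, grad)]
    return prompt


rng = random.Random(0)
VOCAB, DIM = 6, 4  # hypothetical toy sizes
W = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
W_before = [row[:] for row in W]  # copy used to check W stays frozen
targets = [2, 2, 2, 5, 2, 5, 2, 2]  # hypothetical target-distribution sample

loss_before = nll(W, [0.0] * DIM, targets)
prompt = train_soft_prompt(W, targets, DIM)
loss_after = nll(W, prompt, targets)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

In this toy setting the loss is convex in `prompt`, so a small step size is enough for the fit to improve steadily. A real setup would replace the linear map with a frozen LLM and backpropagate through it to the trainable input embeddings.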
