CL · Mar 18, 2025
Synthetic Data Generation Using Large Language Models: Advances in Text and Code
Mihai Nadas, Laura Diosan, Andreea Tomescu
This survey reviews how large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains. By producing artificial but task-relevant examples, these models can significantly augment or even substitute for real-world datasets, particularly where labeled data is scarce, expensive, or sensitive. We cover recent advances in using LLMs to create synthetic text and code, highlighting key techniques such as prompt-based generation, retrieval-augmented pipelines, and iterative self-refinement. We examine how these methods can enrich low-resource tasks (e.g., classification, question answering) and, through automated verification of functional correctness, facilitate code-centric applications (e.g., instruction tuning, code translation, bug repair). Alongside the potential benefits (cost-effectiveness, broad coverage, and controllable diversity), we discuss the accompanying challenges, including factual inaccuracies in generated text, insufficient stylistic or distributional realism, and risks of bias amplification. Proposed mitigation strategies range from filtering and weighting synthetic outputs to reinforcement learning with execution feedback in code domains. We conclude by outlining open research directions, such as automated prompt engineering, cross-modal data synthesis, and robust evaluation frameworks, underscoring the growing importance of LLM-generated synthetic data in accelerating AI development while emphasizing ethical and quality safeguards.
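As a rough illustration of filtering synthetic code by functional correctness, the sketch below keeps only candidate snippets that pass a set of test cases. It is not taken from the survey: the helper `passes_tests`, the toy `add` task, and the candidate strings are hypothetical, and `exec` is used without any sandboxing or timeout, which a real pipeline would need.

```python
def passes_tests(source: str, func_name: str, cases) -> bool:
    """Execute candidate source in a scratch namespace and check every test case.

    Hypothetical helper for illustration only: no sandbox, no timeout.
    """
    namespace = {}
    try:
        exec(source, namespace)          # define the candidate function
        func = namespace[func_name]      # KeyError if the function is missing
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


# Hypothetical synthetic candidates for a toy "add two numbers" task.
candidates = [
    "def add(a, b):\n    return a + b\n",   # correct
    "def add(a, b):\n    return a - b\n",   # wrong behaviour
    "def add(a, b)\n    return a + b\n",    # syntax error
]
cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

# Keep only the candidates that pass all test cases.
kept = [src for src in candidates if passes_tests(src, "add", cases)]
print(f"kept {len(kept)} of {len(candidates)} candidates")
```

Only the first candidate survives here; the same pass/fail signal could in principle serve as the execution feedback mentioned above.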
CL · Apr 29, 2025
TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models
Mihai Nadas, Laura Diosan, Andrei Piscoran et al.
Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We close this gap with TF1-EN-3M, the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a six-slot scaffold (character -> trait -> setting -> conflict -> resolution -> moral), produced through a combinatorial prompt engine that guarantees genre fidelity while covering a broad thematic space. A hybrid evaluation pipeline blends (i) a GPT-based critic that scores grammar, creativity, moral clarity, and template adherence with (ii) reference-free diversity and readability metrics. Among ten open-weight candidates, an 8B-parameter Llama-3 variant delivers the best quality-speed trade-off, producing high-scoring fables on a single consumer GPU (<24 GB VRAM) at approximately 13.5 cents per 1,000 fables. We release the dataset, generation code, evaluation scripts, and full metadata under a permissive license, enabling exact reproducibility and cost benchmarking. TF1-EN-3M opens avenues for research in instruction following, narrative intelligence, value alignment, and child-friendly educational AI, demonstrating that large-scale moral storytelling no longer requires proprietary giant models.
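To make the six-slot scaffold concrete, here is a minimal sketch of a combinatorial prompt generator. The slot names follow the abstract; everything else (the slot values, the template wording, and the names `SLOTS`, `TEMPLATE`, and `iter_prompts`) is a hypothetical stand-in, not the released generation code.

```python
from itertools import product

# Slot order as described in the abstract.
SLOT_ORDER = ["character", "trait", "setting", "conflict", "resolution", "moral"]

# Hypothetical slot values; the dataset's actual vocabularies are not given here.
SLOTS = {
    "character": ["fox", "tortoise"],
    "trait": ["greedy", "patient"],
    "setting": ["forest", "riverbank"],
    "conflict": ["a shortage of food", "a race against a rival"],
    "resolution": ["shares with others", "asks for help"],
    "moral": ["generosity is rewarded", "patience outlasts haste"],
}

# Hypothetical prompt template covering all six slots.
TEMPLATE = (
    "Write a short fable about a {trait} {character} in a {setting}. "
    "The conflict is {conflict}; the character {resolution}. "
    "End with the moral: {moral}."
)


def iter_prompts(slots=SLOTS):
    """Yield one prompt per combination of slot values, in a fixed order."""
    for combo in product(*(slots[name] for name in SLOT_ORDER)):
        yield TEMPLATE.format(**dict(zip(SLOT_ORDER, combo)))


prompts = list(iter_prompts())
print(len(prompts))   # two values per slot across six slots: 2**6 combinations
print(prompts[0])
```

Because the number of prompts is the product of the slot sizes, a few dozen values per slot would already be enough to reach millions of distinct prompts.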
CL · Sep 9, 2025
Small Open Models Achieve Near Parity with Large Models in Low Resource Literary Translation at a Fraction of the Cost
Mihai Nadas, Laura Diosan, Andreea Tomescu et al.
Literary translation has recently gained attention as a distinct and complex task in machine translation research. However, literary translation with small open models remains an open problem. We contribute to this ongoing research by introducing the TINYFABULIST TRANSLATION FRAMEWORK (TF2), a unified framework for dataset creation, fine-tuning, and evaluation in English-Romanian literary translation, centred on the creation and open release of both a compact, fine-tuned language model (TF2-12B) and large-scale synthetic parallel datasets (DS-TF2-EN-RO-3M and DS-TF2-EN-RO-15K). Building on DS-TF1-EN-3M (TF1), the largest collection of synthetic English fables to date, we address the need for rich, high-quality literary datasets in low-resource languages such as Romanian. Our pipeline first generates 15k high-quality Romanian references from the TF1 pool using a high-performing LLM. We then apply a two-stage fine-tuning process to a 12B-parameter open-weight model: (i) instruction tuning to capture genre-specific narrative style, and (ii) adapter compression for efficient deployment. Evaluation combines corpus-level BLEU and a five-dimension LLM-based rubric (accuracy, fluency, coherence, style, cultural adaptation) to provide a nuanced assessment of translation quality. Results show that our fine-tuned model achieves fluency and adequacy competitive with top-performing large proprietary models, while being open, accessible, and significantly more cost-effective. Alongside the fine-tuned model and both datasets, we publicly release all scripts and evaluation prompts. TF2 thus provides an end-to-end, reproducible pipeline for research on cost-efficient translation, cross-lingual narrative generation, and the broad adoption of open models for culturally significant literary content in low-resource settings.