CL Mar 19, 2025

ELTEX: A Framework for Domain-Driven Synthetic Data Generation

Arina Razmyslovich, Kseniia Murasheva, Sofia Sedlova, Julien Capitaine, Eugene Dmitriev

arXiv:2503.15055v2, 4 citations, h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses the challenge of making LLMs more effective in specialized domains such as cybersecurity, though it appears incremental because it builds on existing synthetic data generation approaches.

The paper tackles the problem of LLM domain specialization by introducing the ELTEX framework for synthetic data generation, which in a cybersecurity case study enabled a fine-tuned Gemma-2B model to achieve performance competitive with GPT-4o on blockchain cyberattack classification while reducing computational requirements.

We introduce Efficient LLM Token Extraction (ELTEX), a framework addressing the critical challenge of LLM domain specialization by systematically extracting and integrating domain indicators throughout synthetic data generation. Unlike approaches relying on implicit knowledge transfer, ELTEX explicitly leverages domain signals to maintain specialized knowledge integrity. In our cybersecurity case study, ELTEX-enhanced data enables a fine-tuned Gemma-2B model to achieve performance competitive with GPT-4o on blockchain cyberattack classification while reducing computational requirements. Our Google Sheets implementation makes ELTEX accessible to non-technical users. Our contributions include: (1) the ELTEX framework; (2) Google Sheets Add-on implementation; (3) empirical validation showing how ELTEX bridges performance gaps between small and large models; and (4) a synthetic dataset of 11,448 texts for blockchain cyberattack detection.
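The abstract's central idea, extracting domain indicators explicitly and carrying them into the generation step, can be illustrated with a toy sketch. This is not the authors' implementation: the vocabulary `DOMAIN_TERMS`, the functions `extract_indicators` and `build_prompt`, and the prompt wording are all hypothetical, and a real pipeline would send the resulting prompt to an LLM rather than just return it.

```python
import re
from collections import Counter

# Hypothetical indicator vocabulary, chosen only for illustration.
DOMAIN_TERMS = {"reentrancy", "flash loan", "private key", "bridge", "phishing"}


def extract_indicators(texts, vocabulary=DOMAIN_TERMS):
    """Count how many seed texts mention each vocabulary term (case-insensitive)."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for term in vocabulary:
            if re.search(r"\b" + re.escape(term) + r"\b", lowered):
                counts[term] += 1
    return counts


def build_prompt(seed_texts, indicators, n_outputs=3):
    """Build a generation prompt that names the extracted indicators explicitly."""
    # Most frequent terms first; ties broken alphabetically for determinism.
    ranked = sorted(indicators, key=lambda term: (-indicators[term], term))
    lines = [
        f"Write {n_outputs} new short reports in the style of the examples.",
        "Each report must mention at least one of these domain terms: "
        + ", ".join(ranked) + ".",
        "Examples:",
    ]
    lines += [f"- {text}" for text in seed_texts]
    return "\n".join(lines)


seeds = [
    "A flash loan attack drained the bridge.",
    "Phishing site stole a private key.",
    "Bridge exploit via reentrancy.",
]
indicators = extract_indicators(seeds)
prompt = build_prompt(seeds, indicators)
```

Naming the terms in the prompt, rather than hoping the model infers them from the examples, is the contrast the abstract draws with "implicit knowledge transfer".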
