LG AI | Apr 22

Synthetic Data in Education: Empirical Insights from Traditional Resampling and Deep Generative Models

Tapiwa Amion Chinodakufa, Ashfaq Ali Shafin, Khandaker Mamun Ahmed

arXiv:2604.21031 | 20.4 | h-index: 3

AI Analysis

For educational technology practitioners, this study provides the first systematic benchmark of traditional versus deep learning synthetic data methods, along with a practical decision framework for choosing between them.

This study benchmarks traditional resampling (SMOTE, Bootstrap, Random Oversampling) against deep generative models (Autoencoder, VAE, Copula-GAN) on a 10,000-record student dataset and finds a trade-off: resampling achieves near-perfect utility (TSTR 0.997) but no privacy (DCR ~0.00), while deep models provide strong privacy (DCR ~1.00) at a cost in utility. VAEs offer the best compromise, retaining 83.3% predictive performance with complete privacy.

Synthetic data generation offers promise for addressing data scarcity and privacy concerns in educational technology, yet practitioners lack empirical guidance for selecting between traditional resampling techniques and modern deep learning approaches. This study presents the first systematic benchmark comparing these paradigms using a 10,000-record student performance dataset. We evaluate three resampling methods (SMOTE, Bootstrap, Random Oversampling) against three deep learning models (Autoencoder, Variational Autoencoder, Copula-GAN) across three dimensions: distributional fidelity (Kolmogorov-Smirnov distance, Jensen-Shannon divergence), machine learning utility (Train-on-Synthetic-Test-on-Real, or TSTR, scores), and privacy preservation (Distance to Closest Record, or DCR). Our findings reveal a fundamental trade-off: resampling methods achieve near-perfect utility (TSTR: 0.997) but fail completely at privacy protection (DCR ~ 0.00), while deep learning models provide strong privacy guarantees (DCR ~ 1.00) at a significant cost in utility. Variational Autoencoders emerge as the optimal compromise, maintaining 83.3% predictive performance while ensuring complete privacy protection. We also provide actionable recommendations: use traditional resampling for internal development where privacy is controlled, and VAEs for external data sharing where privacy is paramount. This work establishes a foundational benchmark and practical decision framework for synthetic data generation in learning analytics.
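
To make two of the metrics named in the abstract concrete, here is a minimal Python sketch of Distance to Closest Record (DCR) and the two-sample Kolmogorov-Smirnov distance. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say how DCR is normalized, so this version uses plain Euclidean distance on min-max scaled features, and the helper names (`min_max_scale`, `distance_to_closest_record`, `ks_distance`) and the random stand-in data are hypothetical. It shows why bootstrap-style resampling yields a DCR of zero: every synthetic row is an exact copy of a real row.

```python
import numpy as np


def min_max_scale(real, other):
    """Scale both arrays to the [0, 1] range defined by the real data's columns."""
    lo = real.min(axis=0)
    span = real.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    return (real - lo) / span, (other - lo) / span


def distance_to_closest_record(real, synthetic):
    """For each synthetic row, the Euclidean distance to its nearest real row.

    Assumption: unnormalized distances on min-max scaled features; the paper
    may define or normalize DCR differently.
    """
    real_s, syn_s = min_max_scale(real, synthetic)
    # Pairwise differences, shape (n_synthetic, n_real, n_features).
    diffs = syn_s[:, None, :] - real_s[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.min(axis=1)


def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic for one numeric column."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()


rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))  # hypothetical stand-in for student records

# Bootstrap resampling: draw real rows with replacement (exact copies).
bootstrap = real[rng.integers(0, len(real), size=200)]

# Hypothetical stand-in for a generative model: fresh draws, no copied rows.
generated = rng.normal(size=(200, 4))

dcr_bootstrap = distance_to_closest_record(real, bootstrap)
dcr_generated = distance_to_closest_record(real, generated)

print("mean DCR, bootstrap:", dcr_bootstrap.mean())
print("mean DCR, generated:", dcr_generated.mean())
print("KS distance, first column:", ks_distance(real[:, 0], generated[:, 0]))
```

The pairwise distance array grows as n_synthetic × n_real × n_features, so for a dataset of the size described here (10,000 records) a chunked or tree-based nearest-neighbor search would likely be preferable to this brute-force version.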
