45.4 CY Apr 2
Training-Free Private Synthesis with Validation: A New Frontier for Practical Educational Data Sharing
Hibiki Ito, Chia-Yu Hsu, Hiroaki Ogata
While secondary use of real-world data (RWD) in education offers substantial research opportunities, data sharing is often limited by privacy constraints. Differentially private synthetic data generation (DP-SDG) has emerged as a possible solution. However, educational RWD is fragmented across platforms and institutions and stored in different formats, so DP-SDG must be tailored to each dataset, requiring substantial engineering effort. In addition, such data are often small-sample and high-dimensional, making deep learning (DL)-based methods common but difficult to implement without specialist expertise. In this setting, it is also hard to achieve practically useful downstream utility. As a result, despite its theoretical promise, DP-SDG remains far from a practical solution in education. To address this issue, we propose a more practical two-stage method: (1) training-free, LLM-based DP-SDG for sharing synthetic data, and (2) on-demand real-data validation, in which researchers submit code for remote validation of their results. This simple method is designed for individual data custodians without extensive DP-SDG expertise. It can also be adapted to multi-shot synthesis, where data from different learner cohorts are synthesised regularly. We evaluate this method experimentally in both the one-shot and multi-shot synthesis settings using RWD collected over three years, and conduct a case study with real researchers. Results show that LLM-based DP-SDG performs comparably to a DL-based baseline while greatly reducing engineering costs, and that non-DP validation causes measurable but moderate privacy leakage. Nonetheless, in the case study, researchers reported that on average only 36% of synthetic findings are validated on real data. Overall, the paper provides a practical method for sharing educational RWD, while highlighting challenges in risk mitigation and epistemic precision.
CY Feb 9
Cyclic Adaptive Private Synthesis for Sharing Real-World Data in Education
Hibiki Ito, Chia-Yu Hsu, Hiroaki Ogata
The rapid adoption of digital technologies has greatly increased the volume of real-world data (RWD) in education. While these data offer significant opportunities for advancing learning analytics (LA), secondary use for research is constrained by privacy concerns. Differentially private synthetic data generation is regarded as the gold-standard approach to sharing sensitive data, yet studies on the private synthesis of educational data remain very scarce and rely predominantly on large, low-dimensional open datasets. Educational RWD, however, are typically high-dimensional and small in sample size, leaving the potential of private synthesis underexplored. Moreover, because educational practice is inherently iterative, data sharing is continual rather than one-off, making a traditional one-shot synthesis approach suboptimal. To address these challenges, we propose the Cyclic Adaptive Private Synthesis (CAPS) framework and evaluate it on authentic RWD. By iteratively sharing RWD, CAPS not only fosters open science but also offers rich opportunities for design-based research (DBR), thereby amplifying the impact of LA. Our case study using actual RWD demonstrates that CAPS outperforms a one-shot baseline while highlighting challenges that warrant further investigation. Overall, this work offers a crucial first step towards privacy-preserving sharing of educational RWD and expands the possibilities for open science and DBR in LA.
CR Feb 7, 2024
Impact of Dataset Properties on Membership Inference Vulnerability of Deep Transfer Learning
Marlon Tobaben, Hibiki Ito, Joonas Jälkö et al.
Membership inference attacks (MIAs) are used to test practical privacy of machine learning models. MIAs complement formal guarantees from differential privacy (DP) under a more realistic adversary model. We analyse MIA vulnerability of fine-tuned neural networks both empirically and theoretically, the latter using a simplified model of fine-tuning. We show that the vulnerability of non-DP models, measured as the attacker advantage at a fixed false positive rate, decreases according to a simple power law as the number of examples per class increases. A similar power law applies even for the most vulnerable points, but the dataset size needed for adequate protection of those points is very large.
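To make the power-law claim concrete, the following is a minimal sketch of how such a relationship could be fitted and extrapolated. It assumes vulnerability follows c * S**(-alpha) in the number of examples per class S. The function names, the data points, and the values of c and alpha are all hypothetical and chosen for illustration; none of them come from the paper.

```python
import math

def fit_power_law(shots, vulnerability):
    """Fit vulnerability ~ c * shots**(-alpha) by least squares in log-log space.

    Returns (alpha, c). Both inputs must contain strictly positive numbers.
    """
    lx = [math.log(s) for s in shots]
    ly = [math.log(v) for v in vulnerability]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)

def shots_needed(alpha, c, target):
    """Examples per class at which the fitted curve drops to `target`."""
    return (c / target) ** (1.0 / alpha)

# Synthetic illustration only: these numbers are NOT results from the paper.
TRUE_ALPHA, TRUE_C = 0.5, 0.8  # hypothetical power-law parameters
shots = [1, 2, 4, 8, 16, 32, 64, 128]
vulnerability = [TRUE_C * s ** (-TRUE_ALPHA) for s in shots]

alpha, c = fit_power_law(shots, vulnerability)
required = shots_needed(alpha, c, target=0.01)
print(f"alpha={alpha:.3f}, c={c:.3f}, shots for 0.01 vulnerability: {required:.0f}")
```

Under these made-up parameters, pushing the fitted curve down to a small target value already requires thousands of examples per class, which is one way to picture why protecting the most vulnerable points could demand very large datasets.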