LG, AI · Mar 26, 2025

Cyborg Data: Merging Human with AI Generated Training Data

arXiv:2503.22736v17.11 citations · h-index: 1

Originality: Incremental advance

AI Analysis

The work addresses the time and cost of hand-scoring responses in large-scale assessment for educational and testing organizations. It is incremental, however, because it builds on existing model distillation and generative AI techniques.

The paper aims to reduce the amount of hand-scored data that automated scoring systems need. It proposes a model distillation pipeline in which a large generative model (the Teacher), trained on a small subset of the data, scores the remaining data. The resulting 'Cyborg Data' combines human-scored and machine-scored responses. Smaller Student models trained on this data perform comparably to models trained on the entire hand-scored dataset, while requiring only 10% of the original hand-scored data.

Automated scoring (AS) systems used in large-scale assessment have traditionally used small statistical models that require a large quantity of hand-scored data to make accurate predictions, which can be time-consuming and costly. Generative Large Language Models are trained on many tasks and have shown impressive abilities to generalize to new tasks with little to no data. While these models require substantially more computational power to make predictions, they still require some fine-tuning to meet operational standards. Evidence suggests that these models can exceed human-human levels of agreement even when fine-tuned on small amounts of data. With this in mind, we propose a model distillation pipeline in which a large generative model, a Teacher, teaches a much smaller model, a Student. The Teacher, trained on a small subset of the training data, is used to provide scores on the remaining training data, which is then used to train the Student. We call the resulting dataset "Cyborg Data", as it combines human and machine-scored responses. Our findings show that Student models trained on "Cyborg Data" show performance comparable to training on the entire dataset, while only requiring 10% of the original hand-scored data.
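The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`build_cyborg_data`, `train_student`, `toy_trainer`), the random split, and the word-count "model" are all assumptions made for the example. In practice the Teacher would be a fine-tuned generative LLM and the Student a much smaller scoring model.

```python
import random
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str], int]
# A trainer takes (response, score) pairs and returns a scoring function.
Trainer = Callable[[Sequence[Tuple[str, int]]], Scorer]


def build_cyborg_data(
    responses: Sequence[str],
    human_scores: Sequence[int],
    train_teacher: Trainer,
    human_fraction: float = 0.1,  # the 10% figure quoted in the abstract
    seed: int = 0,
) -> List[Tuple[str, int, str]]:
    """Return (response, score, source) rows; source is 'human' or 'teacher'."""
    if len(responses) != len(human_scores):
        raise ValueError("responses and human_scores must have equal length")

    # Assumption: the hand-scored subset is chosen uniformly at random.
    indices = list(range(len(responses)))
    random.Random(seed).shuffle(indices)
    n_human = max(1, round(human_fraction * len(indices)))
    human_idx, machine_idx = indices[:n_human], indices[n_human:]

    # 1. Train the Teacher on the small hand-scored subset only.
    teacher = train_teacher([(responses[i], human_scores[i]) for i in human_idx])

    # 2. Keep human scores for the subset; let the Teacher score the rest.
    rows = [(responses[i], human_scores[i], "human") for i in human_idx]
    rows += [(responses[i], teacher(responses[i]), "teacher") for i in machine_idx]
    return rows


def train_student(rows: Sequence[Tuple[str, int, str]], train_small_model: Trainer) -> Scorer:
    """3. Train the Student on the combined human + Teacher-scored data."""
    return train_small_model([(text, score) for text, score, _ in rows])


def toy_trainer(examples: Sequence[Tuple[str, int]]) -> Scorer:
    """Hypothetical stand-in model: nearest centroid on word count."""
    lengths = {}
    for text, score in examples:
        lengths.setdefault(score, []).append(len(text.split()))
    centroids = {s: sum(v) / len(v) for s, v in lengths.items()}

    def scorer(text: str) -> int:
        n = len(text.split())
        return min(centroids, key=lambda s: abs(centroids[s] - n))

    return scorer
```

Keeping the `source` field on each row makes it easy to audit how much of the Student's training data was machine-scored, or to evaluate the Student only against held-out human scores.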
