Short-answer scoring with ensembles of pretrained language models
This work addresses automated grading for educational applications, but its contribution is incremental: it builds on existing pretrained models and ensemble techniques.
The paper tackles automated short-answer scoring by investigating ensembles of pretrained transformer-based language models. It finds that larger models alone fall short of state-of-the-art results, and that certain ensembles reach state-of-the-art performance but are too large for practical deployment.
We investigate the effectiveness of ensembles of pretrained transformer-based language models on short-answer questions using the Kaggle Automated Short Answer Scoring dataset. We fine-tune a collection of popular small, base, and large pretrained transformer-based language models, and train one feature-based model on the dataset, with the aim of testing ensembles of these models. We use early stopping and hyperparameter optimization during training. We observe that, in general, the larger models perform slightly better; however, they still fall short of state-of-the-art results on their own. Once we consider ensembles, some ensembles of several large networks do produce state-of-the-art results; however, these ensembles are too large to realistically be deployed in a production environment.
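To make the ensembling idea concrete, here is a minimal sketch of one common way to combine fine-tuned classifiers: soft voting, which averages each model's per-class probabilities and picks the highest-scoring class. The abstract does not say which combination rule, evaluation metric, or search procedure was used, so everything here is an assumption for illustration: the function names (`average_probabilities`, `predict_scores`, `accuracy`, `best_ensemble`), the use of plain accuracy as the selection metric, and the exhaustive search over small subsets are all hypothetical.

```python
from itertools import combinations


def average_probabilities(model_probs):
    """Soft voting: average per-class probabilities across models.

    model_probs is a list with one entry per model; each entry is a list of
    per-answer probability rows, all with the same shape.
    """
    n_models = len(model_probs)
    n_answers = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    return [
        [sum(m[i][c] for m in model_probs) / n_models for c in range(n_classes)]
        for i in range(n_answers)
    ]


def predict_scores(probs):
    """Pick the index of the most probable score class for each answer."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]


def accuracy(predicted, gold):
    """Fraction of exact matches (a placeholder metric, assumed here)."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def best_ensemble(named_probs, gold, max_size=3):
    """Try every subset of up to max_size models and keep the best one.

    named_probs maps a (hypothetical) model name to its probability rows on
    a held-out set. Ties keep the earlier, smaller subset.
    """
    best_names, best_acc = None, -1.0
    for size in range(1, max_size + 1):
        for names in combinations(sorted(named_probs), size):
            averaged = average_probabilities([named_probs[n] for n in names])
            acc = accuracy(predict_scores(averaged), gold)
            if acc > best_acc:
                best_names, best_acc = names, acc
    return best_names, best_acc
```

Capping the subset size with `max_size` is one simple way to reflect the deployment concern raised above: the number of candidate subsets grows combinatorially with the number of models, and larger ensembles cost more memory and inference time.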