Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays
This work addresses the challenge of accurately scoring long essays in educational assessment, though the contribution appears incremental because it builds on existing automated scoring methods.
The research tackles automated scoring of long essays, which encoder-based models such as BERT handle poorly because of their token limits. It explores generative language models with summarization and prompting, and reports an increase in scoring accuracy, with quadratic weighted kappa (QWK) rising from 0.822 to 0.8878 on a specific dataset.
BERT and its variants have been extensively explored for automated scoring. However, the 512-token input limit of these encoder-based models is a shortcoming when scoring long essays. This research therefore explores generative language models for automated scoring of long essays via summarization and prompting. The results show a substantial improvement in scoring accuracy, with QWK increasing from 0.822 to 0.8878 on the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
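For readers unfamiliar with the reported metric, the following is a minimal sketch of how quadratic weighted kappa can be computed from human and model scores. It is not the authors' evaluation code. The function name `quadratic_weighted_kappa`, its parameters, and the convention of returning 1.0 when the expected disagreement is zero are assumptions made for illustration.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_rating, max_rating):
    """Agreement between two integer score lists, penalizing large gaps quadratically.

    Hypothetical helper for illustration; names and edge-case handling are assumptions.
    """
    n_ratings = max_rating - min_rating + 1
    n = len(y_true)

    # Observed confusion matrix: rows are human scores, columns are model scores.
    observed = [[0] * n_ratings for _ in range(n_ratings)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_rating][p - min_rating] += 1

    # Marginal histograms, used to build the chance-agreement matrix.
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_ratings))
                 for j in range(n_ratings)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_ratings):
        for j in range(n_ratings):
            # Quadratic weight: 0 on the diagonal, 1 at the largest possible gap.
            weight = (i - j) ** 2 / (n_ratings - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Assumed convention: both raters used a single identical category.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Identical scores give perfect agreement; swapped scores give the minimum.
perfect = quadratic_weighted_kappa([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6], 1, 6)
reversed_pair = quadratic_weighted_kappa([1, 2], [2, 1], 1, 2)
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. The rating bounds in the usage lines are illustrative and should be set to the score scale of the dataset being evaluated.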