Long Context Automated Essay Scoring with Language Models
This work addresses validity concerns in educational assessment by enabling more accurate scoring of long essays, though the contribution is incremental in that it applies existing methods to a specific domain.
The study tackled the problem of automated essay scoring for long student essays that exceed transformer models' length limits, evaluating several modified transformer architectures on the Kaggle ASAP 2.0 dataset to overcome truncation issues.
Transformer-based language models are architecturally constrained to process text of a fixed maximum length. Essays written by higher-grade students frequently exceed the maximum length allowed by many popular open-source models. A common way to handle this when using these models for Automated Essay Scoring is to truncate the input text. Truncation raises serious validity concerns because it undermines the model's ability to fully capture and evaluate the organizational elements of the scoring rubric, which require long contexts to assess. In this study, we use the Kaggle ASAP 2.0 dataset to evaluate several models that modify the standard transformer architecture to overcome these length limitations. The models considered include fine-tuned versions of XLNet, Longformer, ModernBERT, Mamba, and Llama models.
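To make the truncation concern concrete, the following is a minimal sketch, not the study's pipeline. It assumes whitespace-separated words as a stand-in for subword tokens and a hypothetical limit of 512 tokens. The helper names `truncate_tokens` and `fraction_dropped` are illustrative only.

```python
def truncate_tokens(tokens, max_len):
    """Keep only the first max_len tokens, as head truncation does."""
    return tokens[:max_len]


def fraction_dropped(tokens, max_len):
    """Share of the essay's tokens the model never sees after truncation."""
    if not tokens:
        return 0.0
    return max(0, len(tokens) - max_len) / len(tokens)


# Hypothetical essay: 700 filler words followed by a concluding sentence.
# Whitespace splitting is an assumption; real models use subword tokenizers.
essay = " ".join(["word"] * 700) + " In conclusion, the argument holds."
tokens = essay.split()

kept = truncate_tokens(tokens, max_len=512)
print(len(tokens), len(kept))
print("conclusion," in kept)  # the concluding sentence falls past the limit
print(round(fraction_dropped(tokens, 512), 3))
```

In this toy case the conclusion lies entirely beyond the cutoff, so a scorer reading only the kept tokens has no evidence about how the essay ends, which is the kind of organizational information the rubric asks about.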