Detecting AI-Generated Essays in Writing Assessment: Responsible Use and Generalizability Across LLMs
This work addresses a practical problem for educators and assessment bodies: verifying that student submissions are authentic. Its contribution is incremental, since it builds on existing detection methods.
The chapter examines the detection of AI-generated essays in writing assessment. Using essays written in response to GRE writing prompts, it evaluates how well detectors trained on essays from one large language model generalize to essays from other models. It finds that generalization is limited, so detectors need retraining for practical use.
Writing is a foundational literacy skill that underpins effective communication, fosters critical thinking, facilitates learning across disciplines, and enables individuals to organize and articulate complex ideas. Consequently, writing assessment plays a vital role in evaluating language proficiency, communicative effectiveness, and analytical reasoning. The rapid advancement of large language models (LLMs) has made it increasingly easy to generate coherent, high-quality essays, raising significant concerns about the authenticity of student-submitted work. This chapter first provides an overview of the current landscape of detectors for AI-generated and AI-assisted essays, along with guidelines for their responsible use. It then presents empirical analyses to evaluate how well detectors trained on essays from one LLM generalize to identifying essays produced by other LLMs, based on essays generated in response to public GRE writing prompts. These findings provide guidance for developing and retraining detectors for practical applications.
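The cross-LLM evaluation described above can be pictured as a train-on-one, test-on-all matrix. The following is a minimal sketch of that protocol only, not the chapter's actual detectors or data: the single-feature threshold detector, the function names (`fit_threshold`, `accuracy`, `cross_generator_matrix`), the generator labels `llm_a` and `llm_b`, and all numeric values are hypothetical placeholders chosen to exercise the code.

```python
from typing import Dict, List, Tuple

# One sample is (feature value, label); label 1 = AI-generated, 0 = human-written.
# The single numeric feature is a stand-in for whatever a real detector would use.
Sample = Tuple[float, int]
Split = Tuple[List[Sample], List[Sample]]  # (train set, test set)


def fit_threshold(train: List[Sample]) -> float:
    """Toy detector: midpoint between the class means of the feature.

    Assumes AI-generated essays tend to have the higher feature value.
    """
    ai = [x for x, y in train if y == 1]
    human = [x for x, y in train if y == 0]
    return (sum(ai) / len(ai) + sum(human) / len(human)) / 2


def accuracy(threshold: float, test: List[Sample]) -> float:
    """Fraction of test samples classified correctly (feature > threshold means AI)."""
    correct = sum(int(x > threshold) == y for x, y in test)
    return correct / len(test)


def cross_generator_matrix(splits: Dict[str, Split]) -> Dict[str, Dict[str, float]]:
    """Train a detector on each generator's essays, then score it on every test set."""
    matrix: Dict[str, Dict[str, float]] = {}
    for source, (train, _) in splits.items():
        threshold = fit_threshold(train)
        matrix[source] = {
            target: accuracy(threshold, test)
            for target, (_, test) in splits.items()
        }
    return matrix


# Synthetic placeholder data (not from the chapter): human essays share one
# distribution, while each hypothetical LLM's essays sit at a different feature level.
splits = {
    "llm_a": ([(0.9, 1), (0.8, 1), (0.1, 0), (0.2, 0)], [(0.85, 1), (0.15, 0)]),
    "llm_b": ([(0.4, 1), (0.5, 1), (0.1, 0), (0.2, 0)], [(0.45, 1), (0.15, 0)]),
}
matrix = cross_generator_matrix(splits)

for source, row in matrix.items():
    print(source, row)
```

Diagonal entries give in-distribution accuracy and off-diagonal entries give cross-LLM accuracy. A gap between the two is the kind of generalization shortfall that would motivate retraining a detector on essays from newer models.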