CL | Sep 11, 2023

Flesch or Fumble? Evaluating Readability Standard Alignment of Instruction-Tuned Language Models

Joseph Marvin Imperial, Harish Tayyar Madabushi

arXiv:2309.05454v2 | 19.198 citations | h-index: 15 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of aligning AI-generated educational materials with the readability standards that teachers and educators rely on. Its contribution is incremental, since it compares existing models on specific tasks.

The study evaluated instruction-tuned language models on story completion and narrative simplification, using prompts guided by readability standards. It found that models like ChatGPT were less effective and required more refined prompts than open-source models such as BLOOMZ and FlanT5, which showed promising results.

Readability metrics and standards such as Flesch Kincaid Grade Level (FKGL) and the Common European Framework of Reference for Languages (CEFR) exist to guide teachers and educators in assessing the complexity of educational materials before administering them for classroom use. In this study, we select a diverse set of open- and closed-source instruction-tuned language models and investigate their performance in writing story completions and simplifying narratives (tasks that teachers perform) using standard-guided prompts controlling text readability. Our extensive findings provide empirical proof of how globally recognized models like ChatGPT may be considered less effective and may require more refined prompts for these generative tasks compared to other open-source models such as BLOOMZ and FlanT5, which have shown promising results.
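
For readers unfamiliar with FKGL, the metric named in the abstract can be sketched in a few lines of Python. The formula itself is the standard one (0.39 × words per sentence + 11.8 × syllables per word − 15.59). The function names `count_syllables` and `fkgl`, the regex-based sentence and word splitting, and the vowel-group syllable heuristic are illustrative assumptions, not the tooling used in the paper, and the heuristic will miscount some words.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (an assumption, not a dictionary lookup):
    count runs of consecutive vowels and drop one for a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # "make" -> 1, but keep the final syllable in words like "simple".
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fkgl(text: str) -> float:
    """Flesch Kincaid Grade Level of an English text.

    FKGL = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    # Naive splitting: sentences end at ., ! or ?; words are letter runs
    # with an optional internal apostrophe.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


if __name__ == "__main__":
    simple = "The cat sat on the mat."
    dense = "Comprehensive evaluation necessitates sophisticated methodological considerations."
    print(f"simple: {fkgl(simple):.2f}")
    print(f"dense:  {fkgl(dense):.2f}")
```

A lower score indicates text suited to earlier grade levels, so the short sentence of one-syllable words should score well below the dense one. A prompt that asks a model to write at a given grade level could be checked by scoring the model's output this way, though an established readability library would be a more reliable choice than this heuristic.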
