CL | Sep 11, 2023

Flesch or Fumble? Evaluating Readability Standard Alignment of Instruction-Tuned Language Models

arXiv:2309.05454v2 | 98 citations | h-index: 15 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses how well AI-generated educational materials align with the readability standards that teachers and educators rely on. Its contribution is incremental, since it compares existing models on specific tasks.

The study evaluated instruction-tuned language models on story completion and narrative simplification, using prompts guided by readability standards. It found that models like ChatGPT were less effective and required more refined prompts than open-source models such as BLOOMZ and FlanT5, which showed promising results.

Readability metrics and standards such as the Flesch-Kincaid Grade Level (FKGL) and the Common European Framework of Reference for Languages (CEFR) exist to help teachers and educators assess the complexity of educational materials before using them in the classroom. In this study, we select a diverse set of open and closed-source instruction-tuned language models and investigate their performance in writing story completions and simplifying narratives, two tasks that teachers perform, using standard-guided prompts that control text readability. Our extensive findings provide empirical evidence that globally recognized models like ChatGPT may be less effective and may require more refined prompts for these generative tasks than open-source models such as BLOOMZ and FlanT5, which have shown promising results.
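To make the FKGL metric mentioned above concrete, here is a minimal Python sketch of the standard Flesch-Kincaid Grade Level formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The function names `count_syllables` and `fkgl` are hypothetical, and the vowel-group syllable counter and regex-based sentence splitter are simplifying assumptions. This is not the paper's evaluation code, so scores may differ from those of established readability tools.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (an assumption, not a dictionary lookup):
    count runs of consecutive vowels, then discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # "made" -> 1 syllable, but keep "-le" endings such as "table" -> 2.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level using the standard published coefficients."""
    # Naive sentence split on terminal punctuation (assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


if __name__ == "__main__":
    sample = "The cat sat on the mat. It was a warm and sunny day."
    print(f"Estimated grade level: {fkgl(sample):.2f}")
```

A score like this could be used to check whether a generated story completion or simplification lands near the grade level requested in a prompt. Very simple text can score below zero, which is expected behaviour for this formula.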

Code Implementations: 1 repo
