CL · AI · Oct 16, 2024

MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question Generation

arXiv:2410.12893v3 · 7 citations · h-index: 13
Originality: Incremental advance
AI Analysis

This work addresses the cost and impracticality of human evaluation for large-scale automated question generation. The advance is incremental because it builds on existing LLM capabilities.

The paper tackles the automated evaluation of open-ended question generation by proposing MIRROR, a system that uses large language models to produce scores that align more closely with human expert ratings. It reports a higher Pearson's correlation coefficient between GPT-4 and human experts than direct prompting achieves, and its error analysis finds the clearest gains in relevance and appropriateness.

Automatic question generation is a critical task that involves evaluating question quality by considering factors such as engagement, pedagogical value, and the ability to stimulate critical thinking. These aspects require human-like understanding and judgment, which automated systems currently lack. However, human evaluations are costly and impractical for large-scale samples of generated questions. Therefore, we propose a novel system, MIRROR (Multi-LLM Iterative Review and Response for Optimized Rating), which leverages large language models (LLMs) to automate the evaluation process for questions generated by automated question generation systems. We experimented with several state-of-the-art LLMs, such as GPT-4, Gemini, and Llama2-70b. We observed that the scores of human evaluation metrics, namely relevance, appropriateness, novelty, complexity, and grammaticality, improved when using the feedback-based approach called MIRROR, tending to be closer to the human baseline scores. Furthermore, we observed that Pearson's correlation coefficient between GPT-4 and human experts improved when using our proposed feedback-based approach, MIRROR, compared to direct prompting for evaluation. Error analysis shows that our proposed approach, MIRROR, significantly helps to improve relevance and appropriateness.
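The abstract compares evaluation strategies by Pearson's correlation coefficient between LLM scores and human expert scores. The following is a minimal sketch of that comparison, not the authors' code. The helper `pearson` and the three score lists are hypothetical, and the numbers are placeholders invented for illustration, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means, and per-list squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Placeholder ratings on a 1-5 scale for six questions (NOT data from the paper).
human = [4, 2, 5, 3, 1, 4]          # hypothetical human expert scores
direct_prompt = [5, 4, 4, 4, 3, 3]  # hypothetical LLM scores from direct prompting
with_feedback = [4, 3, 5, 3, 2, 4]  # hypothetical LLM scores after a feedback round

r_direct = pearson(human, direct_prompt)
r_feedback = pearson(human, with_feedback)
print(f"direct prompting: r = {r_direct:.3f}")
print(f"feedback-based:   r = {r_feedback:.3f}")
```

A higher coefficient for the feedback-based scores would indicate closer agreement with the human raters. In practice the same calculation would be repeated per metric (relevance, appropriateness, novelty, complexity, grammaticality).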

