Assessing the Pedagogical Readiness of Large Language Models as AI Tutors in Low-Resource Contexts: A Case Study of Nepal's K-10 Curriculum

Pratyush Acharya, Prasansha Bharati, Yokibha Chapagain, Isha Sharma Gauli, Kiran Parajuli

arXiv:2604.0961948.4h-index: 4

Predicted impact top 49% in CY · last 90 daysOriginality Incremental advance

AI Analysis

It addresses the readiness of AI tutors for low-resource, non-Western educational contexts, highlighting incremental improvements needed for deployment.

This study evaluated four large language models as AI tutors for Nepal's K-10 curriculum, revealing a curriculum-alignment gap where models like GPT-4o achieved 97% reliability but showed deficiencies in clarity and cultural contextualization, with regional models failing in over 20% of interactions.

The integration of Large Language Models (LLMs) into educational ecosystems promises to democratize access to personalized tutoring, yet the readiness of these systems for deployment in non-Western, low-resource contexts remains critically under-examined. This study presents a systematic evaluation of four state-of-the-art LLMs--GPT-4o, Claude Sonnet 4, Qwen3-235B, and Kimi K2--assessing their capacity to function as AI tutors within the specific curricular and cultural framework of Nepal's Grade 5-10 Science and Mathematics education. We introduce a novel, curriculum-aligned benchmark and a fine-grained evaluation framework inspired by the "natural language unit tests" paradigm, decomposing pedagogical efficacy into seven binary metrics: Prompt Alignment, Factual Correctness, Clarity, Contextual Relevance, Engagement, Harmful Content Avoidance, and Solution Accuracy. Our results reveal a stark "curriculum-alignment gap." While frontier models (GPT-4o, Claude Sonnet 4) achieve high aggregate reliability (approximately 97%), significant deficiencies persist in pedagogical clarity and cultural contextualization. We identify two pervasive failure modes: the "Expert's Curse," where models solve complex problems but fail to explain them clearly to novices, and the "Foundational Fallacy," where performance paradoxically degrades on simpler, lower-grade material due to an inability to adapt to younger learners' cognitive constraints. Furthermore, regional models like Kimi K2 exhibit a "Contextual Blindspot," failing to provide culturally relevant examples in over 20% of interactions. These findings suggest that off-the-shelf LLMs are not yet ready for autonomous deployment in Nepalese classrooms. We propose a "human-in-the-loop" deployment strategy and offer a methodological blueprint for curriculum-specific fine-tuning to align global AI capabilities with local educational needs.

View on arXiv PDF

Similar