LLMs Are Already Good Tutors: Training-Free Prompt Optimization for Pedagogical Math Tutoring
For researchers and practitioners developing LLM-based tutors, this work demonstrates that prompt optimization alone can match or exceed RL-based methods while requiring only API calls rather than multi-GPU training.
We show that training-free prompt optimization can outperform RL-trained baselines in math tutoring, with ParetoGrad achieving the best balance across post-test solve rate, leak control, and helpfulness, while using only API calls.
Aligning LLMs for math tutoring typically requires RL-based training with multi-GPU infrastructure. We investigate whether training-free prompt optimization, which evolves only the system prompt via API calls, can serve as a practical alternative. We adapt 7 published methods and propose 5 education-specialized methods, evaluating these 12 methods under 5 conditions on 2 OOD benchmark suites. All 12 best-per-method configurations surpass the strongest RL-trained baseline (R_total = 0.633), and our ParetoGrad achieves the best Pareto balance across post-test solve rate, leak control, and helpfulness, rather than dominating any single component. Behavioral analysis with an 82-code educational codebook reveals that training-free methods rely on teaching-knowledge patterns at 2-3x the rate of RL-trained models, with a compensating ~10 percentage-point reduction in intent-level scaffolding. We also find a task-dependent reasoning-mode effect consistent across training-free and RL-based paradigms. Our approach enables efficient development of pedagogically aligned LLM tutors with prompts alone and minimal compute.
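
To make the notion of a Pareto balance across the three reward components concrete, the following is a minimal sketch of a non-dominated filter over candidate system prompts. It is not the ParetoGrad procedure itself, which the abstract does not specify. The function names (`dominates`, `pareto_front`), the candidate names, and all score values are hypothetical placeholders chosen for illustration, and each component is assumed to be scaled so that higher is better.

```python
from typing import Dict, List, Tuple

# Assumed ordering: (post-test solve rate, leak control, helpfulness).
# All three are assumed to be "higher is better"; this is an illustration,
# not the scoring scheme used in the paper.
Scores = Tuple[float, float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """Return True if a is at least as good as b everywhere and strictly better somewhere."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(candidates: Dict[str, Scores]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    front = []
    for name, scores in candidates.items():
        dominated = any(
            dominates(other_scores, scores)
            for other_name, other_scores in candidates.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front


# Made-up toy scores for three hypothetical system prompts (not reported results).
toy = {
    "prompt_a": (0.60, 0.90, 0.50),
    "prompt_b": (0.55, 0.95, 0.70),
    "prompt_c": (0.50, 0.85, 0.45),
}
front = pareto_front(toy)
print(front)
```

In this toy setup, `prompt_c` is dominated by `prompt_a` on every component, while `prompt_a` and `prompt_b` trade off solve rate against leak control and helpfulness, so both remain on the front. A method that balances components in this sense need not rank first on any single one.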