DiagramIR: An Automatic Pipeline for Educational Math Diagram Evaluation
DiagramIR addresses a bottleneck in deploying accessible, scalable education technologies for domains such as mathematics, where visualizations are essential: evaluating generated diagrams at scale.
The authors address scalable evaluation of LLM-generated educational math diagrams by proposing DiagramIR, an automatic pipeline built on intermediate representations of LaTeX TikZ code. The pipeline agrees with human raters more closely than baselines such as LLM-as-a-Judge, and it lets smaller models perform comparably to larger ones at a 10x lower inference cost.
Large Language Models (LLMs) are increasingly being adopted as tools for learning; however, most tools remain text-only, limiting their usefulness for domains where visualizations are essential, such as mathematics. Recent work shows that LLMs are capable of generating code that compiles to educational figures, but a major bottleneck remains: scalable evaluation of these diagrams. We address this by proposing DiagramIR: an automatic and scalable evaluation pipeline for geometric figures. Our method relies on intermediate representations (IRs) of LaTeX TikZ code. We compare our pipeline to other evaluation baselines such as LLM-as-a-Judge, showing that our approach has higher agreement with human raters. This evaluation approach also enables smaller models like GPT-4.1-Mini to perform comparably to larger models such as GPT-5 at a 10x lower inference cost, which is important for deploying accessible and scalable education technologies.
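
To make the idea of evaluating a diagram through an intermediate representation concrete, the following is a minimal sketch and not the DiagramIR implementation. It assumes a toy IR consisting only of straight line segments, extracted with a regular expression from the simplest form of TikZ `\draw` command, and a single rule-based check. The names `Segment`, `tikz_to_ir`, and `is_closed_shape` are hypothetical and do not come from the paper, which does not specify its IR schema or checks in this abstract.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical IR element: one straight line between two points."""
    start: tuple[float, float]
    end: tuple[float, float]


# Assumption: only commands of the form "\draw (x1,y1) -- (x2,y2);" are handled.
_NUM = r"(-?\d+(?:\.\d+)?)"
_DRAW = re.compile(
    r"\\draw\s*\(\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*\)\s*--\s*"
    r"\(\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*\)\s*;"
)


def tikz_to_ir(tikz: str) -> list[Segment]:
    """Extract a list of segments (the toy IR) from TikZ source."""
    return [
        Segment((float(x1), float(y1)), (float(x2), float(y2)))
        for x1, y1, x2, y2 in _DRAW.findall(tikz)
    ]


def is_closed_shape(segments: list[Segment]) -> bool:
    """Rule-based check: every endpoint is shared by exactly two segments."""
    counts: dict[tuple[float, float], int] = {}
    for seg in segments:
        for point in (seg.start, seg.end):
            counts[point] = counts.get(point, 0) + 1
    return bool(segments) and all(c == 2 for c in counts.values())


triangle = r"""
\draw (0,0) -- (4,0);
\draw (4,0) -- (0,3);
\draw (0,3) -- (0,0);
"""
ir = tikz_to_ir(triangle)
print(len(ir), is_closed_shape(ir))
```

The point of the sketch is the design choice the abstract describes: once the TikZ code is reduced to a structured representation, properties of the figure can be checked on that structure rather than by asking a model to judge rendered pixels or raw code directly.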