Language Models as Science Tutors
This work addresses the need for language models to process long scientific documents in educational applications. Its contribution is incremental in that it builds on existing models and datasets.
The authors address the limited real-life usability of language models as scientific assistants in education. They introduce TutorEval, a benchmark for long-context STEM question answering, and TutorChat, a synthetic dialogue dataset. Llemma models fine-tuned on TutorChat perform strongly on TutorEval and on the math benchmarks GSM8K and MATH.
NLP has recently made exciting progress toward training language models (LMs) with strong scientific problem-solving skills. However, model development has not focused on real-life use-cases of LMs for science, including applications in education that require processing long scientific documents. To address this, we introduce TutorEval and TutorChat. TutorEval is a diverse question-answering benchmark consisting of questions about long chapters from STEM textbooks, written by experts. TutorEval helps measure real-life usability of LMs as scientific assistants, and it is the first benchmark combining long contexts, free-form generation, and multi-disciplinary scientific knowledge. Moreover, we show that fine-tuning base models with existing dialogue datasets leads to poor performance on TutorEval. Therefore, we create TutorChat, a dataset of 80,000 long synthetic dialogues about textbooks. We use TutorChat to fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized in math have a 32K-token context window, and they excel at TutorEval while performing strongly on GSM8K and MATH. Our datasets build on open-source materials, and we release our models, data, and evaluations.