A Matter of Interest: Understanding Interestingness of Math Problems in Humans and Language Models
This paper addresses the problem of AI-human alignment in mathematical thought partnerships, particularly for researchers and educators, but its contribution is incremental: it highlights limitations without proposing new solutions.
The paper investigates how well large language models (LLMs) align with human judgments on the interestingness and difficulty of math problems, finding that while LLMs broadly agree with human notions, they fail to capture the distribution of human judgments and show weak correlation with human rationales.
The evolution of mathematics has been guided in part by interestingness. From researchers choosing which problems to tackle next, to students deciding which ones to engage with, people's choices are often guided by judgments about how interesting or challenging problems are likely to be. As AI systems, such as LLMs, increasingly participate in mathematics with people -- whether for advanced research or education -- it becomes important to understand how well their judgments align with human ones. Our work examines this alignment through two empirical studies of human and LLM assessment of mathematical interestingness and difficulty, spanning a range of mathematical experience. We study two groups: participants from a crowdsourcing platform and International Math Olympiad competitors. We show that while many LLMs appear to broadly agree with human notions of interestingness, they mostly do not capture the distribution observed in human judgments. Moreover, most LLMs only somewhat align with why humans find certain math problems interesting, showing weak correlation with human-selected interestingness rationales. Together, our findings highlight both the promises and limitations of current LLMs in capturing human interestingness judgments for mathematical AI thought partnerships.
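The gap between agreeing with humans on average and capturing the distribution of their judgments can be shown with a minimal sketch. Everything here is an assumption made for illustration: the ratings are invented, the 1-5 scale is hypothetical, and Pearson correlation and the 1-D Wasserstein distance are stand-in metrics, not necessarily the ones used in the paper.

```python
from math import sqrt
from statistics import mean

SCALE = [1, 2, 3, 4, 5]  # hypothetical rating scale, not taken from the paper

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def wasserstein_1d(a, b, scale=SCALE):
    """1-D Wasserstein distance between two rating samples on a
    unit-spaced discrete scale: the summed gap between empirical CDFs."""
    cdf_a = cdf_b = dist = 0.0
    for v in scale[:-1]:  # both CDFs equal 1 at the last scale point
        cdf_a += a.count(v) / len(a)
        cdf_b += b.count(v) / len(b)
        dist += abs(cdf_a - cdf_b)
    return dist

# Invented ratings: humans disagree with each other, the model does not.
human = {"p1": [1, 2, 2, 3], "p2": [1, 3, 3, 5], "p3": [3, 4, 4, 5]}
model = {"p1": [2, 2, 2, 2], "p2": [3, 3, 3, 3], "p3": [4, 4, 4, 4]}

problems = sorted(human)

# Agreement on average: correlate per-problem mean ratings.
agreement = pearson([mean(human[p]) for p in problems],
                    [mean(model[p]) for p in problems])

# Agreement in distribution: per-problem distance between rating samples.
gaps = {p: wasserstein_1d(human[p], model[p]) for p in problems}

print(f"correlation of mean ratings: {agreement:.2f}")
for p in problems:
    print(f"{p}: distribution gap = {gaps[p]:.2f}")
```

In this toy data the per-problem means match exactly, so the correlation is perfect, yet every problem shows a nonzero distribution gap because the model's ratings lack the spread of the human ones.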