Diego Zapata‐Rivera

8.9 AI Jul 8

Representation Robustness Under Executable Reasoning Constraints in Large Language Models for Mathematical Problem Solving

Sagnik Nath, Edith Aurora Graf, Liang Zhang et al.

Large language models (LLMs) are increasingly evaluated on mathematical problem solving, yet prior work often treats representationally equivalent formulations as interchangeable and conflates reasoning errors with interface failures. This paper investigates representation robustness in LLM-based mathematical problem solving by systematically varying surface representations of the same underlying problems, including story problems, word-equations, symbolic equations, and isomorphic paraphrases. Using a curated dataset of mathematically equivalent problems, we evaluate five contemporary LLMs under a direct answer generation condition. We find substantial representational sensitivity: models frequently change correctness across equivalent formulations, with nontrivial flip rates across story, symbolic, and word-equation variants. We also observe systematic regressions under isomorphic reformulations, showing that even subtle paraphrase-level changes can degrade performance despite preserved mathematical structure. We then evaluate a code-augmented condition in which models externalize reasoning as executable Python code that is run locally for validation. This interface reveals strong latent reasoning capability in some models that perform poorly under direct prompting, but it does not uniformly improve robustness. Instead, failures shift across interaction layers, from opaque reasoning errors to protocol violations and execution failures. Even when executable reasoning succeeds, representation sensitivity often persists. Overall, our results show that reasoning scaffolds do not eliminate representational brittleness, but expose new tradeoffs among correctness, reliability, latency, and cost. We argue that representation should be treated as a first-class interface design variable in LLM evaluation and deployment, especially for AI-assisted problem-solving systems.
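As a rough illustration of the flip rates mentioned in the abstract, the sketch below counts how often correctness differs between two representations of the same problem. The abstract does not define the metric, so this pairwise definition is an assumption, and the function name `flip_rates`, the representation labels, and the toy results are all hypothetical rather than taken from the paper.

```python
from itertools import combinations


def flip_rates(results):
    """Return the per-pair flip rate across representations.

    `results` maps problem id -> {representation name: answered correctly?}.
    A "flip" (assumed definition) is a problem whose correctness differs
    between the two representations in a pair.
    """
    counts = {}  # (rep_a, rep_b) -> (flips, problems compared)
    for per_rep in results.values():
        for a, b in combinations(sorted(per_rep), 2):
            flips, total = counts.get((a, b), (0, 0))
            counts[(a, b)] = (flips + (per_rep[a] != per_rep[b]), total + 1)
    return {pair: flips / total for pair, (flips, total) in counts.items()}


# Toy data, invented for illustration only.
results = {
    "p1": {"story": True, "symbolic": True, "word_equation": False},
    "p2": {"story": False, "symbolic": True, "word_equation": False},
}
rates = flip_rates(results)
for pair, rate in sorted(rates.items()):
    print(pair, rate)
```

A per-pair rate keeps the comparison symmetric and lets problems with missing variants contribute only to the pairs they cover; a per-problem "any flip" rate would be an equally plausible reading.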

4.1 HC May 2, 2025

Exploring Communication Strategies for Collaborative LLM Agents in Mathematical Problem-Solving

Liang Zhang, Xiaoming Zhai, Jionghao Lin et al.

Large Language Model (LLM) agents are increasingly utilized in AI-aided education to support tutoring and learning. Effective communication strategies among LLM agents improve collaborative problem-solving efficiency and facilitate cost-effective adoption in education. However, little research has systematically evaluated the impact of different communication strategies on agents' problem-solving. Our study examines four communication modes (teacher-student interaction, peer-to-peer collaboration, reciprocal peer teaching, and critical debate) in a dual-agent, chat-based mathematical problem-solving environment using the OpenAI GPT-4o model. Evaluated on the MATH dataset, our results show that dual-agent setups outperform single agents, with peer-to-peer collaboration achieving the highest accuracy. Dialogue acts such as statements, acknowledgments, and hints play a key role in collaborative problem-solving. While multi-agent frameworks enhance computational tasks, effective communication strategies are essential for tackling complex problems in AI education.


2 Papers