AI LG Feb 28, 2025

ARIES: Autonomous Reasoning with LLMs on Interactive Thought Graph Environments

Pedro Gimenes, Zeyu Cao, Jeffrey Wong, Yiren Zhao

arXiv:2502.21208v1 h-index: 5

Originality: Incremental advance

AI Analysis

This work addresses the challenge of improving the efficiency and accuracy of LLM reasoning, and represents an incremental advance over prior graph-based methods.

The paper aims to improve LLM reasoning performance by introducing ARIES, a multi-agent architecture in which policy LLM agents dynamically adapt problem-solving strategies on thought graphs. It reports up to 29% higher accuracy on HumanEval and 35% lower inference costs compared to static transformation schedules.

Recent research has shown that LLM performance on reasoning tasks can be enhanced by scaling test-time compute. One promising approach, particularly for decomposable problems, involves arranging intermediate solutions as a graph on which transformations are performed to explore the solution space. However, prior works rely on pre-determined, task-specific transformation schedules that depend on a set of searched hyperparameters. In this work, we view thought graph transformations as actions in a Markov decision process, and implement policy agents to drive effective action policies for the underlying reasoning LLM agent. In particular, we investigate the ability of another LLM to act as a policy agent on thought graph environments and introduce ARIES, a multi-agent architecture for reasoning with LLMs. In ARIES, reasoning LLM agents solve decomposed subproblems, while policy LLM agents maintain visibility of the thought graph states and dynamically adapt the problem-solving strategy. Through extensive experiments, we observe that using off-the-shelf LLMs as policy agents with no supervised fine-tuning (SFT) can yield up to $29\%$ higher accuracy on HumanEval relative to static transformation schedules, while reducing inference costs by $35\%$ and avoiding any search requirements. We also conduct a thorough analysis of observed failure modes, highlighting that limitations on LLM sizes and the depth of problem decomposition can be seen as challenges to scaling LLM-guided reasoning.
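
The framing above, in which thought graph transformations are actions chosen by a policy agent, can be illustrated with a small toy sketch. This is not the ARIES implementation: the action names (decompose, solve, aggregate, stop), the Thought and ThoughtGraph classes, and the rule-based toy_policy are all hypothetical stand-ins chosen for illustration, and both the policy agent and the reasoning agent are plain Python callables rather than LLMs.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Thought:
    content: str
    kind: str  # "problem", "sub", or "answer" (hypothetical labels)
    solved: bool = False
    parents: List[int] = field(default_factory=list)  # indices of parent nodes


@dataclass
class ThoughtGraph:
    nodes: List[Thought] = field(default_factory=list)

    def add(self, content, kind, solved=False, parents=None):
        self.nodes.append(Thought(content, kind, solved, list(parents or [])))
        return len(self.nodes) - 1


def step(graph: ThoughtGraph, action: str, reasoner: Callable[[str], str]) -> None:
    """Apply one transformation (an MDP action) to the graph state, in place."""
    if action == "decompose":
        # Toy decomposition: one subproblem per whitespace-separated token.
        for part in graph.nodes[0].content.split():
            graph.add(part, "sub", parents=[0])
    elif action == "solve":
        # The reasoning agent solves every unsolved subproblem.
        for node in graph.nodes:
            if node.kind == "sub" and not node.solved:
                node.content = reasoner(node.content)
                node.solved = True
    elif action == "aggregate":
        # Merge all solved subproblems into a single answer node.
        done = [i for i, n in enumerate(graph.nodes) if n.kind == "sub" and n.solved]
        merged = " ".join(graph.nodes[i].content for i in done)
        graph.add(merged, "answer", solved=True, parents=done)
    else:
        raise ValueError(f"unknown action: {action}")


def toy_policy(graph: ThoughtGraph) -> str:
    """Rule-based stand-in for a policy agent that inspects the graph state."""
    if len(graph.nodes) == 1:
        return "decompose"
    if any(n.kind == "sub" and not n.solved for n in graph.nodes):
        return "solve"
    if graph.nodes[-1].kind != "answer":
        return "aggregate"
    return "stop"


def run_episode(problem, policy, reasoner, max_steps=10) -> ThoughtGraph:
    """Let the policy pick transformations until it stops or the budget runs out."""
    graph = ThoughtGraph()
    graph.add(problem, "problem")
    for _ in range(max_steps):
        action = policy(graph)
        if action == "stop":
            break
        step(graph, action, reasoner)
    return graph


if __name__ == "__main__":
    # str.upper plays the role of the reasoning agent in this toy setup.
    final = run_episode("a b c", toy_policy, str.upper)
    print(final.nodes[-1].content)
```

In a setup closer to the one the abstract describes, toy_policy would be replaced by a call to a policy LLM that receives a description of the graph state, and reasoner by a call to the reasoning LLM. The key design point is that the next transformation is chosen from the current graph state at each step instead of being fixed in advance by a static schedule.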
