AI | Aug 26, 2025

Investigating Advanced Reasoning of Large Language Models via Black-Box Interaction

arXiv:2508.19035v1 | h-index: 3
Originality: Incremental advance
AI Analysis

The paper addresses a gap in evaluating integrated reasoning in LLMs, but the advance is incremental: it contributes a new benchmark rather than a fundamental change in method.

The paper tackles the problem of evaluating reasoning ability in Large Language Models (LLMs) by introducing a novel black-box interaction paradigm and the resulting Oracle benchmark, on which the top-ranked model, o3, achieves over 70% accuracy on most easy black-boxes but averages below 40% on some hard tasks.

Existing tasks fall short in evaluating the reasoning ability of Large Language Models (LLMs) in an interactive, unknown environment. This deficiency leads to the isolated assessment of deductive, inductive, and abductive reasoning, neglecting the integrated reasoning process that is indispensable for humans' discovery of the real world. We introduce a novel evaluation paradigm, black-box interaction, to tackle this challenge. A black-box is defined by a hidden function that maps a specific set of inputs to outputs. LLMs are required to unravel the hidden function behind the black-box by interacting with it within a given number of exploration turns and reasoning over the observed input-output pairs. Leveraging this idea, we build the Oracle benchmark, which comprises 6 types of black-box tasks and 96 black-boxes. 19 modern LLMs are benchmarked. o3 ranks first in 5 of the 6 tasks, achieving over 70% accuracy on most easy black-boxes. But it still struggles with some hard black-box tasks, where its average performance drops below 40%. Further analysis indicates a universal difficulty among LLMs: they lack the high-level planning capability to develop efficient and adaptive exploration strategies for hypothesis refinement.
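
The interaction loop described in the abstract can be illustrated with a minimal Python sketch. This is not the benchmark's actual code: the names `BlackBox`, `explore_and_fit_linear`, and `accuracy` are hypothetical, the linear hidden function is a toy and not one of the 6 Oracle task types, and the two-query explorer is only a stand-in for an LLM that chooses inputs and refines a hypothesis.

```python
from typing import Callable, Iterable, List, Tuple


class BlackBox:
    """Hypothetical wrapper: hides a function and caps the number of queries."""

    def __init__(self, hidden_fn: Callable[[int], int], max_turns: int):
        self._hidden_fn = hidden_fn
        self.max_turns = max_turns
        self.history: List[Tuple[int, int]] = []  # observed (input, output) pairs

    def query(self, x: int) -> int:
        # Each query consumes one exploration turn.
        if len(self.history) >= self.max_turns:
            raise RuntimeError("exploration budget exhausted")
        y = self._hidden_fn(x)
        self.history.append((x, y))
        return y


def explore_and_fit_linear(box: BlackBox) -> Callable[[int], int]:
    """Toy explorer (stand-in for an LLM): assumes f(x) = a*x + b, uses two turns."""
    y0 = box.query(0)  # reveals b
    y1 = box.query(1)  # reveals a + b
    a, b = y1 - y0, y0
    return lambda x: a * x + b


def accuracy(hypothesis: Callable[[int], int],
             hidden_fn: Callable[[int], int],
             test_inputs: Iterable[int]) -> float:
    """Fraction of held-out inputs on which the hypothesis matches the hidden function."""
    inputs = list(test_inputs)
    hits = sum(hypothesis(x) == hidden_fn(x) for x in inputs)
    return hits / len(inputs)


def hidden(x: int) -> int:
    # Assumed toy hidden function; the explorer never sees this definition.
    return 3 * x + 2


box = BlackBox(hidden, max_turns=5)
guess = explore_and_fit_linear(box)
score = accuracy(guess, hidden, range(10, 20))
```

The fixed two-query plan works here only because the explorer already assumes a linear form. The difficulty the abstract highlights is the opposite case, where the form is unknown and each query must be chosen adaptively based on what earlier input-output pairs have ruled out.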

