CL · May 29

MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft

arXiv:2605.30931 · 34.2 · h-index: 40 · Has Code
Predicted impact top 13% in CL · last 90 days · Originality: Incremental advance
AI Analysis

This benchmark addresses the unclear ability of MLLM agents to sustain exploration in dynamic open worlds, providing a clearer evaluation for the MLLM research community.

This paper introduces MineExplorer, a benchmark for evaluating MLLM agents' open-world exploration in Minecraft. It filters out atomic tasks that depend heavily on Minecraft-specific knowledge and composes the remaining atomic tasks into multi-hop tasks. Experiments show that strong MLLM agents handle many single-hop tasks, but their performance degrades sharply when they must coordinate hidden prerequisites over longer trajectories.

Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and game-based benchmarks often compress interaction into short-horizon tasks or entangle success with domain-specific game mechanics. In this paper, we introduce MineExplorer, a benchmark for evaluating the open-world exploration capabilities of MLLM agents in Minecraft. We first filter out atomic tasks whose solutions rely heavily on Minecraft-specific knowledge, so that the benchmark better reflects general open-world reasoning. We then organize the benchmark around a ReAct-style capability formulation and compose atomic tasks into implicit multi-hop tasks. To construct reliable instances, MineExplorer uses a multi-agent synthesis workflow that jointly designs task graphs, sandbox scenes, and rule-based milestone evaluators. Human evaluation shows that the multi-agent synthesis workflow produces significantly more reliable instances than a single-agent baseline. Experiments with advanced MLLM agents show that open-world exploration remains challenging: strong models can handle many single-hop tasks but degrade sharply when hidden prerequisites must be coordinated over longer trajectories. Further analysis finds that task difficulty tracks agent completion, and that larger models or thinking modes do not consistently translate into better performance. Code and dataset are available at https://github.com/Jometeorie/MineExplorer.
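
To make the idea of a task graph with a rule-based milestone evaluator concrete, here is a minimal Python sketch. It is not taken from the MineExplorer repository: the class `AtomicTask`, the functions `milestone_order` and `evaluate_trajectory`, and the task names are all hypothetical, and the scoring rule (a milestone counts only once all of its prerequisites have been reached) is an assumption chosen for illustration.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class AtomicTask:
    """Hypothetical atomic task; prerequisites are names of other tasks."""
    name: str
    prerequisites: tuple[str, ...] = ()


def milestone_order(tasks):
    """Return one valid completion order for the task graph."""
    graph = {t.name: set(t.prerequisites) for t in tasks}
    return list(TopologicalSorter(graph).static_order())


def evaluate_trajectory(tasks, events):
    """Assumed rule: an event counts as a milestone only if every
    prerequisite of that task was already reached earlier."""
    by_name = {t.name: t for t in tasks}
    reached = []
    for event in events:
        task = by_name.get(event)
        if task is None or event in reached:
            continue  # ignore unknown events and repeated milestones
        if all(p in reached for p in task.prerequisites):
            reached.append(event)
    return len(reached) / len(tasks), reached


# A three-hop chain with made-up task names.
tasks = [
    AtomicTask("find_key"),
    AtomicTask("open_chest", ("find_key",)),
    AtomicTask("take_tool", ("open_chest",)),
]

# The first "open_chest" is rejected because "find_key" has not happened yet.
events = ["open_chest", "find_key", "open_chest", "walk"]
score, reached = evaluate_trajectory(tasks, events)
print(milestone_order(tasks))
print(reached, round(score, 2))
```

In this sketch the prerequisites stay hidden from the agent and are visible only to the evaluator, which is one way to read "implicit multi-hop": the agent sees only the final goal, while partial credit is assigned per milestone.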

Code Implementations: 1 repo
