AIMay 7

Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

Roger Creus Castanyer, Pablo Samuel Castro, Glen Berseth

arXiv:2605.0686942.3

AI Analysis

Provides a common evaluation ground for diverse agent paradigms, enabling fair comparison and driving progress toward general autonomous agents.

Agentick is a unified benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human agents on 37 procedurally generated tasks across six capability categories. Evaluation of 27 configurations over 90,000 episodes shows no single approach dominates, with GPT-5 mini leading at 0.309 oracle-normalized score, and substantial room for improvement remains across all paradigms.

AI agent research spans a wide spectrum: from RL agents that learn from scratch to foundation model agents that leverage pre-trained knowledge, yet no unified benchmark enables fair comparison across these approaches. We present Agentick, a benchmark for sequential decision-making agents designed to evaluate RL, LLM, VLM, hybrid, and human agents on common ground and to power research on the fundamental challenges of sequential decision-making. Agentick provides 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities, all exposed through a single Gymnasium-compatible interface. The benchmark ships with a Coding API, oracle reference policies for all tasks, pre-built SFT datasets, a composable agent harness, and a live leaderboard. An evaluation spanning 27 configurations and over 90,000 episodes reveals that no single approach dominates: GPT-5 mini leads overall at 0.309 oracle-normalized score while PPO dominates planning and multi-agent tasks; the reasoning harness multiplies LLM performance by 3-10x; and ASCII observations consistently outperform natural language. These findings highlight the substantial room for improvement that remains across all agent paradigms. Agentick's capability-decomposed, multi-modal design provides the empirical infrastructure needed to drive progress toward general autonomous agents, both as an evaluation framework and as a training ground for RL post-training of foundation models in truly sequential environments.

View on arXiv PDF

Similar