CUBE: A Standard for Unifying Agent Benchmarks
CUBE addresses a critical productivity issue for AI researchers: by standardizing benchmark integration, it aims to prevent further fragmentation as new benchmarks emerge.
The paper tackles fragmentation in agent benchmarks by proposing CUBE, a universal protocol standard that unifies benchmarks, reducing integration effort and enabling cross-platform access without custom work.
The proliferation of agent benchmarks has created critical fragmentation that threatens research productivity. Each new benchmark requires substantial custom integration, creating an "integration tax" that limits comprehensive evaluation. We propose CUBE (Common Unified Benchmark Environments), a universal protocol standard built on MCP and Gym that allows benchmarks to be wrapped once and used everywhere. By separating task, benchmark, package, and registry concerns into distinct API layers, CUBE enables any compliant platform to access any compliant benchmark for evaluation, RL training, or data generation without custom integration. We call on the community to contribute to the development of this standard before platform-specific implementations deepen fragmentation as benchmark production accelerates through 2026.
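To make the four-layer separation concrete, the following is a minimal sketch of how task, benchmark, package, and registry concerns might be kept apart. It is an illustration only, not CUBE's actual API: every name here (`Task`, `Benchmark`, `Package`, `Registry`, `EchoTask`, `EchoBenchmark`, `evaluate`) is hypothetical, the Gym-style `reset`/`step` loop is simplified to plain strings and a three-value return, and the MCP transport is omitted entirely.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol, Tuple


class Task(Protocol):
    """Task layer (hypothetical): one episode with a Gym-style loop."""

    def reset(self) -> str: ...
    def step(self, action: str) -> Tuple[str, float, bool]: ...


class Benchmark(Protocol):
    """Benchmark layer (hypothetical): enumerates and creates tasks."""

    def task_ids(self) -> List[str]: ...
    def make_task(self, task_id: str) -> Task: ...


@dataclass
class EchoTask:
    """Toy task: the agent must repeat the target word."""

    target: str

    def reset(self) -> str:
        return f"Repeat this word: {self.target}"

    def step(self, action: str) -> Tuple[str, float, bool]:
        reward = 1.0 if action == self.target else 0.0
        return "done", reward, True  # observation, reward, episode finished


@dataclass
class EchoBenchmark:
    """Toy benchmark: a fixed mapping from task id to target word."""

    words: Dict[str, str]

    def task_ids(self) -> List[str]:
        return sorted(self.words)

    def make_task(self, task_id: str) -> Task:
        return EchoTask(self.words[task_id])


@dataclass
class Package:
    """Package layer (hypothetical): metadata plus a benchmark factory."""

    name: str
    version: str
    factory: Callable[[], Benchmark]


class Registry:
    """Registry layer (hypothetical): looks packages up by name."""

    def __init__(self) -> None:
        self._packages: Dict[str, Package] = {}

    def register(self, package: Package) -> None:
        self._packages[package.name] = package

    def load(self, name: str) -> Benchmark:
        return self._packages[name].factory()


def evaluate(benchmark: Benchmark, policy: Callable[[str], str]) -> float:
    """Platform side: depends only on the interfaces, not on any benchmark."""
    scores = []
    for task_id in benchmark.task_ids():
        task = benchmark.make_task(task_id)
        obs, done, total = task.reset(), False, 0.0
        while not done:
            obs, reward, done = task.step(policy(obs))
            total += reward
        scores.append(total)
    return sum(scores) / len(scores)


registry = Registry()
registry.register(
    Package("echo", "0.1.0", lambda: EchoBenchmark({"t1": "cube", "t2": "gym"}))
)
# A trivial policy that extracts the word after the colon.
score = evaluate(registry.load("echo"), lambda obs: obs.split(": ")[1])
```

In this sketch `evaluate` never refers to `EchoBenchmark` directly, which is the property the abstract describes: a benchmark wrapped once behind the shared interfaces could be consumed by any platform that speaks them, whether for evaluation, RL training, or data generation.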