CV | Dec 19, 2025

ReX-MLE: The Autonomous Agent Benchmark for Medical Imaging Challenges

arXiv:2512.17838v1 | 1 citation | h-index: 38
Originality: Incremental advance
AI Analysis

This work addresses a gap in evaluating autonomous agents on demanding medical imaging tasks. The advance is incremental: it builds on existing agent benchmarks and extends them to domain-specific workflows.

The authors address the ineffectiveness of autonomous coding agents on complex, domain-specific scientific tasks such as medical imaging. They introduce ReX-MLE, a benchmark of 20 challenges derived from medical imaging competitions, and find that state-of-the-art agents perform poorly: most submissions rank in the 0th percentile compared to human experts.

Autonomous coding agents built on large language models (LLMs) can now solve many general software and machine learning tasks, but they remain ineffective on complex, domain-specific scientific problems. Medical imaging is a particularly demanding domain, requiring long training cycles, high-dimensional data handling, and specialized preprocessing and validation pipelines, capabilities not fully measured in existing agent benchmarks. To address this gap, we introduce ReX-MLE, a benchmark of 20 challenges derived from high-impact medical imaging competitions spanning diverse modalities and task types. Unlike prior ML-agent benchmarks, ReX-MLE evaluates full end-to-end workflows, requiring agents to independently manage data preprocessing, model training, and submission under realistic compute and time constraints. Evaluating state-of-the-art agents (AIDE, ML-Master, R&D-Agent) with different LLM backends (GPT-5, Gemini, Claude), we observe a severe performance gap: most submissions rank in the 0th percentile compared to human experts. Failures stem from domain-knowledge and engineering limitations. ReX-MLE exposes these bottlenecks and provides a foundation for developing domain-aware autonomous AI systems.
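To make the "0th percentile" result concrete, the sketch below shows one common way to rank a single submission against a human leaderboard: the share of human entries that the submission strictly beats. This is an illustration only. The abstract does not state the exact percentile definition ReX-MLE uses, and the function name `leaderboard_percentile`, its parameters, and the scores are all hypothetical.

```python
def leaderboard_percentile(agent_score, human_scores, higher_is_better=True):
    """Return the percentage of human entries that the agent strictly beats.

    Assumed definition for illustration; the benchmark's own formula may differ
    (for example in how ties are handled).
    """
    if not human_scores:
        raise ValueError("human_scores must not be empty")
    if higher_is_better:
        beaten = sum(1 for s in human_scores if s < agent_score)
    else:
        beaten = sum(1 for s in human_scores if s > agent_score)
    return 100.0 * beaten / len(human_scores)


# Made-up Dice scores for a segmentation challenge (not data from the paper).
human_dice = [0.91, 0.89, 0.88, 0.85, 0.80]

print(leaderboard_percentile(0.42, human_dice))  # 0.0: beats no human entry
print(leaderboard_percentile(0.90, human_dice))  # 80.0: beats 4 of 5 entries
```

Under this definition, a submission lands in the 0th percentile whenever it scores below every human entry, no matter how large or small the margin is.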

