Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation

arXiv:2603.13099 | 55.1 | h-index: 2

AI Analysis

The paper addresses opaque reasoning in multimodal AI systems, giving researchers and developers a diagnostic tool and a training method, though its contributions to benchmarking and reward design are incremental.

The authors introduce CRYSTAL, a benchmark of 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps. Evaluating 20 MLLMs reveals systematic failures that answer accuracy hides, including universal cherry-picking (precision far exceeds recall) and disordered reasoning: no competitive model preserves more than 60% of matched steps in the correct order. They also propose CPR-Curriculum, which improves Match F1 by 32% under GRPO training without manual step annotation.

We introduce CRYSTAL (Clear Reasoning via Yielded Steps, Traceability, and Logic), a diagnostic benchmark with 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps. We propose two complementary metrics: Match F1, which scores step-level precision and recall via semantic similarity matching, and Ordered Match F1, which further penalizes disordered reasoning chains. References are constructed through a Delphi-inspired pipeline in which four independent MLLMs generate trajectories, which are then aggregated via semantic clustering and validated through human quality gates. Evaluation of 20 MLLMs, including commercial frontier systems not used during benchmark construction, reveals systematic failures that are invisible to answer accuracy: universal cherry-picking (precision far exceeds recall), non-monotonic scaling trade-offs, and disordered reasoning in which no competitive model preserves more than 60% of matched steps in the correct order. Beyond evaluation, we propose the Causal Process Reward (CPR), a multiplicative reward that couples answer correctness with step-level alignment, and CPR-Curriculum, which progressively increases reasoning difficulty during training. CPR-Curriculum achieves a 32% improvement in Match F1 via GRPO where additive reward strategies fail, improving reasoning without manual step annotation.
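
To make the two metrics and the reward coupling concrete, here is a minimal sketch rather than the authors' implementation. It assumes a token-overlap (Jaccard) similarity as a stand-in for semantic similarity, a greedy one-to-one matcher, a hypothetical threshold of 0.5, and an ordering penalty based on the longest increasing subsequence of matched reference indices. The abstract does not specify the exact CPR formula, so the multiplicative form shown is only an illustration of the idea. All function names are hypothetical.

```python
from bisect import bisect_left


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a crude stand-in for a semantic embedding model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def match_steps(pred, ref, threshold=0.5):
    """Greedy one-to-one matching (assumed); returns sorted (pred_idx, ref_idx) pairs."""
    candidates = sorted(
        ((jaccard(p, r), i, j) for i, p in enumerate(pred) for j, r in enumerate(ref)),
        reverse=True,
    )
    used_pred, used_ref, pairs = set(), set(), []
    for score, i, j in candidates:
        if score < threshold:
            break  # remaining candidates are all below the threshold
        if i not in used_pred and j not in used_ref:
            used_pred.add(i)
            used_ref.add(j)
            pairs.append((i, j))
    return sorted(pairs)


def f1(n_matched: int, n_pred: int, n_ref: int) -> float:
    """Harmonic mean of step-level precision and recall."""
    if n_matched == 0:
        return 0.0
    precision, recall = n_matched / n_pred, n_matched / n_ref
    return 2 * precision * recall / (precision + recall)


def lis_length(seq) -> int:
    """Length of the longest strictly increasing subsequence."""
    tails = []
    for x in seq:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)


def match_f1(pred, ref, threshold=0.5) -> float:
    return f1(len(match_steps(pred, ref, threshold)), len(pred), len(ref))


def ordered_match_f1(pred, ref, threshold=0.5) -> float:
    """Counts only matched steps that appear in reference order (assumed penalty)."""
    pairs = match_steps(pred, ref, threshold)
    in_order = lis_length([j for _, j in pairs])
    return f1(in_order, len(pred), len(ref))


def multiplicative_reward(answer_correct: bool, step_score: float) -> float:
    """Assumed CPR-style coupling: no credit for steps if the answer is wrong."""
    return float(answer_correct) * step_score


def additive_reward(answer_correct: bool, step_score: float, weight: float = 0.5) -> float:
    """Additive baseline: step credit survives a wrong answer."""
    return weight * float(answer_correct) + (1 - weight) * step_score


ref = ["count the red blocks", "count the blue blocks", "compare the two counts"]
pred = ["compare the two counts", "count the red blocks"]  # two correct steps, reversed

print(round(match_f1(pred, ref), 2))          # 0.8: precision 1.0, recall 2/3
print(round(ordered_match_f1(pred, ref), 2))  # 0.4: only one matched step is in order
```

In this toy case the prediction skips a reference step and reverses the other two, so precision exceeds recall (the cherry-picking pattern) and the ordered score drops to half the unordered one. With a wrong final answer, the multiplicative form returns zero reward while the additive baseline still pays out for the matched steps.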

