LG · Mar 9

MJ1: Multimodal Judgment via Grounded Verification

arXiv:2603.07990v1 · 16.7 · h-index: 6

Predicted impact: top 9% in LG · last 90 days

Originality: Highly original

AI Analysis

This work matters to the multimodal AI community because it shows how to improve a multimodal judge's visual grounding, particularly on tasks where the verdict depends on detailed visual evidence.

The paper introduces MJ1, a multimodal judge trained with reinforcement learning that grounds its decisions in visual evidence through a structured grounded verification chain and a counterfactual consistency reward that penalizes position bias. Without any training, the mechanism alone improves base-model accuracy on MMRB2 by +3.8 points on Image Editing and +1.7 on Multimodal Reasoning. After training, MJ1 reaches 77.0% accuracy on MMRB2 with only 3B active parameters, surpassing much larger models such as Gemini-3-Pro.

Multimodal judges struggle to ground decisions in visual evidence. We present MJ1, a multimodal judge trained with reinforcement learning that enforces visual grounding through a structured grounded verification chain (observations $\rightarrow$ claims $\rightarrow$ verification $\rightarrow$ evaluation $\rightarrow$ scoring) and a counterfactual consistency reward that penalizes position bias. Even without training, our mechanism improves base-model accuracy on MMRB2 by +3.8 points on Image Editing and +1.7 on Multimodal Reasoning. After training, MJ1, with only 3B active parameters, achieves 77.0% accuracy on MMRB2 and surpasses orders-of-magnitude larger models like Gemini-3-Pro. These results show that grounded verification and consistency-based training substantially improve multimodal judgment without increasing model scale.
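
The abstract does not give the formula for the counterfactual consistency reward, so the following is only a minimal sketch of one way such a reward could penalize position bias: the same pair is judged in both presentation orders, and credit is given only when the verdict survives the swap and matches the reference label. The `Judge` signature, the `"A"`/`"B"` labels, and the 1.0/0.0 reward values are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable

# Hypothetical judge interface (an assumption, not from the paper):
# takes (prompt, response_shown_first, response_shown_second) and
# returns "A" if it prefers the first-shown response, "B" otherwise.
Judge = Callable[[str, str, str], str]


def flip(label: str) -> str:
    """Map a verdict given in swapped order back to the original order."""
    return "B" if label == "A" else "A"


def consistency_reward(judge: Judge, prompt: str, first: str,
                       second: str, gold: str) -> float:
    """Illustrative position-swap reward.

    `gold` is "A" if `first` is the better response, "B" otherwise.
    Returns 1.0 only when the judge picks the same underlying response
    in both orders and that response is the gold one; otherwise 0.0.
    The reward values are placeholders chosen for this sketch.
    """
    forward = judge(prompt, first, second)
    swapped = flip(judge(prompt, second, first))
    if forward != swapped:
        # The verdict changed when only the positions changed.
        return 0.0
    return 1.0 if forward == gold else 0.0
```

Under this sketch, a judge that always prefers whichever response is shown first earns no reward on any pair, while a judge whose preference follows the content earns full reward whenever that preference is correct.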
