Yimeng Ye

h-index9
2papers

2 Papers

CLOct 9, 2025Code
ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping

Shuang Chen, Yue Guo, Yimeng Ye et al.

Recent advances in multimodal large reasoning models (MLRMs) have substantially improved their ability to solve complex textual and visual tasks. However, these models tend to overthink on simple problems, producing unnecessarily lengthy reasoning traces, while under-exploring on challenging ones, leading to missed solutions. To address this imbalance, we propose ARES, a unified open-source framework for adaptive reasoning that dynamically allocates exploration effort based on task difficulty. Our approach is motivated by two key empirical findings: (i) while single-token entropy is noisy, high window-entropy (HWE) tokens (token-level entropies averaged under a sliding window) can reliably capture reasoning-critical moments; and (ii) reducing HWE usage benefits easy problems, while increasing it is essential for solving hard ones. Building on these insights, ARES introduces a two-stage training pipeline. In the Adaptive Cold-Start stage, we curate multimodal and textual data paired with reasoning traces of length proportional to problem difficulty, equipping the model with initial difficulty awareness. In the second stage, we develop Adaptive Entropy Policy Optimization (AEPO), which uses HWE tokens as exploration triggers to decide when to explore, and a hierarchical entropy reward with dynamic KL control to decide how much to explore. Extensive experiments demonstrate that ARES achieves superior performance and reasoning efficiency across diverse mathematical, logical, and multimodal benchmarks, while closing the gap to leading commercial systems under significantly lower inference costs.

CVJun 20, 2025
Co-VisiON: Co-Visibility ReasONing on Sparse Image Sets of Indoor Scenes

Chao Chen, Nobel Dang, Juexiao Zhang et al.

Humans exhibit a remarkable ability to recognize co-visibility-the 3D regions simultaneously visible in multiple images-even when these images are sparsely distributed across a complex scene. This ability is foundational to 3D vision, robotic perception, and relies not only on low-level feature matching but also on high-level spatial reasoning and cognitive integration. Yet, it remains unclear whether current vision models can replicate this human-level proficiency. In this work, we introduce the Co-VisiON benchmark, designed to evaluate human-inspired co-visibility reasoning across more than 1,000 sparse-view indoor scenarios. Our results show that while co-visibility is often approached as a low-level feature-matching task, it remains challenging for existing vision models under sparse conditions. Notably, a proprietary vision-language model surpasses all vision-only baselines, but all models fall significantly short of human performance. This gap underscores the limitations of current architectures and motivates the need for models that integrate spatial and semantic information in a human-like manner. Inspired by human visual cognition, we propose a novel multi-view baseline, Covis, which achieves top performance among pure vision models and narrows the gap to the proprietary VLM. We hope our benchmark and findings will spur further advancements in developing vision models capable of robust, cognitively inspired reasoning in challenging, sparse environments. Our dataset and source code can be found at https://ai4ce.github.io/CoVISION.