Linhao Wang

CV
h-index18
3papers
Novelty58%
AI Score48

3 Papers

85.6CVMay 29Code
Thinking in Structures: Evaluating Spatial Intelligence in Constraint-Governed Spaces

Chen Yang, Guanxin Lin, Youquan He et al.

Spatial intelligence is crucial for vision--language models (VLMs), yet many scene-centric benchmarks evaluate unconstrained environments where a single image may admit multiple plausible 3D interpretations. We introduce SSI-Bench, a VQA benchmark for Structure-Centric Spatial Reasoning (SCSR) in constraint-governed spaces. Built from complex real-world 3D structures, it uses structural constraints from geometry, topology, and physical feasibility to make component relations more determinate from visual evidence. The benchmark contains 1,000 ranking questions spanning geometric and topological reasoning, where correct ordering requires resolving all candidate-wise 3D relations, imposing stronger demands on spatial understanding. It is created through a fully human-centered pipeline with over 400 researcher-hours of image curation, component annotation, and question design. Evaluating 31 VLMs reveals a large gap to humans: the best open-source model achieves 22.2% accuracy and the strongest closed-source model reaches 33.6%, while humans score 91.6%. Further results show that chain-of-thought reasoning brings only marginal gains, and error analysis reveals fundamental limitations in current models' spatial understanding within constraint-governed spaces. Project page: https://ssi-bench.github.io.

25.0CVMay 30
VICR: Visual In-Context Restoration for Real-World Image Super-Resolution

Qichang Zhang, Hailong Wang, Baiang Li et al.

Real-world image super-resolution (Real-ISR) requires balancing structural fidelity to degraded observations with realistic detail synthesis. However, existing generative Real-ISR methods often rely on entangled conditioning mechanisms, leading to structural drift or semantically inconsistent details. To address this issue, we propose Visual In-Context Restoration (VICR), a Diffusion Transformer (DiT)-based framework that formulates Real-ISR as image completion. Specifically, we introduce a decoupled visual prior injection mechanism that derives local and global cues from the low-quality (LQ) image: local cues help recover image structures and support high-frequency detail synthesis, while global cues guide overall generation and promote semantic consistency. For ambiguous regions under severe degradation, VICR employs an inference-time agent to refine semantic prompts using visual evidence from the LQ input while keeping model parameters fixed. Experiments show that VICR achieves state-of-the-art performance across multiple Real-ISR benchmarks with only 127M trainable parameters.

MASep 18, 2025
Vulnerable Agent Identification in Large-Scale Multi-Agent Reinforcement Learning

Simin Li, Zheng Yuwei, Zihao Mao et al.

Partial agent failure becomes inevitable when systems scale up, making it crucial to identify the subset of agents whose compromise would most severely degrade overall performance. In this paper, we study this Vulnerable Agent Identification (VAI) problem in large-scale multi-agent reinforcement learning (MARL). We frame VAI as a Hierarchical Adversarial Decentralized Mean Field Control (HAD-MFC), where the upper level involves an NP-hard combinatorial task of selecting the most vulnerable agents, and the lower level learns worst-case adversarial policies for these agents using mean-field MARL. The two problems are coupled together, making HAD-MFC difficult to solve. To solve this, we first decouple the hierarchical process by Fenchel-Rockafellar transform, resulting a regularized mean-field Bellman operator for upper level that enables independent learning at each level, thus reducing computational complexity. We then reformulate the upper-level combinatorial problem as a MDP with dense rewards from our regularized mean-field Bellman operator, enabling us to sequentially identify the most vulnerable agents by greedy and RL algorithms. This decomposition provably preserves the optimal solution of the original HAD-MFC. Experiments show our method effectively identifies more vulnerable agents in large-scale MARL and the rule-based system, fooling system into worse failures, and learns a value function that reveals the vulnerability of each agent.