Ruilin Yao

CV
h-index28
6papers
36citations
Novelty43%
AI Score51

6 Papers

CVSep 8, 2024Code
Visual Grounding with Multi-modal Conditional Adaptation

Ruilin Yao, Shengwu Xiong, Yichen Zhao et al.

Visual grounding is the task of locating objects specified by natural language expressions. Existing methods extend generic object detection frameworks to tackle this task. They typically extract visual and textual features separately using independent visual and textual encoders, then fuse these features in a multi-modal decoder for final prediction. However, visual grounding presents unique challenges. It often involves locating objects with different text descriptions within the same image. Existing methods struggle with this task because the independent visual encoder produces identical visual features for the same image, limiting detection performance. Some recently approaches propose various language-guided visual encoders to address this issue, but they mostly rely solely on textual information and require sophisticated designs. In this paper, we introduce Multi-modal Conditional Adaptation (MMCA), which enables the visual encoder to adaptively update weights, directing its focus towards text-relevant regions. Specifically, we first integrate information from different modalities to obtain multi-modal embeddings. Then we utilize a set of weighting coefficients, which generated from the multimodal embeddings, to reorganize the weight update matrices and apply them to the visual encoder of the visual grounding model. Extensive experiments on four widely used datasets demonstrate that MMCA achieves significant improvements and state-of-the-art results. Ablation experiments further demonstrate the lightweight and efficiency of our method. Our source code is available at: https://github.com/Mr-Bigworth/MMCA.

CVMay 21, 2025Code
LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models

Ruilin Yao, Bo Zhang, Jirui Huang et al.

Multimodal Large Language Models (MLLMs) have achieved significant advances in integrating visual and linguistic information, yet their ability to reason about complex and real-world scenarios remains limited. The existing benchmarks are usually constructed in the task-oriented manner without guarantee that different task samples come from the same data distribution, thus they often fall short in evaluating the synergistic effects of lower-level perceptual capabilities on higher-order reasoning. To lift this limitation, we contribute Lens, a multi-level benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios, forming three progressive task tiers, i.e., perception, understanding, and reasoning. One feature is that each image is equipped with rich annotations for all tasks. Thus, this dataset intrinsically supports to evaluate MLLMs to handle image-invariable prompts, from basic perception to compositional reasoning. In addition, our images are manully collected from the social media, in which 53% were published later than Jan. 2025. We evaluate 15+ frontier MLLMs such as Qwen2.5-VL-72B, InternVL3-78B, GPT-4o and two reasoning models QVQ-72B-preview and Kimi-VL. These models are released later than Dec. 2024, and none of them achieve an accuracy greater than 60% in the reasoning tasks. Project page: https://github.com/Lens4MLLMs/lens. ICCV 2025 workshop page: https://lens4mllms.github.io/mars2-workshop-iccv2025/

CVJul 25, 2025Code
Balancing Conservatism and Aggressiveness: Prototype-Affinity Hybrid Network for Few-Shot Segmentation

Tianyu Zou, Shengwu Xiong, Ruilin Yao et al.

This paper studies the few-shot segmentation (FSS) task, which aims to segment objects belonging to unseen categories in a query image by learning a model on a small number of well-annotated support samples. Our analysis of two mainstream FSS paradigms reveals that the predictions made by prototype learning methods are usually conservative, while those of affinity learning methods tend to be more aggressive. This observation motivates us to balance the conservative and aggressive information captured by these two types of FSS frameworks so as to improve the segmentation performance. To achieve this, we propose a **P**rototype-**A**ffinity **H**ybrid **Net**work (PAHNet), which introduces a Prototype-guided Feature Enhancement (PFE) module and an Attention Score Calibration (ASC) module in each attention block of an affinity learning model (called affinity learner). These two modules utilize the predictions generated by a pre-trained prototype learning model (called prototype predictor) to enhance the foreground information in support and query image representations and suppress the mismatched foreground-background (FG-BG) relationships between them, respectively. In this way, the aggressiveness of the affinity learner can be effectively mitigated, thereby eventually increasing the segmentation accuracy of our PAHNet method. Experimental results show that PAHNet outperforms most recently proposed methods across 1-shot and 5-shot settings on both PASCAL-5$^i$ and COCO-20$^i$ datasets, suggesting its effectiveness. The code is available at: [GitHub - tianyu-zou/PAHNet: Balancing Conservatism and Aggressiveness: Prototype-Affinity Hybrid Network for Few-Shot Segmentation (ICCV'25)](https://github.com/tianyu-zou/PAHNet)

CVMay 4
AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding

Ruilin Yao, Shegnwu Xiong, Tianyu Zou et al.

Vision-Language Models (VLMs) have enabled autonomous GUI agents that translate natural language instructions into executable screen coordinates. However, grounding performance degrades in high-resolution interfaces, where dense layouts and small interactive elements expose a resolution gap between modern displays and model input constraints. Existing zoom-in strategies rely on fixed anchors, heuristic grids, or reinforcement learning, lacking a principled mechanism to adaptively determine where refinement is needed and how much spatial uncertainty should be explored. We propose AutoFocus, a training-free, uncertainty-aware active visual search framework for GUI grounding. Our key insight is that token-level perplexity in coordinate generation naturally reflects spatial uncertainty. Rather than committing to a single prediction, AutoFocus samples multiple coordinate hypotheses and converts their axial perplexities into an anisotropic gaussian spatial probability field, explicitly modeling directional uncertainty. Based on this field, we generate global and local region proposals and introduce Shape-Aware Zooming to balance tight localization with contextual preservation. A visual prompt-based aggregation step then selects the most consistent prediction via structured comparison. Extensive experiments on ScreenSpot-Pro and ScreenSpot-V2 demonstrate consistent improvements across both general-purpose and GUI-specialized VLMs.

CVSep 17, 2025
MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook

Peng Xu, Shengwu Xiong, Jiajun Zhang et al.

This paper reviews the MARS2 2025 Challenge on Multimodal Reasoning. We aim to bring together different approaches in multimodal machine learning and LLMs via a large benchmark. We hope it better allows researchers to follow the state-of-the-art in this very dynamic area. Meanwhile, a growing number of testbeds have boosted the evolution of general-purpose large language models. Thus, this year's MARS2 focuses on real-world and specialized scenarios to broaden the multimodal reasoning applications of MLLMs. Our organizing team released two tailored datasets Lens and AdsQA as test sets, which support general reasoning in 12 daily scenarios and domain-specific reasoning in advertisement videos, respectively. We evaluated 40+ baselines that include both generalist MLLMs and task-specific models, and opened up three competition tracks, i.e., Visual Grounding in Real-world Scenarios (VG-RS), Visual Question Answering with Spatial Awareness (VQA-SA), and Visual Reasoning in Creative Advertisement Videos (VR-Ads). Finally, 76 teams from the renowned academic and industrial institutions have registered and 40+ valid submissions (out of 1200+) have been included in our ranking lists. Our datasets, code sets (40+ baselines and 15+ participants' methods), and rankings are publicly available on the MARS2 workshop website and our GitHub organization page https://github.com/mars2workshop/, where our updates and announcements of upcoming events will be continuously provided.

CVSep 28, 2025
ColLab: A Collaborative Spatial Progressive Data Engine for Referring Expression Comprehension and Generation

Shilan Zhang, Jirui Huang, Ruilin Yao et al.

Referring Expression Comprehension (REC) and Referring Expression Generation (REG) are fundamental tasks in multimodal understanding, supporting precise object localization through natural language. However, existing REC and REG datasets rely heavily on manual annotation, which is labor-intensive and difficult to scale. In this paper, we propose ColLab, a collaborative spatial progressive data engine that enables fully automated REC and REG data generation without human supervision. Specifically, our method introduces a Collaborative Multimodal Model Interaction (CMMI) strategy, which leverages the semantic understanding of multimodal large language models (MLLMs) and large language models (LLMs) to generate descriptions. Furthermore, we design a module termed Spatial Progressive Augmentation (SPA) to enhance spatial expressiveness among duplicate instances. Experiments demonstrate that ColLab significantly accelerates the annotation process of REC and REG while improving the quality and discriminability of the generated expressions. In addition to the core methodological contribution, our framework was partially adopted in the data generation pipeline of the ICCV 2025 MARS2 Challenge on Multimodal Reasoning, enriching the dataset with diverse and challenging samples that better reflect real-world reasoning demands.