CV | Jul 26, 2025

Region-based Cluster Discrimination for Visual Representation Learning

arXiv:2507.20025v1 | 13 citations | h-index: 12 | Has Code
Originality: Incremental advance
AI Analysis

The paper targets dense prediction tasks such as grounding and segmentation, and is relevant to computer vision researchers and practitioners. It is assessed as an incremental improvement over existing methods.

The paper tackles the limitation of global representations in vision-language models for dense prediction tasks by introducing Region-Aware Cluster Discrimination (RICE), which improves region-level visual and OCR capabilities and, per the authors' experiments, consistently outperforms previous methods on tasks such as segmentation and dense detection.

Learning visual representations is foundational for a broad spectrum of downstream tasks. Although recent vision-language contrastive models, such as CLIP and SigLIP, have achieved impressive zero-shot performance via large-scale vision-language alignment, their reliance on global representations constrains their effectiveness for dense prediction tasks, such as grounding, OCR, and segmentation. To address this gap, we introduce Region-Aware Cluster Discrimination (RICE), a novel method that enhances region-level visual and OCR capabilities. We first construct a billion-scale candidate region dataset and propose a Region Transformer layer to extract rich regional semantics. We further design a unified region cluster discrimination loss that jointly supports object and OCR learning within a single classification framework, enabling efficient and scalable distributed training on large-scale data. Extensive experiments show that RICE consistently outperforms previous methods on tasks including segmentation, dense detection, and visual perception for Multimodal Large Language Models (MLLMs). The pre-trained models have been released at https://github.com/deepglint/MVT.
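
To make the idea of a "single classification framework" more concrete, the following is a minimal sketch of what a cluster discrimination loss over region embeddings could look like. It is not the paper's implementation. The function names, the `scale` temperature, the toy cluster centers, and the choice of a softmax loss that sums probability over a set of positive clusters are all assumptions made for illustration. The sketch treats an object region as having one positive cluster and an OCR region as possibly having several, so both cases go through the same classifier.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cluster_logits(region_embedding, centers, scale=10.0):
    """Cosine similarity between one region embedding and every cluster center.

    `scale` is a hypothetical temperature, not a value taken from the paper.
    """
    z = l2_normalize(region_embedding)
    return [
        scale * sum(a * b for a, b in zip(z, l2_normalize(center)))
        for center in centers
    ]


def cluster_discrimination_loss(region_embedding, centers, positive_ids, scale=10.0):
    """Negative log of the softmax mass assigned to the positive clusters.

    With one positive id this is ordinary softmax cross-entropy; with several
    (the assumed OCR case) the positives share a single classification head.
    """
    positive_ids = list(positive_ids)
    if not positive_ids:
        raise ValueError("each region needs at least one positive cluster")
    logits = cluster_logits(region_embedding, centers, scale)
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - max_logit) for logit in logits]
    positive_mass = sum(exps[i] for i in positive_ids)
    return -math.log(positive_mass / sum(exps))


# Toy cluster centers (illustrative only).
centers = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# Assumed object region: exactly one positive cluster.
object_loss = cluster_discrimination_loss([0.9, 0.1], centers, [0])

# Assumed OCR region: several positive clusters in the same classifier.
ocr_loss = cluster_discrimination_loss([0.5, 0.5], centers, [0, 1])
```

In this sketch the loss shrinks as the region embedding aligns with any of its positive centers, so the single-label and multi-label cases differ only in the size of `positive_ids`.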

