PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity
This work addresses the need for fine-grained object understanding in MLLMs, which is crucial for applications requiring detailed visual analysis, although its advance over existing MLLM capabilities is incremental.
The paper tackles fine-grained, object-centric reasoning in multimodal large language models (MLLMs) by introducing PixelRefer, a unified framework for spatio-temporal object referring with arbitrary granularity. PixelRefer achieves leading performance on benchmarks with fewer training samples, and an efficient variant, PixelRefer-Lite, offers competitive accuracy.
Multimodal large language models (MLLMs) have demonstrated strong general-purpose capabilities in open-world visual comprehension. However, most existing MLLMs primarily focus on holistic, scene-level understanding, often overlooking the need for fine-grained, object-centric reasoning. In this paper, we present PixelRefer, a unified region-level MLLM framework that enables advanced fine-grained understanding over user-specified regions across both images and videos. Motivated by the observation that LLM attention predominantly focuses on object-level tokens, we propose a Scale-Adaptive Object Tokenizer (SAOT) to generate compact and semantically rich object representations from free-form regions. Our analysis reveals that global visual tokens contribute mainly in early LLM layers, inspiring the design of PixelRefer-Lite, an efficient variant that employs an Object-Centric Infusion module to pre-fuse global context into object tokens. This yields a lightweight Object-Only Framework that substantially reduces computational cost while maintaining high semantic fidelity. To facilitate fine-grained instruction tuning, we curate PixelRefer-2.2M, a high-quality object-centric instruction dataset. Extensive experiments across a range of benchmarks validate that PixelRefer achieves leading performance with fewer training samples, while PixelRefer-Lite offers competitive accuracy with notable gains in efficiency.
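The abstract names two ideas that a small example can make concrete: turning a free-form region into a compact object token, and pre-fusing global context into object tokens so that only object tokens need to be passed on. The sketch below is a generic illustration under stated assumptions, not the paper's SAOT or Object-Centric Infusion module, whose internals the abstract does not specify. It swaps in plain masked mean pooling and a single-head cross-attention with no learned projections; the function names `masked_object_token` and `infuse_global_context` are hypothetical, and nested lists stand in for real encoder feature maps.

```python
import math


def masked_object_token(features, mask):
    """Mean-pool the feature vectors inside a binary region mask.

    features: H x W x C nested lists (stand-in for a visual encoder output)
    mask:     H x W nested lists of 0/1 marking the user-specified region
    Assumption: simple mean pooling, not the paper's scale-adaptive tokenizer.
    """
    selected = [
        features[i][j]
        for i, row in enumerate(mask)
        for j, inside in enumerate(row)
        if inside
    ]
    if not selected:
        raise ValueError("mask selects no positions")
    channels = len(selected[0])
    return [sum(vec[c] for vec in selected) / len(selected) for c in range(channels)]


def infuse_global_context(object_tokens, global_tokens):
    """Single-head cross-attention with a residual connection.

    Each object token queries the global tokens, and the attended context
    is added back to the object token. No learned projections are used.
    Assumption: a generic stand-in for pre-fusing global context.
    """
    fused = []
    for obj in object_tokens:
        scale = math.sqrt(len(obj))
        scores = [
            sum(o * g for o, g in zip(obj, glob)) / scale for glob in global_tokens
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        context = [
            sum(w / total * glob[c] for w, glob in zip(exps, global_tokens))
            for c in range(len(obj))
        ]
        fused.append([o + ctx for o, ctx in zip(obj, context)])
    return fused


# Toy 3 x 3 feature map with 2 channels and a region covering two positions.
features = [[[float(i + j), 1.0] for j in range(3)] for i in range(3)]
mask = [[0, 0, 0], [0, 1, 1], [0, 0, 0]]
token = masked_object_token(features, mask)
fused = infuse_global_context([token], [[1.0, 2.0], [1.0, 2.0]])
```

After the infusion step, only `fused` (one vector per object) would be handed to the language model in an object-only setup, which is where the reduction in token count, and hence compute, would come from in a design of this kind.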