CV · Nov 27, 2024

ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

arXiv:2411.18363v3 · 28 citations · h-index: 18 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the limited perception accuracy of MLLMs on computer vision tasks. The advance is incremental because it builds on existing MLLM frameworks.

The paper tackles the perception gap in multimodal large language models (MLLMs) by introducing ChatRex, which uses a decoupled design and a new dataset to improve joint perception and understanding, achieving strong performance and enabling new applications.

Perception and understanding are two pillars of computer vision. While multimodal large language models (MLLMs) have demonstrated remarkable visual understanding capabilities, they arguably lack accurate perception abilities, e.g., the state-of-the-art model Qwen2-VL achieves only a 43.9 recall rate on the COCO dataset, limiting many tasks that require the combination of perception and understanding. In this work, we aim to bridge this perception gap from both model design and data development perspectives. We first introduce ChatRex, an MLLM with a decoupled perception design. Instead of having the LLM directly predict box coordinates, we feed the output boxes from a universal proposal network into the LLM, allowing it to output the corresponding box indices to represent its detection results, turning the regression task into a retrieval-based task that the LLM handles more proficiently. From the data perspective, we build a fully automated data engine and construct the Rexverse-2M dataset, which possesses multiple granularities to support the joint training of perception and understanding. After a three-stage training approach, ChatRex demonstrates strong perception and understanding performance, and the combination of these two capabilities also unlocks many attractive applications, demonstrating their complementary roles in MLLMs. Code is available at https://github.com/IDEA-Research/ChatRex.
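
The retrieval-based formulation described in the abstract can be illustrated with a minimal sketch. It assumes the LLM's text output contains index tokens that refer to entries in a list of proposal boxes. The token format `<objN>`, the helper name `boxes_from_indices`, and the sample boxes are hypothetical and chosen only for illustration; they are not taken from the ChatRex code.

```python
import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

# Hypothetical index-token format; the actual model vocabulary may differ.
INDEX_TOKEN = re.compile(r"<obj(\d+)>")


def boxes_from_indices(llm_output: str, proposals: List[Box]) -> List[Box]:
    """Map index tokens in the LLM's text output back to proposal boxes."""
    selected = []
    for match in INDEX_TOKEN.finditer(llm_output):
        idx = int(match.group(1))
        if 0 <= idx < len(proposals):  # skip indices with no matching proposal
            selected.append(proposals[idx])
    return selected


# Boxes that a proposal network might produce (made-up values).
proposals = [
    (10.0, 20.0, 110.0, 220.0),
    (50.0, 60.0, 90.0, 120.0),
    (200.0, 40.0, 320.0, 300.0),
]

# The LLM selects boxes by index instead of regressing coordinates.
llm_output = "dog: <obj0><obj2>"
detections = boxes_from_indices(llm_output, proposals)
print(detections)
```

In this sketch the language model never emits coordinates; it only picks indices, so localization quality depends on the proposal boxes while the choice of which boxes match the query is left to the LLM.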

Code Implementations: 1 repo