CV | Nov 27, 2024

ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

Qing Jiang, Gen Luo, Yuqin Yang, Yuda Xiong, Yihao Chen, Zhaoyang Zeng, Tianhe Ren, Lei Zhang

arXiv:2411.18363v319.828 citations | h-index: 18 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the limited perception accuracy of MLLMs on computer vision tasks, though the advance is incremental because it builds on existing MLLM frameworks.

The paper tackles the perception gap in multimodal large language models (MLLMs) by introducing ChatRex, which uses a decoupled design and a new dataset to improve joint perception and understanding, achieving strong performance and enabling new applications.

Perception and understanding are two pillars of computer vision. While multimodal large language models (MLLMs) have demonstrated remarkable visual understanding capabilities, they arguably lack accurate perception abilities: for example, the state-of-the-art model Qwen2-VL achieves only a 43.9 recall rate on the COCO dataset, which limits many tasks that require combining perception and understanding. In this work, we aim to bridge this perception gap from both the model design and data development perspectives. We first introduce ChatRex, an MLLM with a decoupled perception design. Instead of having the LLM directly predict box coordinates, we feed the output boxes from a universal proposal network into the LLM and let it output the corresponding box indices to represent its detection results, turning the regression task into a retrieval-based task that the LLM handles more proficiently. From the data perspective, we build a fully automated data engine and construct the Rexverse-2M dataset, which has multiple granularities to support the joint training of perception and understanding. After a three-stage training approach, ChatRex demonstrates strong perception and understanding performance, and the combination of these two capabilities also unlocks many attractive applications, demonstrating their complementary roles in MLLMs. Code is available at https://github.com/IDEA-Research/ChatRex.
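The retrieval-style detection described in the abstract can be illustrated with a minimal sketch. This is not the ChatRex implementation: the `<objN>` token format, the `boxes_from_indices` helper, and the example boxes are all assumptions made for illustration. The sketch only shows the idea of mapping index tokens in the LLM's text output back to boxes supplied by a separate proposal network, rather than having the LLM regress coordinates.

```python
import re
from typing import List, Tuple

# A box as (x1, y1, x2, y2) pixel coordinates.
Box = Tuple[float, float, float, float]

# Hypothetical index-token format; the abstract does not specify one.
INDEX_TOKEN = re.compile(r"<obj(\d+)>")


def boxes_from_indices(llm_output: str, proposals: List[Box]) -> List[Box]:
    """Map index tokens in the LLM's text output back to proposal boxes."""
    selected: List[Box] = []
    for match in INDEX_TOKEN.finditer(llm_output):
        idx = int(match.group(1))
        # Skip indices that do not refer to an existing proposal.
        if 0 <= idx < len(proposals):
            selected.append(proposals[idx])
    return selected


# Candidate boxes, standing in for the output of a proposal network.
proposals: List[Box] = [
    (10.0, 20.0, 110.0, 220.0),
    (150.0, 40.0, 300.0, 260.0),
    (5.0, 5.0, 50.0, 50.0),
]

# Made-up LLM output that selects proposals 0 and 2 for the label "dog".
llm_output = "dog: <obj0><obj2>"
detections = boxes_from_indices(llm_output, proposals)
```

In this framing the LLM never emits coordinates; localization quality depends on the proposal boxes, and the LLM only has to pick the right indices.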

