CV AI CL LG | Jan 25, 2025

Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models

Hulingxiao He, Geng Li, Zijun Geng, Jinglin Xu, Yuxin Peng

arXiv:2501.15140v3 | 26.129 citations | h-index: 7 | Has Code | ICLR

Originality: Incremental advance

AI Analysis

This work addresses a specific bottleneck in MLLMs for tasks like object-centric visual question answering, representing an incremental improvement in domain-specific capabilities.

The paper tackles fine-grained visual recognition (FGVR) in multi-modal large language models (MLLMs), which struggle to identify subordinate-level categories. It presents Finedefics, an MLLM that improves FGVR by incorporating attribute descriptions and contrastive learning, and that outperforms existing models of comparable size on multiple datasets.

Multi-modal large language models (MLLMs) have shown remarkable abilities in various visual understanding tasks. However, MLLMs still struggle with fine-grained visual recognition (FGVR), which aims to identify subordinate-level categories from images. This can negatively impact more advanced capabilities of MLLMs, such as object-centric visual question answering and reasoning. In our study, we revisit three quintessential capabilities of MLLMs for FGVR, namely object information extraction, category knowledge reserve, and object-category alignment, and position the root cause as a misalignment problem. To address this issue, we present Finedefics, an MLLM that enhances the model's FGVR capability by incorporating informative attribute descriptions of objects into the training phase. We employ contrastive learning on object-attribute pairs and attribute-category pairs simultaneously and use examples from similar but incorrect categories as hard negatives, naturally bringing representations of visual objects and category names closer. Extensive evaluations across multiple popular FGVR datasets demonstrate that Finedefics outperforms existing MLLMs of comparable parameter sizes, showcasing its remarkable efficacy. The code is available at https://github.com/PKU-ICST-MIPL/Finedefics_ICLR2025.
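The contrastive objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a generic InfoNCE-style loss over cosine similarities, and the function names (`cosine`, `info_nce`, `combined_loss`), the temperature value, and the unweighted sum of the two terms are all hypothetical choices made for illustration. It shows one object-attribute term and one attribute-category term, each contrasted against hard negatives drawn from similar but incorrect categories.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.07):
    """Generic InfoNCE loss: -log softmax of the positive pair's similarity.

    `negatives` holds embeddings that should NOT match the anchor,
    e.g. attributes or names of similar but incorrect categories.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


def combined_loss(obj, attr, cat, hard_neg_attrs, hard_neg_cats,
                  temperature=0.07):
    """Sum of an object-attribute term and an attribute-category term.

    Equal weighting of the two terms is an assumption of this sketch.
    """
    loss_oa = info_nce(obj, attr, hard_neg_attrs, temperature)
    loss_ac = info_nce(attr, cat, hard_neg_cats, temperature)
    return loss_oa + loss_ac


# Toy 2-D embeddings (made up for illustration only).
obj = [1.0, 0.1]          # visual object embedding
attr = [0.9, 0.2]         # its attribute description
cat = [0.8, 0.3]          # its category name
hard_attrs = [[0.7, 0.7]] # attributes of a similar but wrong category
hard_cats = [[0.6, 0.8]]  # name of a similar but wrong category

loss = combined_loss(obj, attr, cat, hard_attrs, hard_cats)
```

In this sketch a negative that lies close to the anchor raises the loss more than a distant one, which is the usual motivation for mining hard negatives from visually similar categories.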
