CV, CL | Sep 22, 2025

WISE: Weak-Supervision-Guided Step-by-Step Explanations for Multimodal LLMs in Image Classification

Yiwen Jiang, Deval Mehta, Siyuan Yan, Yaling Shen, Zimu Wang, Zongyuan Ge

arXiv:2509.17740v1 | 10.24 citations | h-index: 10 | EMNLP

Originality: Incremental advance

AI Analysis

This work addresses the need for better interpretability and fine-grained visual understanding in multimodal LLMs, though it is incremental as it builds on existing concept-based and MCoT methods.

The paper tackles the lack of intra-object understanding in multimodal LLMs for image classification. It proposes WISE, a method that generates step-by-step explanations from concept-based representations under weak supervision. The authors report a 37% improvement in interpretability, along with gains in classification accuracy when the generated explanations are used to fine-tune MLLMs.

Multimodal Large Language Models (MLLMs) have shown promise in visual-textual reasoning, with Multimodal Chain-of-Thought (MCoT) prompting significantly enhancing interpretability. However, existing MCoT methods rely on rationale-rich datasets and largely focus on inter-object reasoning, overlooking the intra-object understanding crucial for image classification. To address this gap, we propose WISE, a Weak-supervision-guided Step-by-step Explanation method that augments any image classification dataset with MCoTs by reformulating the concept-based representations from Concept Bottleneck Models (CBMs) into concise, interpretable reasoning chains under weak supervision. Experiments across ten datasets show that our generated MCoTs not only improve interpretability by 37% but also lead to gains in classification accuracy when used to fine-tune MLLMs. Our work bridges concept-based interpretability and generative MCoT reasoning, providing a generalizable framework for enhancing MLLMs in fine-grained visual understanding.
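The abstract describes turning concept-based representations into short reasoning chains. A minimal sketch of that general idea follows. It is not the authors' implementation: the function name `build_reasoning_chain`, the `top_k` cutoff, the sentence templates, and the example concept scores are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def build_reasoning_chain(
    concept_scores: Dict[str, float], label: str, top_k: int = 3
) -> List[str]:
    """Format the highest-scoring concepts as a step-by-step explanation.

    Hypothetical helper: the abstract does not specify how concepts are
    selected or phrased, so a simple top-k ranking and a fixed template
    are assumed here.
    """
    # Rank concepts by score, highest first; break ties alphabetically
    # so the output is deterministic.
    ranked = sorted(concept_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    steps = [
        f"Step {i}: The image shows {name}."
        for i, (name, _) in enumerate(ranked[:top_k], start=1)
    ]
    steps.append(f"Conclusion: These cues together suggest the class '{label}'.")
    return steps


# Illustrative, made-up concept scores such as a concept bottleneck
# layer might produce for a single bird image.
example_scores = {
    "a red crest": 0.91,
    "a black face mask": 0.84,
    "webbed feet": 0.05,
    "a short conical beak": 0.77,
}
chain = build_reasoning_chain(example_scores, "northern cardinal")
for line in chain:
    print(line)
```

In this sketch the low-scoring concept ("webbed feet") is dropped, so the chain mentions only the concepts that support the predicted label. A chain like this could then be paired with the image and label as a fine-tuning target, which is the role the abstract assigns to the generated MCoTs.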
