CV | Sep 10, 2025

Decomposing Visual Classification: Assessing Tree-Based Reasoning in VLMs

arXiv:2509.09732v1 | h-index: 4
Originality: Synthesis-oriented
AI Analysis

This paper addresses interpretability and performance in fine-grained visual classification with VLMs. The results are incremental: tree-based reasoning did not outperform existing methods.

The paper investigates whether structured, tree-based reasoning can improve vision language model performance on visual classification tasks. It finds that tree-based reasoning consistently underperforms standard zero-shot prompting, even though the model reaches 98.2% accuracy in understanding the tree knowledge.

Vision language models (VLMs) excel at zero-shot visual classification, but their performance on fine-grained tasks and large hierarchical label spaces is understudied. This paper investigates whether structured, tree-based reasoning can enhance VLM performance. We introduce a framework that decomposes classification into interpretable decisions using decision trees and evaluates it on fine-grained (GTSRB) and coarse-grained (CIFAR-10) datasets. Although the model achieves 98.2% accuracy in understanding the tree knowledge, tree-based reasoning consistently underperforms standard zero-shot prompting. We also explore enhancing the tree prompts with LLM-generated classes and image descriptions to improve alignment. The added description enhances the performance of the tree-based and zero-shot methods. Our findings highlight limitations of structured reasoning in visual classification and offer insights for designing more interpretable VLM systems.
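
To make the idea of decomposing classification into tree decisions concrete, here is a minimal sketch of such a walk. It is not the paper's implementation: the `Node` class, the `classify_with_tree` function, the toy tree, and the `fake_vlm` stub are all hypothetical names introduced for illustration, and a real system would replace the stub with an actual VLM query.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple, Union


@dataclass
class Node:
    """Internal tree node: a question whose answers lead to subtrees or labels."""
    question: str
    children: Dict[str, Union["Node", str]] = field(default_factory=dict)


# Hypothetical stand-in for a VLM call: (image, question, options) -> chosen option.
AskFn = Callable[[object, str, List[str]], str]


def classify_with_tree(
    image: object, root: Node, ask: AskFn
) -> Tuple[str, List[Tuple[str, str]]]:
    """Walk the tree from the root, asking one question per node.

    Returns the predicted label and the (question, answer) trace,
    which is what makes the decision path inspectable.
    """
    trace: List[Tuple[str, str]] = []
    node: Union[Node, str] = root
    while isinstance(node, Node):
        options = list(node.children)
        answer = ask(image, node.question, options)
        if answer not in node.children:
            raise ValueError(f"unexpected answer {answer!r} for {node.question!r}")
        trace.append((node.question, answer))
        node = node.children[answer]
    return node, trace


# Toy tree over a few traffic-sign-like labels (illustrative, not the paper's tree).
toy_tree = Node(
    "What shape is the sign?",
    {
        "circle": Node(
            "What is the border colour?",
            {"red": "speed limit", "blue": "mandatory direction"},
        ),
        "triangle": "warning",
        "octagon": "stop",
    },
)


def fake_vlm(image, question: str, options: List[str]) -> str:
    """Stub that reads answers from a dict instead of looking at pixels."""
    return image[question]


label, path = classify_with_tree(
    {"What shape is the sign?": "circle", "What is the border colour?": "red"},
    toy_tree,
    fake_vlm,
)
```

In this sketch the returned trace records every intermediate decision, which is the interpretability benefit. The same structure also means a wrong answer at any node cannot be corrected further down the path.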

