CV, AI, CL, MM | May 19, 2023

TreePrompt: Learning to Compose Tree Prompts for Explainable Visual Grounding

arXiv:2305.11497v1 | 3 citations
Originality: Incremental advance
AI Analysis

This paper addresses the explainability problem in visual grounding for researchers and practitioners, offering an incremental improvement over existing prompt tuning methods.

The paper tackles the poor interpretability of prompt tuning in visual grounding by proposing TreePrompt, a method that deconstructs sentences into trees and composes prompts step-by-step, achieving strong performance across multiple backbones and benchmarks.

Prompt tuning has achieved great success in transferring the knowledge from large pretrained vision-language models into downstream tasks, and has dominated the performance on visual grounding (VG). However, almost all existing prompt tuning paradigms suffer from poor interpretability. In this paper, we argue that their poor interpretability is attributed to the holistic prompt generation and inference process. By "holistic", we mean that they usually directly learn a set of vectors as the prompt (i.e., prompt generation), and use the learned global prompt to augment the textual input for the VG model (i.e., prompt inference). To this end, we propose a new prompt construction paradigm with explicit explainable ability, named TreePrompt. Specifically, we first deconstruct a complex sentence into a tree that is consistent with human reasoning. Then, following the syntax tree, we compose a structured prompt in a bottom-up manner. Thanks to this step-by-step prompt construction process, each intermediate prompt (i.e., tree node) permits us to understand the reasoning process. Extensive ablations on various backbones and benchmarks consistently demonstrate the effectiveness and interpretability of our TreePrompt.
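
The bottom-up composition described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Node` class, the `embed` function, the averaging rule in `compose`, and the hand-built tree are all hypothetical stand-ins, chosen only to show how a post-order traversal yields one inspectable intermediate prompt per tree node. The actual method uses learned modules and a real parser, neither of which is reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a (hand-built) syntax tree: a word plus its children."""
    word: str
    children: list = field(default_factory=list)


def embed(word, dim=4):
    """Toy deterministic embedding; a hypothetical stand-in for a learned one."""
    total = sum(ord(ch) for ch in word)
    return [((total * (i + 1)) % 11) / 10.0 for i in range(dim)]


def compose(node, trace):
    """Compose a prompt vector bottom-up and record every intermediate prompt.

    Assumption for illustration: a node's prompt is the average of its own
    word embedding and the mean of its children's prompts.
    """
    # Children are composed first, so the traversal is post-order.
    child_prompts = [compose(child, trace) for child in node.children]
    own = embed(node.word)
    if child_prompts:
        count = len(child_prompts)
        mean = [sum(p[i] for p in child_prompts) / count for i in range(len(own))]
        prompt = [0.5 * o + 0.5 * m for o, m in zip(own, mean)]
    else:
        prompt = own
    # The trace is what makes the process inspectable node by node.
    trace.append((node.word, prompt))
    return prompt


# Hand-built tree for "the dog left (of) the woman"; no parser is involved.
tree = Node("dog", [
    Node("the"),
    Node("left", [Node("woman", [Node("the")])]),
])

trace = []
root_prompt = compose(tree, trace)
for word, prompt in trace:
    print(word, [round(x, 2) for x in prompt])
```

Each entry in `trace` corresponds to one tree node, with leaves appearing before their parents and the root last. In this sketch the last entry is the prompt for the whole expression, while the earlier entries are the intermediate prompts that the abstract credits with exposing the reasoning process.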

