CV · LG · Jun 3

Spatially Grounded Concept Bottleneck Models via Part-Factorized Attention

arXiv:2606.04364 · 17.4
Predicted impact: top 92% in CV · last 90 days
Originality: Incremental advance
AI Analysis

For fine-grained recognition tasks, this method improves interpretability by spatially grounding concept predictions without requiring expensive per-image keypoint annotations.

This work introduces a part-factorized concept bottleneck model in which each concept head reads only from the part token for its designated body region. On CUB-200-2011 the spatial-prior variant reaches 88.85% top-1 accuracy (versus 88.95% for a fully supervised baseline) while raising pointing accuracy from 36.4% to 52.6%. A variant that pairs the prior with PCA foreground targets uses no per-image supervision and reaches 88.6% top-1 at about 70% pointing accuracy.

Concept bottleneck models (CBMs) predict a layer of human-named attributes before predicting a class, which makes their decisions auditable. On fine-grained recognition tasks the concept heads are usually free to attend anywhere in the image, so a head named for one body region can be satisfied by evidence on another. This work studies a part-factorized CBM that removes that freedom by construction. The method has three components built on a frozen DINOv3 vision transformer. A learned foreground gate, trained on DINOv3 patch features, suppresses background patches inside the part attention. A set of part queries cross-attends to patch features, and each of the 312 CUB attributes is routed, through a fixed concept-to-part map, to read only from the part token its name implies. A learnable two-dimensional Gaussian prior, injected additively in log space into the attention logits, breaks the permutation symmetry among part queries; its means are initialized from the dataset-average keypoint location of each part, which requires no per-image keypoint supervision at training or test time. On CUB-200-2011 the spatial-prior model matches a fully supervised baseline (88.85% versus 88.95% top-1) while raising pointing accuracy by 16 points (52.6% versus 36.4%). Replacing bounding-box supervision with a PCA foreground target and combining it with the Gaussian prior removes all per-image supervision and reaches 88.6% top-1 at about 70% pointing accuracy. A keypoint-fraction sweep shows that 0.5% of the training set (about 27 images) suffices to initialize the prior with no measurable loss. Removing part identity entirely is the harder case: without any spatial prior, pointing accuracy collapses to 2.9%.
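The three components described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the diagonal covariance, the scaled dot-product logits, and the choice to apply the foreground gate as an additive log term are all assumptions made for illustration. The abstract only states that the Gaussian prior is added in log space, that the gate suppresses background patches inside the part attention, and that concepts are routed through a fixed concept-to-part map.

```python
import numpy as np

def patch_grid(h, w):
    """Patch-centre coordinates in [0, 1]^2 as (x, y), shape (h*w, 2)."""
    ys, xs = np.meshgrid((np.arange(h) + 0.5) / h,
                         (np.arange(w) + 0.5) / w, indexing="ij")
    return np.stack([xs.ravel(), ys.ravel()], axis=1)

def init_prior_means(keypoints, visible):
    """Average keypoint location per part over a (small) annotated subset.
    keypoints: (N, K, 2) normalised (x, y); visible: (N, K) booleans."""
    w = visible[..., None].astype(float)
    return (keypoints * w).sum(0) / np.maximum(w.sum(0), 1.0)

def log_gaussian_prior(coords, mu, log_sigma):
    """Unnormalised log-density of a diagonal 2-D Gaussian, shape (K, P).
    The normalising constant is dropped: it is constant per part query,
    so the softmax over patches ignores it. Diagonal covariance is assumed."""
    diff = coords[None, :, :] - mu[:, None, :]        # (K, P, 2)
    inv_var = np.exp(-2.0 * log_sigma)[:, None, :]    # (K, 1, 2)
    return -0.5 * (diff ** 2 * inv_var).sum(-1)

def part_attention(queries, patches, coords, mu, log_sigma, fg_gate, eps=1e-6):
    """Cross-attention from K part queries to P patch features.
    The prior and the foreground gate are both added in log space
    (the gate's placement here is an assumption)."""
    d = queries.shape[-1]
    logits = queries @ patches.T / np.sqrt(d)                    # (K, P)
    logits = logits + log_gaussian_prior(coords, mu, log_sigma)  # spatial prior
    logits = logits + np.log(fg_gate + eps)[None, :]             # background suppression
    logits = logits - logits.max(axis=-1, keepdims=True)         # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn, attn @ patches                                  # (K, P), (K, D)

def concept_logits(part_tokens, concept_to_part, W, b):
    """Each concept reads only the part token its fixed map assigns.
    concept_to_part: (C,) ints; W: (C, D); b: (C,)."""
    routed = part_tokens[concept_to_part]                        # (C, D)
    return (routed * W).sum(-1) + b
```

Because every part query gets its own prior mean, two queries with identical weights still attend to different regions, which is one way to read the claim that the prior breaks the permutation symmetry among part queries. In a trained model `mu` and `log_sigma` would be learnable parameters; here they are plain arrays.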

