CV, AI | Sep 14, 2025

Geometrically Constrained and Token-Based Probabilistic Spatial Transformers

arXiv:2509.11218v1 | 2 citations | h-index: 2
Originality: Incremental advance
AI Analysis

The paper addresses geometric sensitivity in fine-grained visual classification, but the contribution appears incremental because it extends existing STN frameworks.

The paper tackles fine-grained visual classification under geometric variability by proposing a probabilistic, component-wise extension of Spatial Transformer Networks that decomposes affine transformations into constrained components with uncertainty modeling. Experiments on moth classification benchmarks show consistent robustness improvements over other STN methods.

Fine-grained visual classification (FGVC) remains highly sensitive to geometric variability, where objects appear under arbitrary orientations, scales, and perspective distortions. While equivariant architectures address this issue, they typically require substantial computational resources and restrict the hypothesis space. We revisit Spatial Transformer Networks (STNs) as a canonicalization tool for transformer-based vision pipelines, emphasizing their flexibility, backbone-agnostic nature, and lack of architectural constraints. We propose a probabilistic, component-wise extension that improves robustness. Specifically, we decompose affine transformations into rotation, scaling, and shearing, and regress each component under geometric constraints using a shared localization encoder. To capture uncertainty, we model each component with a Gaussian variational posterior and perform sampling-based canonicalization during inference. A novel component-wise alignment loss leverages augmentation parameters to guide spatial alignment. Experiments on challenging moth classification benchmarks demonstrate that our method consistently improves robustness compared to other STNs.
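The decomposition and sampling steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the composition order (rotation, then shear, then scale), the tanh bounding used as the "geometric constraint", the log-scale parameterization, and the names `compose_affine`, `constrain`, and `sample_transforms` are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def compose_affine(theta, log_sx, log_sy, shear):
    """Compose a 2x2 linear map as rotation @ shear @ scale.

    The ordering is an assumption; the abstract only names the components.
    """
    c, s = np.cos(theta), np.sin(theta)
    rotation = np.array([[c, -s], [s, c]])
    shear_m = np.array([[1.0, shear], [0.0, 1.0]])
    # Log-scales keep both scale factors strictly positive.
    scale = np.diag([np.exp(log_sx), np.exp(log_sy)])
    return rotation @ shear_m @ scale


def constrain(raw, bounds):
    """Squash unconstrained outputs into (-bounds, bounds) with tanh (assumed)."""
    return bounds * np.tanh(raw)


def sample_transforms(mean, log_std, bounds, n_samples, rng):
    """Sample components from a diagonal Gaussian and compose each sample.

    mean, log_std, bounds: shape (4,), ordered (theta, log_sx, log_sy, shear).
    Returns matrices of shape (n_samples, 2, 2) and the bounded components.
    """
    eps = rng.standard_normal((n_samples, 4))
    raw = mean + np.exp(log_std) * eps  # reparameterized Gaussian samples
    comps = constrain(raw, bounds)
    mats = np.stack([compose_affine(*c) for c in comps])
    return mats, comps


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bounds = np.array([np.pi, 0.5, 0.5, 0.3])  # hypothetical limits
    mats, comps = sample_transforms(
        np.zeros(4), np.full(4, -1.0), bounds, n_samples=8, rng=rng
    )
    print(mats.shape, comps.shape)
```

In a full pipeline, each sampled matrix would presumably be used to resample the input image before classification, with the classifier outputs aggregated across samples; that aggregation step is likewise an assumption here, not something the abstract details.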
