The Inductive Bottleneck: Data-Driven Emergence of Representational Sparsity in Vision Transformers
This work addresses the interpretability and efficiency of ViTs for computer vision researchers, offering insight into how data influences representation learning, although the contribution is incremental, building on prior observations.
The study investigates why Vision Transformers (ViTs) spontaneously develop a U-shaped entropy profile. It shows that the profile is a data-dependent adaptation rather than an architectural artifact: the depth of the bottleneck correlates strongly with the semantic abstraction the task requires, with object-centric datasets driving compression in the middle layers to isolate semantic features.
Vision Transformers (ViTs) lack the hierarchical inductive biases inherent to Convolutional Neural Networks (CNNs), theoretically allowing them to maintain high-dimensional representations throughout all layers. However, recent observations suggest ViTs often spontaneously manifest a "U-shaped" entropy profile, compressing information in middle layers before expanding it for the final classification. In this work, we demonstrate that this "Inductive Bottleneck" is not an architectural artifact, but a data-dependent adaptation. By analyzing the layer-wise Effective Encoding Dimension (EED) of DINO-trained ViTs across datasets of varying compositional complexity (UC Merced, Tiny ImageNet, and CIFAR-100), we show that the depth of the bottleneck correlates strongly with the semantic abstraction required by the task. We find that while texture-heavy datasets preserve high-rank representations throughout, object-centric datasets drive the network to dampen high-frequency information in middle layers, effectively "learning" a bottleneck to isolate semantic features.
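The abstract does not state how EED is computed, so the following is only a minimal sketch of one plausible proxy: an entropy-based effective rank, taken as the exponential of the Shannon entropy of the normalized covariance eigenvalue spectrum of a layer's features. The function names `effective_dimension` and `layerwise_profile` are hypothetical. The "layers" are synthetic low-rank matrices standing in for per-layer token embeddings, not activations from a DINO-trained ViT.

```python
import numpy as np


def effective_dimension(features: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of an (n_samples, n_dims) feature matrix.

    Assumption: this is a stand-in for EED, not the paper's stated definition.
    """
    # Center the features so the spectrum reflects variance, not the mean offset.
    centered = features - features.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to covariance eigenvalues.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    total = variances.sum()
    if total <= eps:
        return 0.0
    # Normalize to a probability distribution and drop numerically zero modes.
    p = variances / total
    p = p[p > eps]
    entropy = -(p * np.log(p)).sum()
    # exp(entropy) ranges from 1 (one dominant direction) to n_dims (isotropic).
    return float(np.exp(entropy))


def layerwise_profile(layer_features):
    """Apply the proxy to a list of per-layer feature matrices."""
    return [effective_dimension(f) for f in layer_features]


def synthetic_layer(rng, n_samples, n_dims, rank):
    """Hypothetical helper: random features confined to a `rank`-dim subspace."""
    return rng.standard_normal((n_samples, rank)) @ rng.standard_normal((rank, n_dims))


rng = np.random.default_rng(0)
# Three synthetic "layers" with ranks 16, 4, 16 mimic a U-shaped profile.
layers = [synthetic_layer(rng, 2000, 16, r) for r in (16, 4, 16)]
profile = layerwise_profile(layers)
print([round(v, 2) for v in profile])
```

With real models, the per-layer matrices would instead be built from token or CLS embeddings gathered at each transformer block, and the resulting profile compared across datasets. The choice of spectrum-entropy proxy matters: a participation-ratio estimate, for example, would generally give different absolute values.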