Dynamic Mode Decomposition along Depth in Vision Transformers
For researchers studying the internal representations of vision transformers, this work provides evidence that short spans of ViT depth follow approximately linear dynamics, but shows that this local fidelity does not carry over to downstream tasks, which marks a limit of the linear approximation.
The paper uses Dynamic Mode Decomposition (DMD) to test whether Vision Transformer (ViT) depth implements approximately autonomous linear dynamics. For short spans (p ≤ 4), the p-step prediction from the learned linear operator stays within 0.02 cosine similarity of an unconstrained endpoint map on DINOv3-H/16+, but this local linearity does not transfer to downstream tasks, where an identity baseline is competitive.
Recent work has shown that contiguous vision transformer (ViT) blocks (a) can be replaced by a linear map and (b) organize into recurrent phases of computation. We ask whether these observations coincide: does ViT depth implement approximately \textit{autonomous linear} dynamics, admitting a single operator $K$ applied recurrently across a contiguous span? We test this using Dynamic Mode Decomposition (DMD), which fits $K$ from selected, consecutive hidden-state pairs and predicts $p$ steps ahead via $K^p$. On four pretrained DINO ViTs, we study the regularization, rank, and calibration budget required for stable fitting. For short spans ($p \leq 4$), $K^p$ tracks an unconstrained endpoint map to within $0.02$ cosine similarity on DINOv3-H/16+, while also recovering intermediate activations at each skipped block. At early cut starts, the fitted operators compress to rank $\ll d$ with minimal calibration data, and across tokens, \texttt{cls} is most amenable to linearization; both properties decay monotonically with depth. Yet this local fidelity does not transfer downstream. At the final hidden state, after propagating through the remaining blocks, an identity baseline becomes competitive.
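The comparison described above can be illustrated with a minimal NumPy sketch on synthetic data. It is not the authors' implementation: the ridge-regularized least-squares fit, the toy orthogonal dynamics, and every name and default (`fit_linear_map`, `mean_cosine`, `demo`, `ridge`, `noise`) are assumptions made for illustration. The sketch fits a single operator $K$ from consecutive state pairs, predicts $p$ steps ahead with $K^p$, and compares the result against an unconstrained endpoint map and an identity baseline by mean cosine similarity on held-out samples.

```python
import numpy as np


def fit_linear_map(X, Y, ridge=1e-3):
    """Ridge least squares: return K (d x d) such that Y ~= X @ K.T."""
    d = X.shape[1]
    gram = X.T @ X + ridge * np.eye(d)
    # Solve gram @ W = X.T @ Y for W = K.T, then transpose.
    return np.linalg.solve(gram, X.T @ Y).T


def mean_cosine(A, B):
    """Mean row-wise cosine similarity between two (n, d) arrays."""
    num = np.sum(A * B, axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return float(np.mean(num / den))


def demo(seed=0, d=16, n=512, p=3, noise=0.01):
    rng = np.random.default_rng(seed)
    # Toy stand-in for hidden states (an assumption, not real ViT data):
    # a fixed orthogonal map plus small Gaussian noise at every "block".
    K_true, _ = np.linalg.qr(rng.standard_normal((d, d)))
    states = [rng.standard_normal((2 * n, d))]
    for _ in range(p):
        step_noise = noise * rng.standard_normal((2 * n, d))
        states.append(states[-1] @ K_true.T + step_noise)
    train = [h[:n] for h in states]  # calibration split
    test = [h[n:] for h in states]   # held-out split

    # DMD-style fit: one operator K from all consecutive pairs in the span.
    X = np.concatenate(train[:-1])
    Y = np.concatenate(train[1:])
    K = fit_linear_map(X, Y)

    # Unconstrained endpoint map: fit directly from the first to the last state.
    E = fit_linear_map(train[0], train[-1])

    K_p = np.linalg.matrix_power(K, p)
    return {
        "dmd": mean_cosine(test[0] @ K_p.T, test[-1]),
        "endpoint": mean_cosine(test[0] @ E.T, test[-1]),
        "identity": mean_cosine(test[0], test[-1]),
    }


scores = demo()
print(scores)
```

Because the toy dynamics are exactly autonomous and linear up to noise, $K^p$ and the endpoint map should nearly coincide here; on real ViT activations the gap between them is the quantity of interest.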