15.5LGMay 17
t-gems: text-guided exit modules for decreasing clip image encoderAlberto Presta, Grzegorz Stefanski, Michal Byra et al.
Multimodal deep neural networks enhance deep comprehension by integrating diverse data modalities. Data from different modalities are typically projected into a shared latent space for similarity computation, but this process is resource intensive due to large image encoders and equal processing of test data during prediction. Early exit methods reduce computational load by utilizing intermediate layers, saving time and memory. However, developing such methods is challenging for multimodal data like image-text pairs. This study investigates the semantic content distributions present in intermediate layers of encoders such as CLIP, which can be derived from textual descriptions. We introduce Text-Guided Exit Modules (T-GEMs) and a rate-based regularizer to control encoder usage costs while maintaining cross-modal understanding performance.
38.2CVMay 11
bViT: Investigating Single-Block Recurrence in Vision Transformers for Image RecognitionMichal Byra, Pawel Olszowiec, Grzegorz Stefanski et al.
Vision Transformers (ViTs) are built by stacking independently parameterized blocks, but it remains unclear how much of this depth requires layer specific transformations and how much can be realized through recurrent computation. We study this question with bViT, a single-block recurrent ViT in which one transformer block is applied repeatedly to process an image. This architecture preserves the iterative structure of a deep ViT while removing layer specific block parameterization, providing a controlled setting for studying recurrence in vision. On ImageNet-1K, a 12-step bViT-B achieves accuracy comparable to standard ViT-B under the same training recipe and computational budget, while using an order of magnitude fewer parameters. We observe that recurrent performance improves with representation width, with wider bViTs recovering much more of the performance of standard ViTs than narrow variants. We interpret this behavior as implicit depth multiplexing, where a shared block expresses multiple step-dependent computations through the evolving hidden state. Beyond ImageNet classification, bViT transfers competitively to downstream tasks and enables parameter-efficient fine-tuning. Mechanistic analyses of activations, attention and step-specific pruning show that the shared block changes its effective behavior across recurrent steps rather than simply repeating the same computation. Our results suggest that a large fraction of ViT depth can be implemented through recurrent reuse, provided that the representation space is sufficiently wide.
AIJan 29
Routing the Lottery: Adaptive Subnetworks for Heterogeneous DataGrzegorz Stefanski, Alberto Presta, Michal Byra
In pruning, the Lottery Ticket Hypothesis posits that large networks contain sparse subnetworks, or winning tickets, that can be trained in isolation to match the performance of their dense counterparts. However, most existing approaches assume a single universal winning ticket shared across all inputs, ignoring the inherent heterogeneity of real-world data. In this work, we propose Routing the Lottery (RTL), an adaptive pruning framework that discovers multiple specialized subnetworks, called adaptive tickets, each tailored to a class, semantic cluster, or environmental condition. Across diverse datasets and tasks, RTL consistently outperforms single- and multi-model baselines in balanced accuracy and recall, while using up to 10 times fewer parameters than independent models and exhibiting semantically aligned. Furthermore, we identify subnetwork collapse, a performance drop under aggressive pruning, and introduce a subnetwork similarity score that enables label-free diagnosis of oversparsification. Overall, our results recast pruning as a mechanism for aligning model structure with data heterogeneity, paving the way toward more modular and context-aware deep learning.