Disentangled Lottery Tickets: Identifying and Assembling Core and Specialist Subnetworks
This work addresses the computational cost of finding winning tickets in neural networks. For researchers and practitioners, it offers a modular approach to pruning that is incremental in method but reframes pruning as the discovery of modular subnetworks.
The paper tackles the problem of identifying sparse subnetworks in neural networks by proposing the Disentangled Lottery Ticket (DiLT) Hypothesis, which separates a universal 'core' subnetwork from task-specific 'specialist' subnetworks. It shows that reassembling these components outperforms prior methods such as COLT on ImageNet and fine-grained datasets.
The Lottery Ticket Hypothesis (LTH) suggests that within large neural networks, there exist sparse, trainable "winning tickets" capable of matching the performance of the full model, but identifying them through Iterative Magnitude Pruning (IMP) is computationally expensive. Recent work introduced COLT, an accelerator that discovers a "consensus" subnetwork by intersecting masks from models trained on disjoint data partitions; however, this approach discards all non-overlapping weights, assuming they are unimportant. This paper challenges that assumption and proposes the Disentangled Lottery Ticket (DiLT) Hypothesis, which posits that the intersection mask represents a universal, task-agnostic "core" subnetwork, while the non-overlapping difference masks capture specialized, task-specific "specialist" subnetworks. A framework is developed to identify and analyze these components, using the Gromov-Wasserstein (GW) distance to quantify functional similarity between layer representations and spectral clustering to reveal modular structures. Experiments on ImageNet and fine-grained datasets such as Stanford Cars, using ResNet and Vision Transformer architectures, show that the "core" ticket provides superior transfer learning performance, the "specialist" tickets retain domain-specific features enabling modular assembly, and the full re-assembled "union" ticket outperforms COLT, demonstrating that non-consensus weights play a critical functional role. This work reframes pruning as a process for discovering modular, disentangled subnetworks rather than merely compressing models.
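The relationship between the core, specialist, and union masks can be illustrated with a minimal sketch. It assumes each partition's pruning mask is a flattened list of booleans (True means the weight is kept) and that a specialist mask is simply a partition's mask minus the intersection; the summary does not spell out the exact construction, so that reading, the function name `disentangle`, and the toy masks are all illustrative assumptions rather than the authors' implementation.

```python
from typing import List, Sequence, Tuple

Mask = List[bool]  # flattened binary pruning mask: True = weight kept


def disentangle(masks: Sequence[Mask]) -> Tuple[Mask, List[Mask], Mask]:
    """Split per-partition masks into core, specialist, and union masks.

    Hypothetical helper: an assumed reading of the intersection and
    difference masks, not the paper's code.
    """
    if not masks:
        raise ValueError("need at least one mask")
    n = len(masks[0])
    if any(len(m) != n for m in masks):
        raise ValueError("all masks must have the same length")

    # Core: weights kept by every partition's ticket (the intersection).
    core = [all(bits) for bits in zip(*masks)]
    # Specialists: weights a partition kept that fall outside the core.
    specialists = [[kept and not c for kept, c in zip(m, core)] for m in masks]
    # Union: weights kept by at least one partition (core plus all specialists).
    union = [any(bits) for bits in zip(*masks)]
    return core, specialists, union


# Toy masks for two data partitions over six weights (made-up values).
mask_a = [True, True, False, True, False, False]
mask_b = [True, False, True, True, False, False]
core, (spec_a, spec_b), union = disentangle([mask_a, mask_b])
```

In this toy case the core keeps the two weights both partitions agree on, each specialist keeps the single weight unique to its partition, and the union keeps all four. An intersection-only method would retain just the core, which is the gap the DiLT hypothesis targets.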