LG · OC · Nov 6, 2025

TwIST: Rigging the Lottery in Transformers with Independent Subnetwork Training

arXiv:2511.03983v1 · 1 citation · h-index: 11
Originality: Highly original
AI Analysis

This work addresses the challenge of deploying sparse LLMs for efficient inference on commodity hardware, offering a training-time solution with no post-processing overhead.

The paper tackles efficient sparsification of large language models (LLMs) by introducing TwIST, a distributed training framework that identifies high-quality subnetworks during training. This enables zero-cost pruning at deployment with competitive perplexity: under aggressive sparsity, TwIST reaches 23.14 PPL compared to 31.64 for the closest prior approach.

We introduce TwIST, a distributed training framework for efficient large language model (LLM) sparsification. TwIST trains multiple subnetworks in parallel, periodically aggregates their parameters, and resamples new subnetworks during training. This process identifies high-quality subnetworks ("golden tickets") without requiring post-training procedures such as calibration or Hessian-based recovery. As a result, TwIST enables zero-cost pruning at deployment time while achieving perplexity competitive with state-of-the-art post-training sparsification methods. The benefits are most pronounced under aggressive sparsity (e.g., 50%+), where TwIST significantly outperforms baseline methods; for example, reaching 23.14 PPL compared to 31.64 for the closest prior approach. Unlike unstructured pruning, TwIST produces structured, dense matrices that offer practical inference speedups and memory reductions on commodity hardware (e.g., CPUs) that do not support efficient sparse computation. TwIST provides an efficient training-time path to deployable sparse LLMs without additional fine-tuning or recovery overhead.
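The train, aggregate, and resample loop described in the abstract can be illustrated with a toy sketch. Nothing below is taken from the paper's implementation: the one-hidden-layer network, the function names (`sample_subnetworks`, `extract`, `local_sgd`, `aggregate`, `train`), the per-unit averaging rule, and all hyperparameters are assumptions chosen for illustration, and the "workers" run sequentially here rather than in parallel. The `extract` step also shows why a structured subnetwork is simply a smaller dense model: it keeps whole hidden units, so the result is a pair of smaller dense arrays rather than a masked sparse one.

```python
import random

def sample_subnetworks(hidden, workers, keep, rng):
    """Each (hypothetical) worker draws `keep` distinct hidden-unit indices."""
    return [sorted(rng.sample(range(hidden), keep)) for _ in range(workers)]

def extract(W1, w2, idx):
    """Copy the selected hidden units into smaller dense parameters."""
    return [W1[j][:] for j in idx], [w2[j] for j in idx]

def forward(W1, w2, x):
    """y = w2 . relu(W1 x); also returns the hidden activations."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return sum(a * b for a, b in zip(w2, h)), h

def local_sgd(W1, w2, data, lr):
    """One pass of plain SGD on squared error, updating the subnetwork in place."""
    for x, y in data:
        pred, h = forward(W1, w2, x)
        err = pred - y
        for j, row in enumerate(W1):
            # Gradient w.r.t. the pre-activation, using w2[j] before it is updated.
            g_h = err * w2[j] if h[j] > 0 else 0.0
            w2[j] -= lr * err * h[j]
            for i in range(len(row)):
                row[i] -= lr * g_h * x[i]
    return W1, w2

def aggregate(W1, w2, subnets, trained):
    """Average each hidden unit over the workers that trained it (assumed rule).
    Units that no worker sampled keep their previous values."""
    new_W1, new_w2 = [row[:] for row in W1], w2[:]
    for j in range(len(W1)):
        owners = [(k, idx.index(j)) for k, idx in enumerate(subnets) if j in idx]
        if not owners:
            continue
        n = len(owners)
        new_W1[j] = [sum(trained[k][0][p][i] for k, p in owners) / n
                     for i in range(len(W1[j]))]
        new_w2[j] = sum(trained[k][1][p] for k, p in owners) / n
    return new_W1, new_w2

def train(data, d_in, hidden=8, workers=2, keep=4, rounds=20, lr=0.05, seed=0):
    rng = random.Random(seed)
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d_in)] for _ in range(hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    for _ in range(rounds):
        subnets = sample_subnetworks(hidden, workers, keep, rng)  # resample
        trained = [local_sgd(*extract(W1, w2, idx), data, lr) for idx in subnets]
        W1, w2 = aggregate(W1, w2, subnets, trained)              # aggregate
    return W1, w2
```

After training, calling `extract(W1, w2, idx)` with any sampled index set yields a smaller dense model with no further processing, which is the sense in which pruning at deployment is "zero-cost" in this sketch. How TwIST itself chooses, weights, or evaluates subnetworks is not specified in the abstract and is not modeled here.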

