Guided Transfer Learning for Discrete Diffusion Models
This work addresses a practical problem for researchers and practitioners who use discrete diffusion models in data-scarce scenarios. Its transfer learning approach is new for discrete domains, though incremental in that it adapts techniques already developed for continuous DMs.
The paper tackles the challenge of applying discrete diffusion models (DMs) in small-data regimes by introducing Guided Transfer Learning (GTL), a method that adapts a pretrained DM to a target distribution without modifying the denoiser. A scheduling mechanism reduces the computational cost of guidance to linear scaling in vocabulary size, which enables longer sequence generation. Results show that GTL becomes increasingly effective as target data shrinks, while weight fine-tuning is often preferable when the target dataset is large. Transfer performance degrades when the source and target distributions overlap poorly, because the ratio-based classifier used for guidance becomes unreliable.
Discrete diffusion models (DMs) have achieved strong performance in language and other discrete domains, offering a compelling alternative to autoregressive modeling. Yet this performance typically depends on large training datasets, which limits DMs in small-data regimes, a common situation under real-world constraints. To address this challenge, recent work in continuous DMs suggests that transfer learning via classifier ratio-based guidance can adapt a pretrained DM to a related target distribution, often outperforming alternatives such as full-weight fine-tuning on the target data. By contrast, transfer learning for discrete DMs remains unexplored. We address this gap by exploring practical analogues of ratio-based transfer learning for discrete DMs. Our theoretical analysis shows that a direct extension of existing ratio-based guidance is computationally prohibitive, scaling with vocabulary size. To overcome this limitation, we introduce a scheduling mechanism that yields a practical algorithm, Guided Transfer Learning for discrete diffusion models (GTL). GTL enables sampling from a target distribution without modifying the pretrained denoiser and reduces the cost to linear scaling in vocabulary size, which in turn supports longer sequence generation. We evaluate GTL on sequential data, including synthetic Markov chains and language modeling tasks, and provide a detailed empirical analysis of its behavior. The results highlight a clear trade-off: when target datasets are large, weight fine-tuning is often preferable, whereas GTL becomes increasingly effective as target data shrinks. Finally, we experimentally demonstrate a key failure mode of GTL: when the source and target distributions overlap poorly, the ratio-based classifier required for guidance becomes unreliable, limiting transfer performance.
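To make the idea of ratio-based guidance concrete, the following is a minimal toy sketch, not the GTL algorithm itself. It assumes a single categorical distribution over a three-token vocabulary and uses the exact density ratio between target and source; in practice this ratio would be estimated by a classifier and would depend on the noisy state and the diffusion timestep. The function name `guided_probs` and the example distributions are hypothetical, chosen only for illustration.

```python
import math


def guided_probs(source_probs, log_ratios):
    """Reweight source token probabilities by a (log) target/source ratio.

    Illustrative assumption: all source probabilities are strictly positive,
    and log_ratios[i] approximates log(p_target(i) / p_source(i)).
    """
    # Unnormalised log-scores: log p_source(i) + log r(i).
    scores = [math.log(p) + lr for p, lr in zip(source_probs, log_ratios)]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical source and target distributions over a 3-token vocabulary.
source = [0.5, 0.3, 0.2]
target = [0.2, 0.3, 0.5]

# With the exact ratio, guidance recovers the target distribution.
log_ratios = [math.log(t / s) for s, t in zip(source, target)]
guided = guided_probs(source, log_ratios)
```

In this toy setting the source probabilities are left untouched and only reweighted, which mirrors the claim that the pretrained denoiser need not be modified. The sketch evaluates one ratio per vocabulary entry, so its cost grows linearly with vocabulary size for a single position. It also hints at the failure mode described above: where the source assigns very little mass to tokens the target favors, the ratio becomes large and hard to estimate reliably.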