LGJun 26, 2025

Distilling Normalizing Flows

Steven Walton, Valeriy Klyukin, Maksim Artemev, Denis Derkach, Nikita Orlov, Humphrey Shi

arXiv:2506.21003v17.12 citationsh-index: 72025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

Originality Incremental advance

AI Analysis

This work addresses the challenge of training efficient generative models for applications requiring high sampling quality, though it is incremental as it builds on existing normalizing flow architectures.

The paper tackles the problem of improving sampling quality and density estimation in smaller normalizing flow models by introducing novel knowledge distillation techniques, achieving substantial performance gains over non-distilled students with proportionally increased throughput.

Explicit density learners are becoming an increasingly popular technique for generative models because of their ability to better model probability distributions. They have advantages over Generative Adversarial Networks due to their ability to perform density estimation and having exact latent-variable inference. This has many advantages, including: being able to simply interpolate, calculate sample likelihood, and analyze the probability distribution. The downside of these models is that they are often more difficult to train and have lower sampling quality. Normalizing flows are explicit density models, that use composable bijective functions to turn an intractable probability function into a tractable one. In this work, we present novel knowledge distillation techniques to increase sampling quality and density estimation of smaller student normalizing flows. We seek to study the capacity of knowledge distillation in Compositional Normalizing Flows to understand the benefits and weaknesses provided by these architectures. Normalizing flows have unique properties that allow for a non-traditional forms of knowledge transfer, where we can transfer that knowledge within intermediate layers. We find that through this distillation, we can make students significantly smaller while making substantial performance gains over a non-distilled student. With smaller models there is a proportionally increased throughput as this is dependent upon the number of bijectors, and thus parameters, in the network.

View on arXiv PDF

Similar