Caveats for information bottleneck in deterministic scenarios
This work identifies limitations of the information bottleneck (IB) framework in deterministic settings such as classification, which are relevant to researchers in machine learning and information theory. It is incremental in that it builds on existing IB theory to identify and address specific issues.
The paper examines the information bottleneck (IB) method in scenarios where the output Y is a deterministic function of the input X, as in many classification tasks. It demonstrates three caveats: the IB curve cannot be recovered by maximizing the IB Lagrangian for different values of β, trivial solutions exist at all points of the IB curve, and the layers of a multi-layer classifier that achieves low prediction error cannot exhibit a strict trade-off between compression and prediction. It proposes a new functional to address the first caveat and demonstrates the three caveats on the MNIST dataset.
Information bottleneck (IB) is a method for extracting information from one random variable $X$ that is relevant for predicting another random variable $Y$. To do so, IB identifies an intermediate "bottleneck" variable $T$ that has low mutual information $I(X;T)$ and high mutual information $I(Y;T)$. The "IB curve" characterizes the set of bottleneck variables that achieve maximal $I(Y;T)$ for a given $I(X;T)$, and is typically explored by maximizing the "IB Lagrangian", $I(Y;T) - βI(X;T)$. In some cases, $Y$ is a deterministic function of $X$, including many classification problems in supervised learning where the output class $Y$ is a deterministic function of the input $X$. We demonstrate three caveats when using IB in any situation where $Y$ is a deterministic function of $X$: (1) the IB curve cannot be recovered by maximizing the IB Lagrangian for different values of $β$; (2) there are "uninteresting" trivial solutions at all points of the IB curve; and (3) for multi-layer classifiers that achieve low prediction error, different layers cannot exhibit a strict trade-off between compression and prediction, contrary to a recent proposal. We also show that when $Y$ is a small perturbation away from being a deterministic function of $X$, these three caveats arise in an approximate way. To address problem (1), we propose a functional that, unlike the IB Lagrangian, can recover the IB curve in all cases. We demonstrate the three caveats on the MNIST dataset.
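Caveats (1) and (2) can be illustrated numerically with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper's experiments: the toy distribution ($X$ uniform on four values, $Y = X \bmod 2$), the helper names `mutual_information`, `trivial_bottleneck`, `ib_point`, and `best_mixing_weight`, and the one-parameter family of bottlenecks in which $T$ copies $Y$ with probability $a$ and otherwise emits a constant symbol. Because $Y - X - T$ is a Markov chain, the data processing inequality gives $I(Y;T) \le I(X;T)$, so any $T$ with $I(Y;T) = I(X;T)$ lies on the IB curve.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in bits) of a 2-D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    pa = joint.sum(axis=1, keepdims=True)   # row marginal, shape (n, 1)
    pb = joint.sum(axis=0, keepdims=True)   # column marginal, shape (1, m)
    mask = joint > 0                        # skip zero cells: 0 * log 0 = 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])))

# Toy deterministic setting (an assumption for illustration):
# X is uniform on {0, 1, 2, 3} and Y = f(X) = X mod 2, so H(Y) = 1 bit.
p_x = np.full(4, 0.25)
f = np.array([0, 1, 0, 1])
n_y = 2

def trivial_bottleneck(a):
    """p(t|x): T = f(x) with probability a, else a constant extra symbol."""
    p_t_given_x = np.zeros((len(p_x), n_y + 1))
    p_t_given_x[np.arange(len(p_x)), f] = a
    p_t_given_x[:, n_y] = 1.0 - a
    return p_t_given_x

def ib_point(a):
    """Return (I(X;T), I(Y;T)) for the bottleneck with mixing weight a."""
    p_xt = p_x[:, None] * trivial_bottleneck(a)   # joint p(x, t)
    p_yt = np.zeros((n_y, n_y + 1))
    np.add.at(p_yt, f, p_xt)                      # sum rows of p(x, t) by y = f(x)
    return mutual_information(p_xt), mutual_information(p_yt)

def best_mixing_weight(beta, grid=np.linspace(0.0, 1.0, 11)):
    """Mixing weight on the grid that maximizes I(Y;T) - beta * I(X;T)."""
    values = []
    for a in grid:
        i_xt, i_yt = ib_point(a)
        values.append(i_yt - beta * i_xt)
    return float(grid[int(np.argmax(values))])

if __name__ == "__main__":
    for a in (0.0, 0.25, 0.5, 0.75, 1.0):
        i_xt, i_yt = ib_point(a)
        print(f"a={a:.2f}  I(X;T)={i_xt:.3f}  I(Y;T)={i_yt:.3f}")
    for beta in (0.5, 1.5):
        print(f"beta={beta}: Lagrangian maximized at a={best_mixing_weight(beta)}")
```

In this toy case the family satisfies $I(X;T) = I(Y;T) = a \cdot H(Y)$, so it sweeps the whole increasing segment of the IB curve using bottlenecks that merely copy $Y$ or discard it. Along the family the IB Lagrangian equals $(1-\beta)\,a\,H(Y)$, which is linear in $a$: within this family the maximizer is an endpoint ($a = 1$ for $\beta < 1$, $a = 0$ for $\beta > 1$), and interior points are never singled out by any value of $\beta$.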