Data efficiency, dimensionality reduction, and the generalized symmetric information bottleneck
This work addresses data efficiency for machine learning practitioners by providing a theoretical framework for more data-efficient dimensionality reduction, though it appears incremental because it extends existing methods (the Information Bottleneck and its symmetric variant).
The paper tackles the problem of data efficiency in dimensionality reduction by introducing the Generalized Symmetric Information Bottleneck (GSIB) and showing that, in typical situations, compressing two variables simultaneously requires qualitatively less data to achieve the same errors than compressing them one at a time.
The Symmetric Information Bottleneck (SIB), an extension of the more familiar Information Bottleneck, is a dimensionality reduction technique that simultaneously compresses two random variables to preserve information between their compressed versions. We introduce the Generalized Symmetric Information Bottleneck (GSIB), which explores different functional forms of the cost of such simultaneous reduction. We then explore the dataset size requirements of such simultaneous compression. We do this by deriving bounds and root-mean-squared estimates of statistical fluctuations of the involved loss functions. We show that, in typical situations, simultaneous GSIB compression requires qualitatively less data to achieve the same errors than compressing the variables one at a time. We suggest that this is an example of a more general principle that simultaneous compression is more data efficient than independent compression of each of the input variables.
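As a rough illustration of why statistical fluctuations can depend on how many variables are compressed, the following toy sketch estimates mutual information between two independent discrete variables from finite samples. It is not the paper's derivation or the GSIB loss: the function names, the cardinalities (4 and 64), and the sample size are assumptions chosen for illustration only. The true information is zero, so any positive estimate is a sampling artifact. For the plug-in estimator this spurious information is known to scale roughly as (K_X - 1)(K_Y - 1) / (2N) nats for large N, so it shrinks when both variables have low cardinality.

```python
import numpy as np


def plugin_mutual_information(x, y, kx, ky):
    """Plug-in (maximum-likelihood) estimate of I(X;Y) in nats from paired integer samples."""
    joint = np.zeros((kx, ky))
    np.add.at(joint, (x, y), 1.0)          # histogram of paired samples
    joint /= joint.sum()                   # empirical joint distribution
    px = joint.sum(axis=1, keepdims=True)  # empirical marginal of X, shape (kx, 1)
    py = joint.sum(axis=0, keepdims=True)  # empirical marginal of Y, shape (1, ky)
    mask = joint > 0                       # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))


def mean_spurious_information(kx, ky, n_samples, n_trials=200, seed=0):
    """Average plug-in estimate for independent uniform variables (true value is 0)."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_trials):
        x = rng.integers(0, kx, size=n_samples)
        y = rng.integers(0, ky, size=n_samples)
        estimates.append(plugin_mutual_information(x, y, kx, ky))
    return float(np.mean(estimates))


n = 500  # hypothetical dataset size
# Stand-in for one-sided compression: a 4-state compressed variable vs. a 64-state raw one.
one_sided = mean_spurious_information(kx=4, ky=64, n_samples=n)
# Stand-in for simultaneous compression: both variables reduced to 4 states.
two_sided = mean_spurious_information(kx=4, ky=4, n_samples=n)
print(f"compressed vs. raw:        {one_sided:.4f} nats of spurious information")
print(f"compressed vs. compressed: {two_sided:.4f} nats of spurious information")
```

In this toy setting the estimate involving the high-cardinality variable carries much more spurious information at the same sample size, which is consistent in spirit with the abstract's claim about data efficiency, although the paper's actual bounds concern the GSIB loss functions rather than this simple estimator.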