LG, IT, SP | Oct 13, 2020

Information-Theoretic Bounds on Transfer Generalization Gap Based on Jensen-Shannon Divergence

arXiv:2010.09484v4 | 18 citations
AI Analysis

This work provides theoretical guarantees for transfer learning practitioners by addressing domain shift issues, though it is incremental as it builds on existing divergence-based bounds.

This paper bounds the transfer generalization gap in transfer learning by introducing information-theoretic upper bounds based on a generalized Jensen-Shannon divergence. Unlike prior KL divergence-based bounds, these bounds remain valid even when the source and target distributions have non-overlapping supports. They also hold for unbounded loss functions with bounded cumulant generating functions.

In transfer learning, training and testing data sets are drawn from different data distributions. The transfer generalization gap is the difference between the population loss on the target data distribution and the training loss. The training data set generally includes data drawn from both source and target distributions. This work presents novel information-theoretic upper bounds on the average transfer generalization gap that capture $(i)$ the domain shift between the target data distribution $P'_Z$ and the source distribution $P_Z$ through a two-parameter family of generalized $(α_1,α_2)$-Jensen-Shannon (JS) divergences; and $(ii)$ the sensitivity of the transfer learner output $W$ to each individual sample of the data set $Z_i$ via the mutual information $I(W;Z_i)$. For $α_1 \in (0,1)$, the $(α_1,α_2)$-JS divergence can be bounded even when the support of $P_Z$ is not included in that of $P'_Z$. This contrasts with the Kullback-Leibler (KL) divergence $D_{KL}(P_Z||P'_Z)$-based bounds of Wu et al. [1], which are vacuous under this assumption. Moreover, the obtained bounds hold for unbounded loss functions with bounded cumulant generating functions, unlike the $φ$-divergence-based bound of Wu et al. [1]. We also obtain new upper bounds on the average transfer excess risk in terms of the $(α_1,α_2)$-JS divergence for empirical weighted risk minimization (EWRM), which minimizes the weighted average of the training losses over the source and target data sets. Finally, we provide a numerical example to illustrate the merits of the introduced bounds.
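The contrast between the KL-based and JS-based domain-shift terms can be checked numerically on a toy example. The sketch below is illustrative only: the abstract does not state the formula for the $(α_1,α_2)$-JS divergence, so the function `generalized_js` uses an assumed two-parameter form, $α_2 D_{KL}(P'_Z||M) + (1-α_2) D_{KL}(P_Z||M)$ with mixture $M = α_1 P'_Z + (1-α_1) P_Z$, which reduces to the standard JS divergence at $α_1 = α_2 = 1/2$. The function names and the four-point distributions are hypothetical.

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats for discrete distributions given as lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # 0 * log(0 / q) is taken as 0 by convention
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total

def generalized_js(p_target, p_source, alpha1, alpha2):
    """ASSUMED two-parameter JS form (not taken from the abstract):
    alpha2 * KL(target || mix) + (1 - alpha2) * KL(source || mix),
    where mix = alpha1 * target + (1 - alpha1) * source."""
    mix = [alpha1 * t + (1 - alpha1) * s for t, s in zip(p_target, p_source)]
    return alpha2 * kl(p_target, mix) + (1 - alpha2) * kl(p_source, mix)

# Hypothetical source and target distributions with disjoint supports.
p_source = [0.5, 0.5, 0.0, 0.0]
p_target = [0.0, 0.0, 0.5, 0.5]

kl_value = kl(p_source, p_target)                         # infinite
js_value = generalized_js(p_target, p_source, 0.5, 0.5)   # equals log 2

print("D_KL(P_Z || P'_Z) =", kl_value)
print("JS with alpha1 = alpha2 = 0.5 =", js_value)
```

For any $α_1 \in (0,1)$ the mixture places mass wherever either distribution does, so both KL terms in the assumed form stay finite, whereas $D_{KL}(P_Z||P'_Z)$ is infinite as soon as $P_Z$ has mass outside the support of $P'_Z$.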

