IT LG Jul 12, 2022

On the Generalization for Transfer Learning: An Information-Theoretic Analysis

Xuetong Wu, Jonathan H. Manton, Uwe Aickelin, Jingge Zhu

arXiv:2207.05377v28.021 citations h-index: 48

Originality: Incremental advance

AI Analysis

This work addresses the theoretical understanding of generalization in transfer learning, which is crucial for machine learning applications where training and testing data distributions differ, though it is incremental in extending existing bounds and algorithms.

The paper provides an information-theoretic analysis of generalization error and excess risk in transfer learning, showing that KL divergence and other divergences play key roles in bounding these errors, and introduces an algorithm (InfoBoost) that dynamically adjusts importance weights to improve practical applicability.

Transfer learning, or domain adaptation, is concerned with machine learning problems in which training and testing data come from possibly different probability distributions. In this work, we give an information-theoretic analysis of the generalization error and excess risk of transfer learning algorithms. Our results suggest, perhaps as expected, that the Kullback-Leibler (KL) divergence $D(μ\|μ')$ plays an important role in the characterizations where $μ$ and $μ'$ denote the distribution of the training data and the testing data, respectively. Specifically, we provide generalization error and excess risk upper bounds for learning algorithms where data from both distributions are available in the training phase. Recognizing that the bounds could be sub-optimal in general, we provide improved excess risk upper bounds for a certain class of algorithms, including the empirical risk minimization (ERM) algorithm, by making stronger assumptions through the central condition. To demonstrate the usefulness of the bounds, we further extend the analysis to the Gibbs algorithm and the noisy stochastic gradient descent method. We then generalize the mutual information bound with other divergences such as $φ$-divergence and Wasserstein distance, which may lead to tighter bounds and can handle the case when $μ$ is not absolutely continuous with respect to $μ'$. Several numerical results are provided to demonstrate our theoretical findings. Lastly, to address the problem that the bounds are often not directly applicable in practice due to the absence of the distributional knowledge of the data, we develop an algorithm (called InfoBoost) that dynamically adjusts the importance weights for both source and target data based on certain information measures. The empirical results show the effectiveness of the proposed algorithm.
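The remark about absolute continuity can be made concrete with a toy example. The sketch below is illustrative only and is not taken from the paper: the three-point distributions `mu`, `mu_prime`, and `mu_overlap` and the helper functions are hypothetical. It shows that $D(μ\|μ')$ becomes infinite as soon as $μ$ places mass where $μ'$ has none, while the 1-Wasserstein distance between the same pair stays finite.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) in nats for discrete distributions on a shared support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0
    # If p puts mass where q has none, p is not absolutely continuous
    # with respect to q and the divergence is infinite.
    if np.any(q[support] == 0):
        return float("inf")
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

def wasserstein_1(p, q, points):
    """1-Wasserstein distance between discrete distributions on sorted 1-D points."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    points = np.asarray(points, dtype=float)
    # In one dimension, W1 is the integral of the absolute CDF difference.
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))[:-1]
    return float(np.sum(cdf_gap * np.diff(points)))

# Hypothetical toy distributions on the points 0, 1, 2 (assumed for illustration).
points = [0.0, 1.0, 2.0]
mu = [0.5, 0.5, 0.0]            # stands in for the training distribution
mu_prime = [0.0, 0.5, 0.5]      # testing distribution with no mass at 0
mu_overlap = [0.25, 0.5, 0.25]  # testing distribution covering all points

kl_disjoint = kl_divergence(mu, mu_prime)      # infinite: support mismatch
kl_finite = kl_divergence(mu, mu_overlap)      # finite: 0.5 * ln 2
w1_disjoint = wasserstein_1(mu, mu_prime, points)  # finite even when KL is not

print(kl_disjoint, kl_finite, w1_disjoint)
```

In this toy case a KL-based bound is vacuous for the pair `mu`, `mu_prime`, whereas a Wasserstein-based quantity remains informative, which is the motivation the abstract gives for considering other divergences.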
