IT · Jul 12, 2022
On the Generalization for Transfer Learning: An Information-Theoretic Analysis
Xuetong Wu, Jonathan H. Manton, Uwe Aickelin et al.
Transfer learning, or domain adaptation, is concerned with machine learning problems in which training and testing data come from possibly different probability distributions. In this work, we give an information-theoretic analysis of the generalization error and excess risk of transfer learning algorithms. Our results suggest, perhaps as expected, that the Kullback-Leibler (KL) divergence $D(μ\|μ')$ plays an important role in the characterizations, where $μ$ and $μ'$ denote the distributions of the training data and the testing data, respectively. Specifically, we provide generalization error and excess risk upper bounds for learning algorithms where data from both distributions are available in the training phase. Recognizing that the bounds could be sub-optimal in general, we provide improved excess risk upper bounds for a certain class of algorithms, including the empirical risk minimization (ERM) algorithm, by making stronger assumptions through the central condition. To demonstrate the usefulness of the bounds, we further extend the analysis to the Gibbs algorithm and the noisy stochastic gradient descent method. We then generalize the mutual information bound with other divergences such as $φ$-divergence and Wasserstein distance, which may lead to tighter bounds and can handle the case when $μ$ is not absolutely continuous with respect to $μ'$. Several numerical results are provided to demonstrate our theoretical findings. Lastly, to address the problem that the bounds are often not directly applicable in practice due to the absence of distributional knowledge of the data, we develop an algorithm (called InfoBoost) that dynamically adjusts the importance weights for both source and target data based on certain information measures. The empirical results show the effectiveness of the proposed algorithm.
IT · May 6, 2022
Fast Rate Generalization Error Bounds: Variations on a Theme
Xuetong Wu, Jonathan H. Manton, Uwe Aickelin et al.
A recent line of work, initiated by Russo and Xu, has shown that the generalization error of a learning algorithm can be upper bounded by information measures. In most of the relevant works, the convergence rate of the expected generalization error is of the form $O(\sqrt{λ/n})$, where $λ$ is some information-theoretic quantity such as the mutual information between the data sample and the learned hypothesis. However, such a learning rate is typically considered to be "slow", compared to a "fast rate" of $O(1/n)$ in many learning scenarios. In this work, we first show that the square root does not necessarily imply a slow rate, and a fast rate ($O(1/n)$) result can still be obtained using this bound under appropriate assumptions. Furthermore, we identify the key conditions needed for the fast rate generalization error, which we call the $(η, c)$-central condition. Under this condition, we give information-theoretic bounds on the generalization error and excess risk, with a convergence rate of $O(λ/n)$ for specific learning algorithms such as empirical risk minimization. Finally, analytical examples are given to show the effectiveness of the bounds.
LG · May 10, 2022
On Causality in Domain Adaptation and Semi-Supervised Learning: an Information-Theoretic Analysis for Parametric Models
Xuetong Wu, Mingming Gong, Jonathan H. Manton et al.
Recent advancements in unsupervised domain adaptation (UDA) and semi-supervised learning (SSL), particularly incorporating causality, have led to significant methodological improvements in these learning problems. However, a formal theory that explains the role of causality in the generalization performance of UDA/SSL is still lacking. In this paper, we consider the UDA/SSL scenarios where we access $m$ labelled source data and $n$ unlabelled target data as training instances under different causal settings with a parametric probabilistic model. We study the learning performance (e.g., excess risk) of prediction in the target domain from an information-theoretic perspective. Specifically, we distinguish two scenarios: the learning problem is called causal learning if the feature is the cause and the label is the effect, and is called anti-causal learning otherwise. We show that in causal learning, the excess risk depends on the size of the source sample at a rate of $O(\frac{1}{m})$ only if the labelling distribution between the source and target domains remains unchanged. In anti-causal learning, we show that the unlabelled data dominate the performance at a rate of typically $O(\frac{1}{n})$. These results bring out the relationship between the data sample size and the hardness of the learning problem with different causal mechanisms.
IT · Mar 26, 2023
Fast Rate Information-theoretic Bounds on Generalization Errors
Xuetong Wu, Jonathan H. Manton, Uwe Aickelin et al.
The generalization error of a learning algorithm refers to the discrepancy between the loss of a learning algorithm on training data and that on unseen testing data. Various information-theoretic bounds on the generalization error have been derived in the literature, where the mutual information between the training data and the hypothesis (the output of the learning algorithm) plays an important role. Focusing on the individual sample mutual information bound by Bu et al., which itself is a tightened version of the first bound on the topic by Russo et al. and Xu et al., this paper investigates the tightness of these bounds in terms of the dependence of their convergence rates on the sample size $n$. It has been recognized that these bounds are in general not tight, as is readily verified for the exemplary quadratic Gaussian mean estimation problem, where the individual sample mutual information bound scales as $O(\sqrt{1/n})$ while the true generalization error scales as $O(1/n)$. The first contribution of this paper is to show that the same bound can in fact be asymptotically tight if an appropriate assumption is made. In particular, we show that the fast rate can be recovered when the assumption is made on the excess risk instead of the loss function, as is usually done in the existing literature. A theoretical justification is given for this choice. The second contribution of the paper is a new set of generalization error bounds based on the $(η, c)$-central condition, a condition that is relatively easy to verify and has the property that the mutual information term directly determines the convergence rate of the bound. Several analytical and numerical examples are given to show the effectiveness of these bounds.
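The $O(1/n)$ scaling of the true generalization error in the quadratic Gaussian mean estimation problem can be checked numerically. The sketch below is illustrative only and assumes a specific setup that is not spelled out in the abstract: samples $Z_i \sim N(μ, σ^2)$, the learner outputs the sample mean, and the loss is $(w - z)^2$. Under those assumptions the expected generalization error works out to $2σ^2/n$. The function names and default parameters are hypothetical.

```python
import random


def exact_gen_error(n, sigma=1.0):
    """Closed-form expected generalization error, 2*sigma^2/n, for the
    assumed setup (sample-mean learner, squared loss, Gaussian data)."""
    return 2.0 * sigma ** 2 / n


def monte_carlo_gen_error(n, sigma=1.0, mu=0.0, trials=5000, seed=0):
    """Monte Carlo estimate of E[population risk - empirical risk]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        w = sum(sample) / n  # learned hypothesis: the sample mean
        train_loss = sum((w - z) ** 2 for z in sample) / n
        # Population risk of a fixed w under N(mu, sigma^2), in closed form:
        # E[(w - Z)^2] = (w - mu)^2 + sigma^2.
        test_loss = (w - mu) ** 2 + sigma ** 2
        total += test_loss - train_loss
    return total / trials
```

Calling `monte_carlo_gen_error(n)` for increasing `n` and comparing against `exact_gen_error(n)` shows the gap shrinking in proportion to $1/n$ rather than $1/\sqrt{n}$.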
CL · Aug 29, 2024
An Interpretable and Crosslingual Method for Evaluating Second-Language Dialogues
Rena Gao, Jingxuan Wu, Xuetong Wu et al.
We analyse the cross-lingual transferability of a dialogue evaluation framework that assesses the relationships between micro-level linguistic features (e.g. backchannels) and macro-level interactivity labels (e.g. topic management), originally designed for English-as-a-second-language dialogues. To this end, we develop CNIMA (Chinese Non-Native Interactivity Measurement and Automation), a Chinese-as-a-second-language labelled dataset with 10K dialogues. We found the evaluation framework to be robust across distinct languages: English and Chinese, revealing language-specific and language-universal relationships between micro-level and macro-level features. Next, we propose an automated, interpretable approach with low data requirement that scores the overall quality of a second-language dialogue based on the framework. Our approach is interpretable in that it reveals the key linguistic and interactivity features that contributed to the overall quality score. As our approach does not require labelled data, it can also be adapted to other languages for second-language dialogue evaluation.
LG · Mar 9, 2023
On the Value of Stochastic Side Information in Online Learning
Junzhang Jia, Xuetong Wu, Jingge Zhu et al.
We study the effectiveness of stochastic side information in deterministic online learning scenarios. We propose a forecaster to predict a deterministic sequence, where its performance is evaluated against an expert class. We assume that certain stochastic side information is available to the forecaster but not the experts. We define the minimax expected regret for evaluating the forecaster's performance, for which we obtain both upper and lower bounds. Consequently, our results characterize the improvement in the regret due to the stochastic side information. Compared with the classical online learning problem, in which the regret scales as $O(\sqrt{n})$, the regret can be negative when the stochastic side information is more powerful than the experts. To illustrate, we apply the proposed bounds to two concrete examples of different types of side information.
CL · Feb 20, 2025
Can LLMs Simulate L2-English Dialogue? An Information-Theoretic Analysis of L1-Dependent Biases
Rena Gao, Xuetong Wu, Tatsuki Kuribayashi et al.
This study evaluates Large Language Models' (LLMs) ability to simulate non-native-like English use observed in human second language (L2) learners interfered with by their native first language (L1). In dialogue-based interviews, we prompt LLMs to mimic L2 English learners with specific L1s (e.g., Japanese, Thai, Urdu) across seven languages, comparing their outputs to real L2 learner data. Our analysis examines L1-driven linguistic biases, such as reference word usage and avoidance behaviors, using information-theoretic and distributional density measures. Results show that modern LLMs (e.g., Qwen2.5, LLAMA3.3, DeepseekV3, GPT-4o) replicate L1-dependent patterns observed in human L2 data, with distinct influences from various languages (e.g., Japanese, Korean, and Mandarin significantly affect tense agreement, and Urdu influences noun-verb collocations). Our results reveal the potential of LLMs for L2 dialogue generation and evaluation for future educational applications.
CL · Oct 31, 2024
'No' Matters: Out-of-Distribution Detection in Multimodality Long Dialogue
Rena Gao, Xuetong Wu, Siwen Luo et al.
Out-of-distribution (OOD) detection in multimodal contexts is essential for identifying deviations in combined inputs from different modalities, particularly in applications like open-domain dialogue systems or real-life dialogue interactions. This paper aims to improve the user experience in multi-round long dialogues by efficiently detecting OOD dialogues and images. We introduce a novel scoring framework named Dialogue Image Aligning and Enhancing Framework (DIAEF) that integrates visual language models with newly proposed scores that detect OOD in two key scenarios: (1) mismatches between the dialogue and image input pair and (2) input pairs with previously unseen labels. Our experimental results, derived from various benchmarks, demonstrate that integrating image and multi-round dialogue OOD detection is more effective with previously unseen labels than using either modality independently. In the presence of mismatched pairs, our proposed score effectively identifies these mismatches and demonstrates strong robustness in long dialogues. This approach enhances domain-aware, adaptive conversational agents and establishes baselines for future studies.
LG · Sep 3, 2021
A Bayesian Approach to (Online) Transfer Learning: Theory and Algorithms
Xuetong Wu, Jonathan H. Manton, Uwe Aickelin et al.
Transfer learning is a machine learning paradigm where knowledge from one problem is utilized to solve a new but related problem. While it is conceivable that knowledge from one task could be useful for solving a related task, if not executed properly, transfer learning algorithms can impair the learning performance instead of improving it, a phenomenon commonly known as negative transfer. In this paper, we study transfer learning from a Bayesian perspective, where a parametric statistical model is used. Specifically, we study three variants of transfer learning problems: instantaneous, online, and time-variant transfer learning. For each problem, we define an appropriate objective function, and provide either exact expressions or upper bounds on the learning performance using information-theoretic quantities, which allow simple and explicit characterizations when the sample size becomes large. Furthermore, examples show that the derived bounds are accurate even for small sample sizes. The obtained bounds give valuable insights into the effect of prior knowledge for transfer learning, at least with respect to our Bayesian formulation of the transfer learning problem. In particular, we formally characterize the conditions under which negative transfer occurs. Lastly, we devise two (online) transfer learning algorithms that are amenable to practical implementations, one of which does not require the parametric assumption. We demonstrate the effectiveness of our algorithms with real data sets, focusing primarily on when the source and target data have strong similarities.
LG · May 4, 2021
Online Transfer Learning: Negative Transfer and Effect of Prior Knowledge
Xuetong Wu, Jonathan H. Manton, Uwe Aickelin et al.
Transfer learning is a machine learning paradigm where the knowledge from one task is utilized to resolve the problem in a related task. On the one hand, it is conceivable that knowledge from one task could be useful for solving a related problem. On the other hand, it is also recognized that if not executed properly, transfer learning algorithms could in fact impair the learning performance instead of improving it - commonly known as "negative transfer". In this paper, we study the online transfer learning problems where the source samples are given in an offline way while the target samples arrive sequentially. We define the expected regret of the online transfer learning problem and provide upper bounds on the regret using information-theoretic quantities. We also obtain exact expressions for the bounds when the sample size becomes large. Examples show that the derived bounds are accurate even for small sample sizes. Furthermore, the obtained bounds give valuable insight on the effect of prior knowledge for transfer learning in our formulation. In particular, we formally characterize the conditions under which negative transfer occurs.
LG · Dec 1, 2020
Transfer learning to enhance amenorrhea status prediction in cancer and fertility data with missing values
Xuetong Wu, Hadi Akbarzadeh Khorshidi, Uwe Aickelin et al.
Collecting sufficient labelled training data for health and medical problems is difficult (Antropova, et al., 2018). Also, missing values are unavoidable in health and medical datasets, and tackling the problems arising from inadequate instances and missingness is not straightforward (Snell, et al. 2017, Sterne, et al. 2009). However, machine learning algorithms have achieved significant success in many real-world healthcare problems, such as regression and classification, and these techniques could be a way to resolve these issues.
LG · Nov 16, 2020
Imputation techniques on missing values in breast cancer treatment and fertility data
Xuetong Wu, Hadi Akbarzadeh Khorshidi, Uwe Aickelin et al.
In the last few years, clinical decision support using data mining techniques has offered a more intelligent way to reduce decision error. However, clinical datasets often suffer from high missingness, which adversely impacts the quality of modelling if handled improperly. Imputing missing values provides an opportunity to resolve the issue. Conventional imputation methods adopt simple statistical analysis, such as mean imputation or discarding missing cases, which have many limitations and thus degrade the performance of learning. This study examines a series of machine learning based imputation methods and suggests an efficient approach to preparing a good-quality breast cancer (BC) dataset, in order to find the relationship between BC treatment and chemotherapy-related amenorrhoea, where the performance is evaluated by the accuracy of the prediction.
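For readers unfamiliar with the two conventional baselines named in the abstract, the sketch below shows discarding incomplete cases and column-wise mean imputation on a small table. It is a minimal illustration, not the study's pipeline: the representation (rows as lists, `None` for a missing value) and the function names are assumptions made here.

```python
def drop_incomplete(rows):
    """Discard every row that contains at least one missing value (None)."""
    return [row for row in rows if all(v is not None for v in row)]


def mean_impute(rows):
    """Replace each missing value with the mean of the observed values
    in the same column. A column with no observed values stays None."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(sum(observed) / len(observed) if observed else None)
    return [
        [col_means[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]
```

Dropping rows shrinks an already small dataset, while mean imputation keeps every row but ignores relationships between columns, which is one reason such baselines can degrade downstream learning.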
LG · May 18, 2020
Information-theoretic analysis for transfer learning
Xuetong Wu, Jonathan H. Manton, Uwe Aickelin et al.
Transfer learning, or domain adaptation, is concerned with machine learning problems in which training and testing data come from possibly different distributions (denoted as $μ$ and $μ'$, respectively). In this work, we give an information-theoretic analysis on the generalization error and the excess risk of transfer learning algorithms, following a line of work initiated by Russo and Zhou. Our results suggest, perhaps as expected, that the Kullback-Leibler (KL) divergence $D(μ\|μ')$ plays an important role in characterizing the generalization error in the settings of domain adaptation. Specifically, we provide generalization error upper bounds for general transfer learning algorithms and extend the results to a specific empirical risk minimization (ERM) algorithm where data from both distributions are available in the training phase. We further apply the method to iterative, noisy gradient descent algorithms, and obtain upper bounds which can be easily calculated, only using parameters from the learning algorithms. A few illustrative examples are provided to demonstrate the usefulness of the results. In particular, our bound is tighter in specific classification problems than the bound derived using Rademacher complexity.
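To give a feel for the divergence $D(μ\|μ')$ that drives these bounds, the sketch below evaluates it in closed form for the special case where both distributions are univariate Gaussians. The Gaussian assumption and the function name are illustrative choices made here, not a setting taken from the paper.

```python
import math


def kl_gaussian(mean_p, std_p, mean_q, std_q):
    """KL divergence D(P || Q) in nats for P = N(mean_p, std_p^2) and
    Q = N(mean_q, std_q^2), using the standard closed-form expression."""
    if std_p <= 0 or std_q <= 0:
        raise ValueError("standard deviations must be positive")
    return (
        math.log(std_q / std_p)
        + (std_p ** 2 + (mean_p - mean_q) ** 2) / (2.0 * std_q ** 2)
        - 0.5
    )


# Same variance, means one standard deviation apart.
print(kl_gaussian(0.0, 1.0, 1.0, 1.0))  # 0.5
```

The divergence is zero when the two distributions coincide and is not symmetric in its arguments, so it matters which of the training and testing distributions plays the role of $μ$.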