ML · Feb 18
Conjugate Learning Theory: Uncovering the Mechanisms of Trainability and Generalization in Deep Neural Networks
Binchuan Qi
In this work, we propose a notion of practical learnability grounded in finite sample settings, and develop a conjugate learning theoretical framework based on convex conjugate duality to characterize this learnability property. Building on this foundation, we demonstrate that training deep neural networks (DNNs) with mini-batch stochastic gradient descent (SGD) achieves global optima of empirical risk by jointly controlling the extreme eigenvalues of a structure matrix and the gradient energy, and we establish a corresponding convergence theorem. We further elucidate the impact of batch size and model architecture (including depth, parameter count, sparsity, skip connections, and other characteristics) on non-convex optimization. Additionally, we derive a model-agnostic lower bound for the achievable empirical risk, theoretically demonstrating that data determines the fundamental limit of trainability. On the generalization front, we derive deterministic and probabilistic bounds on generalization error based on generalized conditional entropy measures. The former explicitly delineates the range of generalization error, while the latter characterizes the distribution of generalization error relative to the deterministic bounds under independent and identically distributed (i.i.d.) sampling conditions. Furthermore, these bounds explicitly quantify the influence of three key factors: (i) information loss induced by irreversibility in the model, (ii) the maximum attainable loss value, and (iii) the generalized conditional entropy of features with respect to labels. Moreover, they offer a unified theoretical lens for understanding the roles of regularization, irreversible transformations, and network depth in shaping the generalization behavior of deep neural networks. Extensive experiments validate all theoretical predictions, confirming the framework's correctness and consistency.
LG · Mar 29, 2025
Towards Understanding the Optimization Mechanisms in Deep Learning
Binchuan Qi, Wei Gong, Li Li
In this paper, we adopt a probability distribution estimation perspective to explore the optimization mechanisms of supervised classification using deep neural networks. We demonstrate that, when employing the Fenchel-Young loss, despite the non-convex nature of the fitting error with respect to the model's parameters, global optimal solutions can be approximated by simultaneously minimizing both the gradient norm and the structural error. The former can be controlled through gradient descent algorithms. For the latter, we prove that it can be managed by increasing the number of parameters and ensuring parameter independence, thereby providing theoretical insights into mechanisms such as over-parameterization and random initialization. Ultimately, the paper validates the key conclusions of the proposed method through empirical results, illustrating its practical effectiveness.
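For readers unfamiliar with the Fenchel-Young loss named above, here is a minimal sketch of its standard definition, $L_\Omega(\theta; y) = \Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle$. It assumes one particular regularizer, the negative Shannon entropy on the probability simplex, whose conjugate is log-sum-exp; with that choice the loss reduces to softmax cross-entropy. The function names and the example scores are hypothetical and are not taken from the paper.

```python
import math

def logsumexp(theta):
    # Conjugate of negative Shannon entropy restricted to the simplex.
    m = max(theta)
    return m + math.log(sum(math.exp(t - m) for t in theta))

def neg_entropy(p):
    # Omega(p) = sum_i p_i log p_i, with the convention 0 * log 0 = 0.
    return sum(pi * math.log(pi) for pi in p if pi > 0)

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def fenchel_young_loss(theta, y):
    # L(theta; y) = Omega*(theta) + Omega(y) - <theta, y>
    dot = sum(t * yi for t, yi in zip(theta, y))
    return logsumexp(theta) + neg_entropy(y) - dot

# Assumed example: three-class scores and a one-hot label.
theta = [2.0, 0.5, -1.0]
y = [1.0, 0.0, 0.0]

fy = fenchel_young_loss(theta, y)
ce = -math.log(softmax(theta)[0])  # ordinary cross-entropy for class 0
```

Under this assumed regularizer, `fy` and `ce` agree up to floating-point error, and the loss is zero exactly when the target equals `softmax(theta)`. Other regularizers give other members of the same loss family.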
LG · Jun 9, 2024
Probability Distribution Learning and Its Application in Deep Learning
Binchuan Qi, Wei Gong, Li Li
Despite its empirical success, deep learning still lacks a comprehensive theoretical understanding of model fitting and generalization. This paper proposes the probability distribution (PD) learning framework to analyze the optimization and generalization mechanisms of deep learning. Within this framework, the conditional distribution of labels given features is the primary learning target, with the loss function, prior knowledge, and model properties explicitly characterized. Under these formulations, we establish theoretical guarantees on optimizability, even in non-convex settings, and derive generalization error bounds that provide meaningful explanations for practical performance. Specifically, we first prove theoretically that the Fenchel-Young loss is the natural and necessary choice for solving PD learning problems, thereby justifying the generality of conclusions based on this loss. Second, to capture the characteristics of deep neural networks (DNNs), we introduce the notions of $\mathcal{H}(ψ)$-convexity and $\mathcal{H}(Ψ)$-smoothness, which generalize the classical concepts of strong convexity and Lipschitz smoothness. Based on them, we provide a theoretical explanation for the effectiveness of SGD in training DNNs. Finally, we derive model-independent bounds on the expected risk and generalization error for trained models, revealing the influence of the training set size, regularization term, the mutual information between labels and features, and the information loss caused by model irreversibility on risk and generalization. Based on our theoretical analysis and experimental validation, we believe that the PD learning framework facilitates a deeper and more unified theoretical understanding of deep learning.
LG · Jun 7, 2024
Error Bounds of Supervised Classification from Information-Theoretic Perspective
Binchuan Qi
In this paper, we explore bounds on the expected risk when using deep neural networks for supervised classification from an information-theoretic perspective. First, we introduce model risk and fitting error, which are obtained by further decomposing the empirical risk. Model risk is the expected value of the loss under the model's predicted probabilities and depends exclusively on the model. Fitting error measures the disparity between the empirical risk and the model risk. We then derive an upper bound on the fitting error that links it to the back-propagated gradient and the model's parameter count. Furthermore, we demonstrate that the generalization error is bounded by the classification uncertainty, which is characterized by both the smoothness of the distribution and the sample size. Combining the bounds on fitting error and generalization error via the triangle inequality, we establish an upper bound on the expected risk. This bound is applied to provide theoretical explanations for overparameterization, non-convex optimization, and flat minima in deep learning. Finally, empirical verification confirms a significant positive correlation between the derived theoretical bounds and the practical expected risk, thereby affirming the practical relevance of the theoretical findings.
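The triangle-inequality step described above can be sketched as follows. The notation is an assumption introduced here for illustration, since the abstract does not fix symbols: $R$ is the expected risk, $\hat{R}$ the empirical risk, $R_{\mathrm{m}}$ the model risk, and the fitting and generalization errors are taken as absolute differences. The paper's own bound may be stated differently.

```latex
R = R_{\mathrm{m}} + (\hat{R} - R_{\mathrm{m}}) + (R - \hat{R})
  \le R_{\mathrm{m}}
    + \underbrace{|\hat{R} - R_{\mathrm{m}}|}_{\text{fitting error}}
    + \underbrace{|R - \hat{R}|}_{\text{generalization error}}
```

Under these assumptions, any upper bounds on the fitting error and the generalization error add to the model risk to give an upper bound on the expected risk.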