Adversarial Attack on Attackers: Post-Process to Mitigate Black-Box Score-Based Query AttacksSizhe Chen, Zhehao Huang, Qinghua Tao et al.
The score-based query attacks (SQAs) pose practical threats to deep neural networks by crafting adversarial perturbations within dozens of queries, only using the model's output scores. Nonetheless, we note that if the loss trend of the outputs is slightly perturbed, SQAs could be easily misled and thereby become much less effective. Following this idea, we propose a novel defense, namely Adversarial Attack on Attackers (AAA), to confound SQAs towards incorrect attack directions by slightly modifying the output logits. In this way, (1) SQAs are prevented regardless of the model's worst-case robustness; (2) the original model predictions are hardly changed, i.e., no degradation on clean accuracy; (3) the calibration of confidence scores can be improved simultaneously. Extensive experiments are provided to verify the above advantages. For example, by setting $\ell_\infty=8/255$ on CIFAR-10, our proposed AAA helps WideResNet-28 secure 80.59% accuracy under Square attack (2500 queries), while the best prior defense (i.e., adversarial training) only attains 67.44%. Since AAA attacks SQA's general greedy strategy, such advantages of AAA over 8 defenses can be consistently observed on 8 CIFAR-10/ImageNet models under 6 SQAs, using different attack targets, bounds, norms, losses, and strategies. Moreover, AAA calibrates better without hurting the accuracy. Our code is available at https://github.com/Sizhe-Chen/AAA.
Low-Rank Multitask Learning based on Tensorized SVMs and LSSVMsJiani Liu, Qinghua Tao, Ce Zhu et al.
Multitask learning (MTL) leverages task-relatedness to enhance performance. With the emergence of multimodal data, tasks can now be referenced by multiple indices. In this paper, we employ high-order tensors, with each mode corresponding to a task index, to naturally represent tasks referenced by multiple indices and preserve their structural relations. Based on this representation, we propose a general framework of low-rank MTL methods with tensorized support vector machines (SVMs) and least square support vector machines (LSSVMs), where the CP factorization is deployed over the coefficient tensor. Our approach allows to model the task relation through a linear combination of shared factors weighted by task-specific factors and is generalized to both classification and regression problems. Through the alternating optimization scheme and the Lagrangian function, each subproblem is transformed into a convex problem, formulated as a quadratic programming or linear system in the dual form. In contrast to previous MTL frameworks, our decision function in the dual induces a weighted kernel function with a task-coupling term characterized by the similarities of the task-specific factors, better revealing the explicit relations across tasks in MTL. Experimental results validate the effectiveness and superiority of our proposed methods compared to existing state-of-the-art approaches in MTL. The code of implementation will be available at https://github.com/liujiani0216/TSVM-MTL.
Efficient Generalization Improvement Guided by Random Weight PerturbationTao Li, Weihao Yan, Zehao Lei et al.
To fully uncover the great potential of deep neural networks (DNNs), various learning algorithms have been developed to improve the model's generalization ability. Recently, sharpness-aware minimization (SAM) establishes a generic scheme for generalization improvements by minimizing the sharpness measure within a small neighborhood and achieves state-of-the-art performance. However, SAM requires two consecutive gradient evaluations for solving the min-max problem and inevitably doubles the training time. In this paper, we resort to filter-wise random weight perturbations (RWP) to decouple the nested gradients in SAM. Different from the small adversarial perturbations in SAM, RWP is softer and allows a much larger magnitude of perturbations. Specifically, we jointly optimize the loss function with random perturbations and the original loss function: the former guides the network towards a wider flat region while the latter helps recover the necessary local information. These two loss terms are complementary to each other and mutually independent. Hence, the corresponding gradients can be efficiently computed in parallel, enabling nearly the same training speed as regular training. As a result, we achieve very competitive performance on CIFAR and remarkably better performance on ImageNet (e.g. $\mathbf{ +1.1\%}$) compared with SAM, but always require half of the training time. The code is released at https://github.com/nblt/RWP.
Self-Ensemble Protection: Training Checkpoints Are Good Data ProtectorsSizhe Chen, Geng Yuan, Xinwen Cheng et al.
As data becomes increasingly vital, a company would be very cautious about releasing data, because the competitors could use it to train high-performance models, thereby posing a tremendous threat to the company's commercial competence. To prevent training good models on the data, we could add imperceptible perturbations to it. Since such perturbations aim at hurting the entire training process, they should reflect the vulnerability of DNN training, rather than that of a single model. Based on this new idea, we seek perturbed examples that are always unrecognized (never correctly classified) in training. In this paper, we uncover them by model checkpoints' gradients, forming the proposed self-ensemble protection (SEP), which is very effective because (1) learning on examples ignored during normal training tends to yield DNNs ignoring normal examples; (2) checkpoints' cross-model gradients are close to orthogonal, meaning that they are as diverse as DNNs with different architectures. That is, our amazing performance of ensemble only requires the computation of training one model. By extensive experiments with 9 baselines on 3 datasets and 5 architectures, SEP is verified to be a new state-of-the-art, e.g., our small $\ell_\infty=2/255$ perturbations reduce the accuracy of a CIFAR-10 ResNet18 from 94.56% to 14.68%, compared to 41.35% by the best-known method. Code is available at https://github.com/Sizhe-Chen/SEP.
11.1LGJun 18, 2022
Piecewise Linear Neural Networks and Deep LearningQinghua Tao, Li Li, Xiaolin Huang et al.
As a powerful modelling method, PieceWise Linear Neural Networks (PWLNNs) have proven successful in various fields, most recently in deep learning. To apply PWLNN methods, both the representation and the learning have long been studied. In 1977, the canonical representation pioneered the works of shallow PWLNNs learned by incremental designs, but the applications to large-scale data were prohibited. In 2010, the Rectified Linear Unit (ReLU) advocated the prevalence of PWLNNs in deep learning. Ever since, PWLNNs have been successfully applied to extensive tasks and achieved advantageous performances. In this Primer, we systematically introduce the methodology of PWLNNs by grouping the works into shallow and deep networks. Firstly, different PWLNN representation models are constructed with elaborated examples. With PWLNNs, the evolution of learning algorithms for data is presented and fundamental theoretical analysis follows up for in-depth understandings. Then, representative applications are introduced together with discussions and outlooks.
Unifying Gradients to Improve Real-world Robustness for Deep NetworksYingwen Wu, Sizhe Chen, Kun Fang et al.
The wide application of deep neural networks (DNNs) demands an increasing amount of attention to their real-world robustness, i.e., whether a DNN resists black-box adversarial attacks, among which score-based query attacks (SQAs) are most threatening since they can effectively hurt a victim network with the only access to model outputs. Defending against SQAs requires a slight but artful variation of outputs due to the service purpose for users, who share the same output information with SQAs. In this paper, we propose a real-world defense by Unifying Gradients (UniG) of different data so that SQAs could only probe a much weaker attack direction that is similar for different samples. Since such universal attack perturbations have been validated as less aggressive than the input-specific perturbations, UniG protects real-world DNNs by indicating attackers a twisted and less informative attack direction. We implement UniG efficiently by a Hadamard product module which is plug-and-play. According to extensive experiments on 5 SQAs, 2 adaptive attacks and 7 defense baselines, UniG significantly improves real-world robustness without hurting clean accuracy on CIFAR10 and ImageNet. For instance, UniG maintains a model of 77.80% accuracy under 2500-query Square attack while the state-of-the-art adversarially-trained model only has 67.34% on CIFAR10. Simultaneously, UniG outperforms all compared baselines in terms of clean accuracy and achieves the smallest modification of the model output. The code is released at https://github.com/snowien/UniG-pytorch.
One-Pixel Shortcut: on the Learning Preference of Deep Neural NetworksShutong Wu, Sizhe Chen, Cihang Xie et al.
Unlearnable examples (ULEs) aim to protect data from unauthorized usage for training DNNs. Existing work adds $\ell_\infty$-bounded perturbations to the original sample so that the trained model generalizes poorly. Such perturbations, however, are easy to eliminate by adversarial training and data augmentations. In this paper, we resolve this problem from a novel perspective by perturbing only one pixel in each image. Interestingly, such a small modification could effectively degrade model accuracy to almost an untrained counterpart. Moreover, our produced \emph{One-Pixel Shortcut (OPS)} could not be erased by adversarial training and strong augmentations. To generate OPS, we perturb in-class images at the same position to the same target value that could mostly and stably deviate from all the original images. Since such generation is only based on images, OPS needs significantly less computation cost than the previous methods using DNN generators. Based on OPS, we introduce an unlearnable dataset called CIFAR-10-S, which is indistinguishable from CIFAR-10 by humans but induces the trained model to extremely low accuracy. Even under adversarial training, a ResNet-18 trained on CIFAR-10-S has only 10.61% accuracy, compared to 83.02% by the existing error-minimizing method.
Enhancing Kernel Flexibility via Learning Asymmetric Locally-Adaptive KernelsFan He, Mingzhen He, Lei Shi et al.
The lack of sufficient flexibility is the key bottleneck of kernel-based learning that relies on manually designed, pre-given, and non-trainable kernels. To enhance kernel flexibility, this paper introduces the concept of Locally-Adaptive-Bandwidths (LAB) as trainable parameters to enhance the Radial Basis Function (RBF) kernel, giving rise to the LAB RBF kernel. The parameters in LAB RBF kernels are data-dependent, and its number can increase with the dataset, allowing for better adaptation to diverse data patterns and enhancing the flexibility of the learned function. This newfound flexibility also brings challenges, particularly with regards to asymmetry and the need for an efficient learning algorithm. To address these challenges, this paper for the first time establishes an asymmetric kernel ridge regression framework and introduces an iterative kernel learning algorithm. This novel approach not only reduces the demand for extensive support data but also significantly improves generalization by training bandwidths on the available training data. Experimental results on real datasets underscore the remarkable performance of the proposed algorithm, showcasing its superior capability in handling large-scale datasets compared to Nyström approximation-based algorithms. Moreover, it demonstrates a significant improvement in regression accuracy over existing kernel-based learning methods and even surpasses residual neural networks.
Learning Scalable Model Soup on a Single GPU: An Efficient Subspace Training StrategyTao Li, Weisen Jiang, Fanghui Liu et al.
Pre-training followed by fine-tuning is widely adopted among practitioners. The performance can be improved by "model soups"~\cite{wortsman2022model} via exploring various hyperparameter configurations.The Learned-Soup, a variant of model soups, significantly improves the performance but suffers from substantial memory and time costs due to the requirements of (i) having to load all fine-tuned models simultaneously, and (ii) a large computational graph encompassing all fine-tuned models. In this paper, we propose Memory Efficient Hyperplane Learned Soup (MEHL-Soup) to tackle this issue by formulating the learned soup as a hyperplane optimization problem and introducing block coordinate gradient descent to learn the mixing coefficients. At each iteration, MEHL-Soup only needs to load a few fine-tuned models and build a computational graph with one combined model. We further extend MEHL-Soup to MEHL-Soup+ in a layer-wise manner. Experimental results on various ViT models and data sets show that MEHL-Soup(+) outperforms Learned-Soup(+) in terms of test accuracy, and also reduces memory usage by more than $13\times$. Moreover, MEHL-Soup(+) can be run on a single GPU and achieves $9\times$ speed up in soup construction compared with the Learned-Soup. The code is released at https://github.com/nblt/MEHL-Soup.
Trainable Weight Averaging: Accelerating Training and Improving GeneralizationTao Li, Zhehao Huang, Yingwen Wu et al.
Weight averaging is a widely used technique for accelerating training and improving the generalization of deep neural networks (DNNs). While existing approaches like stochastic weight averaging (SWA) rely on pre-set weighting schemes, they can be suboptimal when handling diverse weights. We introduce Trainable Weight Averaging (TWA), a novel optimization method that operates within a reduced subspace spanned by candidate weights and learns optimal weighting coefficients through optimization. TWA offers greater flexibility and can be applied to different training scenarios. For large-scale applications, we develop a distributed training framework that combines parallel computation with low-bit compression for the projection matrix, effectively managing memory and computational demands. TWA can be implemented using either training data (TWA-t) or validation data (TWA-v), with the latter providing more effective averaging. Extensive experiments showcase TWA's advantages: (i) it consistently outperforms SWA in generalization performance and flexibility, (ii) when applied during early training, it reduces training time by over 40\% on CIFAR datasets and 30\% on ImageNet while maintaining comparable performance, and (iii) during fine-tuning, it significantly enhances generalization by weighted averaging of model checkpoints. In summary, we present an efficient and effective framework for trainable weight averaging. The code is available at https://github.com/nblt/TWA.
3.7CVSep 27, 2022
FG-UAP: Feature-Gathering Universal Adversarial PerturbationZhixing Ye, Xinwen Cheng, Xiaolin Huang
Deep Neural Networks (DNNs) are susceptible to elaborately designed perturbations, whether such perturbations are dependent or independent of images. The latter one, called Universal Adversarial Perturbation (UAP), is very attractive for model robustness analysis, since its independence of input reveals the intrinsic characteristics of the model. Relatively, another interesting observation is Neural Collapse (NC), which means the feature variability may collapse during the terminal phase of training. Motivated by this, we propose to generate UAP by attacking the layer where NC phenomenon happens. Because of NC, the proposed attack could gather all the natural images' features to its surrounding, which is hence called Feature-Gathering UAP (FG-UAP). We evaluate the effectiveness our proposed algorithm on abundant experiments, including untargeted and targeted universal attacks, attacks under limited dataset, and transfer-based black-box attacks among different architectures including Vision Transformers, which are believed to be more robust. Furthermore, we investigate FG-UAP in the view of NC by analyzing the labels and extracted features of adversarial examples, finding that collapse phenomenon becomes stronger after the model is corrupted. The code will be released when the paper is accepted.
7.8OCApr 22
Robust Out-of-Distribution Stochastic OptimizationXianyu Li, Huan Xu, Xiaolin Huang et al.
Data-driven decision-making under uncertainty typically presumes the collection of historical data from an unknown target probability distribution. However, one may have no access to any data from the target distribution prior to decision-making. To address this challenge, we propose robust out-of-distribution stochastic optimization, a novel data-driven framework that effectively utilizes relevant data distributions for robust decision-making under unseen distributions. A key feature of our framework is that all data distributions are assumed to be randomly generated from a meta-distribution over distributions. To describe uncertainty in distribution generation, we propose to learn a data-driven uncertainty set in a reproducing kernel Hilbert space (RKHS) from relevant data distributions, with adjustable conservatism. We then incorporate this set into a min-max stochastic program to derive robust decisions. Notably, under randomness of distribution generation, we establish rigorous out-of-distribution generalization guarantees for the uncertainty set as well as the solution. To ease problem-solving in RKHS, an approximate parametrization with a provably bounded suboptimality and a row generation strategy are presented. Extensive numerical experiments on multi-item newsvendor and portfolio optimization demonstrate the superior out-of-distribution performance of our decision-making framework under unseen data distribution, even when only a small or moderate number of relevant sources are available.
Revisiting Deep Ensemble for Out-of-Distribution Detection: A Loss Landscape PerspectiveKun Fang, Qinghua Tao, Xiaolin Huang et al.
Existing Out-of-Distribution (OoD) detection methods address to detect OoD samples from In-Distribution (InD) data mainly by exploring differences in features, logits and gradients in Deep Neural Networks (DNNs). We in this work propose a new perspective upon loss landscape and mode ensemble to investigate OoD detection. In the optimization of DNNs, there exist many local optima in the parameter space, or namely modes. Interestingly, we observe that these independent modes, which all reach low-loss regions with InD data (training and test data), yet yield significantly different loss landscapes with OoD data. Such an observation provides a novel view to investigate the OoD detection from the loss landscape, and further suggests significantly fluctuating OoD detection performance across these modes. For instance, FPR values of the RankFeat method can range from 46.58% to 84.70% among 5 modes, showing uncertain detection performance evaluations across independent modes. Motivated by such diversities on OoD loss landscape across modes, we revisit the deep ensemble method for OoD detection through mode ensemble, leading to improved performance and benefiting the OoD detector with reduced variances. Extensive experiments covering varied OoD detectors and network structures illustrate high variances across modes and validate the superiority of mode ensemble in boosting OoD detection. We hope this work could attract attention in the view of independent modes in the loss landscape of OoD data and more reliable evaluations on OoD detectors.
9.8LGFeb 23, 2023
Investigating Catastrophic Overfitting in Fast Adversarial Training: A Self-fitting PerspectiveZhengbao He, Tao Li, Sizhe Chen et al.
Although fast adversarial training provides an efficient approach for building robust networks, it may suffer from a serious problem known as catastrophic overfitting (CO), where multi-step robust accuracy suddenly collapses to zero. In this paper, we for the first time decouple single-step adversarial examples into data-information and self-information, which reveals an interesting phenomenon called "self-fitting". Self-fitting, i.e., the network learns the self-information embedded in single-step perturbations, naturally leads to the occurrence of CO. When self-fitting occurs, the network experiences an obvious "channel differentiation" phenomenon that some convolution channels accounting for recognizing self-information become dominant, while others for data-information are suppressed. In this way, the network can only recognize images with sufficient self-information and loses generalization ability to other types of data. Based on self-fitting, we provide new insights into the existing methods to mitigate CO and extend CO to multi-step adversarial training. Our findings reveal a self-learning mechanism in adversarial training and open up new perspectives for suppressing different kinds of information to mitigate CO.
6.8CVOct 26, 2023
Low-Dimensional Gradient Helps Out-of-Distribution DetectionYingwen Wu, Tao Li, Xinwen Cheng et al.
Detecting out-of-distribution (OOD) samples is essential for ensuring the reliability of deep neural networks (DNNs) in real-world scenarios. While previous research has predominantly investigated the disparity between in-distribution (ID) and OOD data through forward information analysis, the discrepancy in parameter gradients during the backward process of DNNs has received insufficient attention. Existing studies on gradient disparities mainly focus on the utilization of gradient norms, neglecting the wealth of information embedded in gradient directions. To bridge this gap, in this paper, we conduct a comprehensive investigation into leveraging the entirety of gradient information for OOD detection. The primary challenge arises from the high dimensionality of gradients due to the large number of network parameters. To solve this problem, we propose performing linear dimension reduction on the gradient using a designated subspace that comprises principal components. This innovative technique enables us to obtain a low-dimensional representation of the gradient with minimal information loss. Subsequently, by integrating the reduced gradient with various existing detection score functions, our approach demonstrates superior performance across a wide range of detection tasks. For instance, on the ImageNet benchmark with ResNet50 model, our method achieves an average reduction of 11.15$\%$ in the false positive rate at 95$\%$ recall (FPR95) compared to the current state-of-the-art approach. The code would be released.
Multi-head Ensemble of Smoothed Classifiers for Certified RobustnessKun Fang, Qinghua Tao, Yingwen Wu et al.
Randomized Smoothing (RS) is a promising technique for certified robustness, and recently in RS the ensemble of multiple Deep Neural Networks (DNNs) has shown state-of-the-art performances due to its variance reduction effect over Gaussian noises. However, such an ensemble brings heavy computation burdens in both training and certification, and yet under-exploits individual DNNs and their mutual effects, as the communication between these classifiers is commonly ignored in optimization. In this work, we consider a novel ensemble-based training way for a single DNN with multiple augmented heads, named as SmOothed Multi-head Ensemble (SOME). In SOME, similar to the pursuit of variance reduction via ensemble, an ensemble of multiple heads imposed with a cosine constraint inside a single DNN is employed with much cheaper training and certification computation overloads in RS. In such network structure, an associated training strategy is designed by introducing a circular communication flow among those augmented heads. That is, each head teaches its neighbor with the self-paced learning strategy using smoothed losses, which are specifically designed in relation to certified robustness. The deployed multi-head structure and the circular-teaching scheme in SOME jointly contribute to the diversities among multiple heads and benefit their ensemble, leading to a competitively stronger certifiably-robust RS-based defense than ensembling multiple DNNs (effectiveness) at the cost of much less computational expenses (efficiency), verified by extensive experiments and discussions.
26.0LGSep 29, 2024
Unified Gradient-Based Machine Unlearning with Remain Geometry EnhancementZhehao Huang, Xinwen Cheng, JingHao Zheng et al.
Machine unlearning (MU) has emerged to enhance the privacy and trustworthiness of deep neural networks. Approximate MU is a practical method for large-scale models. Our investigation into approximate MU starts with identifying the steepest descent direction, minimizing the output Kullback-Leibler divergence to exact MU inside a parameters' neighborhood. This probed direction decomposes into three components: weighted forgetting gradient ascent, fine-tuning retaining gradient descent, and a weight saliency matrix. Such decomposition derived from Euclidean metric encompasses most existing gradient-based MU methods. Nevertheless, adhering to Euclidean space may result in sub-optimal iterative trajectories due to the overlooked geometric structure of the output probability space. We suggest embedding the unlearning update into a manifold rendered by the remaining geometry, incorporating second-order Hessian from the remaining data. It helps prevent effective unlearning from interfering with the retained performance. However, computing the second-order Hessian for large-scale models is intractable. To efficiently leverage the benefits of Hessian modulation, we propose a fast-slow parameter update strategy to implicitly approximate the up-to-date salient unlearning direction. Free from specific modal constraints, our approach is adaptable across computer vision unlearning tasks, including classification and generation. Extensive experiments validate our efficacy and efficiency. Notably, our method successfully performs class-forgetting on ImageNet using DiT and forgets a class on CIFAR-10 using DDPM in just 50 steps, compared to thousands of steps required by previous methods.
8.7LGSep 18, 2022
Random Fourier Features for Asymmetric KernelsMingzhen He, Fan He, Fanghui Liu et al.
The random Fourier features (RFFs) method is a powerful and popular technique in kernel approximation for scalability of kernel methods. The theoretical foundation of RFFs is based on the Bochner theorem that relates symmetric, positive definite (PD) functions to probability measures. This condition naturally excludes asymmetric functions with a wide range applications in practice, e.g., directed graphs, conditional probability, and asymmetric kernels. Nevertheless, understanding asymmetric functions (kernels) and its scalability via RFFs is unclear both theoretically and empirically. In this paper, we introduce a complex measure with the real and imaginary parts corresponding to four finite positive measures, which expands the application scope of the Bochner theorem. By doing so, this framework allows for handling classical symmetric, PD kernels via one positive measure; symmetric, non-positive definite kernels via signed measures; and asymmetric kernels via complex measures, thereby unifying them into a general framework by RFFs, named AsK-RFFs. Such approximation scheme via complex measures enjoys theoretical guarantees in the perspective of the uniform convergence. In algorithmic implementation, to speed up the kernel approximation process, which is expensive due to the calculation of total mass, we employ a subset-based fast estimation method that optimizes total masses on a sub-training set, which enjoys computational efficiency in high dimensions. Our AsK-RFFs method is empirically validated on several typical large-scale datasets and achieves promising kernel approximation performance, which demonstrate the effectiveness of AsK-RFFs.
PCR-CG: Point Cloud Registration via Deep Explicit Color and GeometryYu Zhang, Junle Yu, Xiaolin Huang et al.
In this paper, we introduce PCR-CG: a novel 3D point cloud registration module explicitly embedding the color signals into the geometry representation. Different from previous methods that only use geometry representation, our module is specifically designed to effectively correlate color into geometry for the point cloud registration task. Our key contribution is a 2D-3D cross-modality learning algorithm that embeds the deep features learned from color signals to the geometry representation. With our designed 2D-3D projection module, the pixel features in a square region centered at correspondences perceived from images are effectively correlated with point clouds. In this way, the overlapped regions can be inferred not only from point cloud but also from the texture appearances. Adding color is non-trivial. We compare against a variety of baselines designed for adding color to 3D, such as exhaustively adding per-pixel features or RGB values in an implicit manner. We leverage Predator [25] as the baseline method and incorporate our proposed module onto it. To validate the effectiveness of 2D features, we ablate different 2D pre-trained networks and show a positive correlation between the pre-trained weights and the task performance. Our experimental results indicate a significant improvement of 6.5% registration recall over the baseline method on the 3DLoMatch benchmark. We additionally evaluate our approach on SOTA methods and observe consistent improvements, such as an improvement of 2.4% registration recall over GeoTransformer as well as 3.5% over CoFiNet. Our study reveals a significant advantages of correlating explicit deep color features to the point cloud in the registration task.
Self-similarity Prior Distillation for Unsupervised Remote Physiological MeasurementXinyu Zhang, Weiyu Sun, Hao Lu et al.
Remote photoplethysmography (rPPG) is a noninvasive technique that aims to capture subtle variations in facial pixels caused by changes in blood volume resulting from cardiac activities. Most existing unsupervised methods for rPPG tasks focus on the contrastive learning between samples while neglecting the inherent self-similar prior in physiological signals. In this paper, we propose a Self-Similarity Prior Distillation (SSPD) framework for unsupervised rPPG estimation, which capitalizes on the intrinsic self-similarity of cardiac activities. Specifically, we first introduce a physical-prior embedded augmentation technique to mitigate the effect of various types of noise. Then, we tailor a self-similarity-aware network to extract more reliable self-similar physiological features. Finally, we develop a hierarchical self-distillation paradigm to assist the network in disentangling self-similar physiological patterns from facial videos. Comprehensive experiments demonstrate that the unsupervised SSPD framework achieves comparable or even superior performance compared to the state-of-the-art supervised methods. Meanwhile, SSPD maintains the lowest inference time and computation cost among end-to-end models.
Online Continual Learning via Logit Adjusted SoftmaxZhehao Huang, Tao Li, Chenhe Yuan et al.
Online continual learning is a challenging problem where models must learn from a non-stationary data stream while avoiding catastrophic forgetting. Inter-class imbalance during training has been identified as a major cause of forgetting, leading to model prediction bias towards recently learned classes. In this paper, we theoretically analyze that inter-class imbalance is entirely attributed to imbalanced class-priors, and the function learned from intra-class intrinsic distributions is the Bayes-optimal classifier. To that end, we present that a simple adjustment of model logits during training can effectively resist prior class bias and pursue the corresponding Bayes-optimum. Our proposed method, Logit Adjusted Softmax, can mitigate the impact of inter-class imbalance not only in class-incremental but also in realistic general setups, with little additional computational cost. We evaluate our approach on various benchmarks and demonstrate significant performance improvements compared to prior arts. For example, our approach improves the best baseline by 4.6% on CIFAR10.
2.6CVJul 4, 2022
BYHE: A Simple Framework for Boosting End-to-end Video-based Heart Rate Measurement NetworkWeiyu Sun, Xinyu Zhang, Ying Chen et al.
Heart rate measuring based on remote photoplethysmography (rPPG) plays an important role in health caring, which estimates heart rate from facial video in a non-contact, less-constrained way. End-to-end neural network is a main branch of rPPG-based heart rate estimation methods, whose trait is recovering rPPG signal containing sufficient heart rate message from original facial video directly. However, there exists some easily neglected problems on relevant datasets which thwarting the efficient training of end-to-end methods, such as uncertain temporal delay and indefinite envelope shape of label waves. Although many novel and powerful networks are proposed, hitherto there are no systematic research digging into these problems. In this paper, from perspective of common intrinsic rhythm periodical self-similarity results from cardiac activities, we propose a comprehensive methodology, Boost Your Heartbeat Estimation (BYHE), including new label representations, corresponding network adjustments and loss functions. BYHE can be easily grafted on current end-to-end network and boost its training efficiency. By applying our methodology, we can save tremendous time without conducting laborious handworks, such as label wave alignment which is necessary for previous end-to-end methods, and meanwhile enhance the utilization on datasets. According to our experiments, BYHE can leverage classical end-to-end network to reach competitive performance against those state-of-the-art methods on mostly used datasets. Such improvement indicates selecting perspicuous and efficient label representation is also a promising direction towards better remote physiological signal measurement.
Revisiting Random Weight Perturbation for Efficiently Improving GeneralizationTao Li, Qinghua Tao, Weihao Yan et al.
Improving the generalization ability of modern deep neural networks (DNNs) is a fundamental challenge in machine learning. Two branches of methods have been proposed to seek flat minima and improve generalization: one led by sharpness-aware minimization (SAM) minimizes the worst-case neighborhood loss through adversarial weight perturbation (AWP), and the other minimizes the expected Bayes objective with random weight perturbation (RWP). While RWP offers advantages in computation and is closely linked to AWP on a mathematical basis, its empirical performance has consistently lagged behind that of AWP. In this paper, we revisit the use of RWP for improving generalization and propose improvements from two perspectives: i) the trade-off between generalization and convergence and ii) the random perturbation generation. Through extensive experimental evaluations, we demonstrate that our enhanced RWP methods achieve greater efficiency in enhancing generalization, particularly in large-scale problems, while also offering comparable or even superior performance to SAM. The code is released at https://github.com/nblt/mARWP.
5.0CVMar 26, 2023
Multi-Frame Self-Supervised Depth Estimation with Multi-Scale Feature Fusion in Dynamic ScenesJiquan Zhong, Xiaolin Huang, Xiao Yu
Multi-frame methods improve monocular depth estimation over single-frame approaches by aggregating spatial-temporal information via feature matching. However, the spatial-temporal feature leads to accuracy degradation in dynamic scenes. To enhance the performance, recent methods tend to propose complex architectures for feature matching and dynamic scenes. In this paper, we show that a simple learning framework, together with designed feature augmentation, leads to superior performance. (1) A novel dynamic objects detecting method with geometry explainability is proposed. The detected dynamic objects are excluded during training, which guarantees the static environment assumption and relieves the accuracy degradation problem of the multi-frame depth estimation. (2) Multi-scale feature fusion is proposed for feature matching in the multi-frame depth network, which improves feature matching, especially between frames with large camera motion. (3) The robust knowledge distillation with a robust teacher network and reliability guarantee is proposed, which improves the multi-frame depth estimation without computation complexity increase during the test. The experiments show that our proposed methods achieve great performance improvement on the multi-frame depth estimation.
Friendly Sharpness-Aware MinimizationTao Li, Pan Zhou, Zhengbao He et al.
Sharpness-Aware Minimization (SAM) has been instrumental in improving deep neural network training by minimizing both training loss and loss sharpness. Despite the practical success, the mechanisms behind SAM's generalization enhancements remain elusive, limiting its progress in deep learning optimization. In this work, we investigate SAM's core components for generalization improvement and introduce "Friendly-SAM" (F-SAM) to further enhance SAM's generalization. Our investigation reveals the key role of batch-specific stochastic gradient noise within the adversarial perturbation, i.e., the current minibatch gradient, which significantly influences SAM's generalization performance. By decomposing the adversarial perturbation in SAM into full gradient and stochastic gradient noise components, we discover that relying solely on the full gradient component degrades generalization while excluding it leads to improved performance. The possible reason lies in the full gradient component's increase in sharpness loss for the entire dataset, creating inconsistencies with the subsequent sharpness minimization step solely on the current minibatch data. Inspired by these insights, F-SAM aims to mitigate the negative effects of the full gradient component. It removes the full gradient estimated by an exponentially moving average (EMA) of historical stochastic gradients, and then leverages stochastic gradient noise for improved generalization. Moreover, we provide theoretical validation for the EMA approximation and prove the convergence of F-SAM on non-convex problems. Extensive experiments demonstrate the superior generalization performance and robustness of F-SAM over vanilla SAM. Code is available at https://github.com/nblt/F-SAM.
Subspace Adversarial TrainingTao Li, Yingwen Wu, Sizhe Chen et al.
Single-step adversarial training (AT) has received wide attention as it proved to be both efficient and robust. However, a serious problem of catastrophic overfitting exists, i.e., the robust accuracy against projected gradient descent (PGD) attack suddenly drops to 0% during the training. In this paper, we approach this problem from a novel perspective of optimization and firstly reveal the close link between the fast-growing gradient of each sample and overfitting, which can also be applied to understand robust overfitting in multi-step AT. To control the growth of the gradient, we propose a new AT method, Subspace Adversarial Training (Sub-AT), which constrains AT in a carefully extracted subspace. It successfully resolves both kinds of overfitting and significantly boosts the robustness. In subspace, we also allow single-step AT with larger steps and larger radius, further improving the robustness performance. As a result, we achieve state-of-the-art single-step AT performance. Without any regularization term, our single-step AT can reach over 51% robust accuracy against strong PGD-50 attack of radius 8/255 on CIFAR-10, reaching a competitive performance against standard multi-step PGD-10 AT with huge computational advantages. The code is released at https://github.com/nblt/Sub-AT.
A Generalized Framework for Edge-preserving and Structure-preserving Image SmoothingWei Liu, Pingping Zhang, Yinjie Lei et al.
Image smoothing is a fundamental procedure in applications of both computer vision and graphics. The required smoothing properties can be different or even contradictive among different tasks. Nevertheless, the inherent smoothing nature of one smoothing operator is usually fixed and thus cannot meet the various requirements of different applications. In this paper, we first introduce the truncated Huber penalty function which shows strong flexibility under different parameter settings. A generalized framework is then proposed with the introduced truncated Huber penalty function. When combined with its strong flexibility, our framework is able to achieve diverse smoothing natures where contradictive smoothing behaviors can even be achieved. It can also yield the smoothing behavior that can seldom be achieved by previous methods, and superior performance is thus achieved in challenging cases. These together enable our framework capable of a range of applications and able to outperform the state-of-the-art approaches in several tasks, such as image detail enhancement, clip-art compression artifacts removal, guided depth map restoration, image texture removal, etc. In addition, an efficient numerical solution is provided and its convergence is theoretically guaranteed even the optimization framework is non-convex and non-smooth. A simple yet effective approach is further proposed to reduce the computational cost of our method while maintaining its performance. The effectiveness and superior performance of our approach are validated through comprehensive experiments in a range of applications. Our code is available at https://github.com/wliusjtu/Generalized-Smoothing-Framework.
Query Attack by Multi-Identity SurrogatesSizhe Chen, Zhehao Huang, Qinghua Tao et al.
Deep Neural Networks (DNNs) are acknowledged as vulnerable to adversarial attacks, while the existing black-box attacks require extensive queries on the victim DNN to achieve high success rates. For query-efficiency, surrogate models of the victim are used to generate transferable Adversarial Examples (AEs) because of their Gradient Similarity (GS), i.e., surrogates' attack gradients are similar to the victim's ones. However, it is generally neglected to exploit their similarity on outputs, namely the Prediction Similarity (PS), to filter out inefficient queries by surrogates without querying the victim. To jointly utilize and also optimize surrogates' GS and PS, we develop QueryNet, a unified attack framework that can significantly reduce queries. QueryNet creatively attacks by multi-identity surrogates, i.e., crafts several AEs for one sample by different surrogates, and also uses surrogates to decide on the most promising AE for the query. After that, the victim's query feedback is accumulated to optimize not only surrogates' parameters but also their architectures, enhancing both the GS and the PS. Although QueryNet has no access to pre-trained surrogates' prior, it reduces queries by averagely about an order of magnitude compared to alternatives within an acceptable time, according to our comprehensive experiments: 11 victims (including two commercial models) on MNIST/CIFAR10/ImageNet, allowing only 8-bit image queries, and no access to the victim's training data. The code is available at https://github.com/Sizhe-Chen/QueryNet.
Residual Enhanced Multi-Hypergraph Neural NetworkJing Huang, Xiaolin Huang, Jie Yang
Hypergraphs are a generalized data structure of graphs to model higher-order correlations among entities, which have been successfully adopted into various research domains. Meanwhile, HyperGraph Neural Network (HGNN) is currently the de-facto method for hypergraph representation learning. However, HGNN aims at single hypergraph learning and uses a pre-concatenation approach when confronting multi-modal datasets, which leads to sub-optimal exploitation of the inter-correlations of multi-modal hypergraphs. HGNN also suffers the over-smoothing issue, that is, its performance drops significantly when layers are stacked up. To resolve these issues, we propose the Residual enhanced Multi-Hypergraph Neural Network, which can not only fuse multi-modal information from each hypergraph effectively, but also circumvent the over-smoothing issue associated with HGNN. We conduct experiments on two 3D benchmarks, the NTU and the ModelNet40 datasets, and compare against multiple state-of-the-art methods. Experimental results demonstrate that both the residual hypergraph convolutions and the multi-fusion architecture can improve the performance of the base model and the combined model achieves a new state-of-the-art. Code is available at \url{https://github.com/OneForward/ResMHGNN}.
Online PCB Defect Detector On A New PCB Defect DatasetSanli Tang, Fan He, Xiaolin Huang et al.
Previous works for PCB defect detection based on image difference and image processing techniques have already achieved promising performance. However, they sometimes fall short because of the unaccounted defect patterns or over-sensitivity about some hyper-parameters. In this work, we design a deep model that accurately detects PCB defects from an input pair of a detect-free template and a defective tested image. A novel group pyramid pooling module is proposed to efficiently extract features of a large range of resolutions, which are merged by group to predict PCB defect of corresponding scales. To train the deep model, a dataset is established, namely DeepPCB, which contains 1,500 image pairs with annotations including positions of 6 common types of PCB defects. Experiment results validate the effectiveness and efficiency of the proposed model by achieving $98.6\%$ mAP @ 62 FPS on DeepPCB dataset. This dataset is now available at: https://github.com/tangsanli5201/DeepPCB.
1.4LGMar 1
Compensation-free Machine Unlearning in Text-to-Image Diffusion Models by Eliminating the Mutual InformationXinwen Cheng, Jingyuan Zhang, Zhehao Huang et al.
The powerful generative capabilities of diffusion models have raised growing privacy and safety concerns regarding generating sensitive or undesired content. In response, machine unlearning (MU) -- commonly referred to as concept erasure (CE) in diffusion models -- has been introduced to remove specific knowledge from model parameters meanwhile preserving innocent knowledge. Despite recent advancements, existing unlearning methods often suffer from excessive and indiscriminate removal, which leads to substantial degradation in the quality of innocent generations. To preserve model utility, prior works rely on compensation, i.e., re-assimilating a subset of the remaining data or explicitly constraining the divergence from the pre-trained model on remaining concepts. However, we reveal that generations beyond the compensation scope still suffer, suggesting such post-remedial compensations are inherently insufficient for preserving the general utility of large-scale generative models. Therefore, in this paper, we advocate for developing compensation-free concept erasure operations, which precisely identify and eliminate the undesired knowledge such that the impact on other generations is minimal. In technique, we propose to MiM-MU, which is to unlearn a concept by minimizing the mutual information with a delicate design for computational effectiveness and for maintaining sampling distribution for other concepts. Extensive evaluations demonstrate that our proposed method achieves effective concept removal meanwhile maintaining high-quality generations for other concepts, and remarkably, without relying on any post-remedial compensation for the first time.
Kernel PCA for Out-of-Distribution DetectionKun Fang, Qinghua Tao, Kexin Lv et al.
Out-of-Distribution (OoD) detection is vital for the reliability of Deep Neural Networks (DNNs). Existing works have shown the insufficiency of Principal Component Analysis (PCA) straightforwardly applied on the features of DNNs in detecting OoD data from In-Distribution (InD) data. The failure of PCA suggests that the network features residing in OoD and InD are not well separated by simply proceeding in a linear subspace, which instead can be resolved through proper non-linear mappings. In this work, we leverage the framework of Kernel PCA (KPCA) for OoD detection, and seek suitable non-linear kernels that advocate the separability between InD and OoD data in the subspace spanned by the principal components. Besides, explicit feature mappings induced from the devoted task-specific kernels are adopted so that the KPCA reconstruction error for new test samples can be efficiently obtained with large-scale data. Extensive theoretical and empirical results on multiple OoD data sets and network structures verify the superiority of our KPCA detector in efficiency and efficacy with state-of-the-art detection performance.
14.2LGMay 24, 2024
Towards Natural Machine UnlearningZhengbao He, Tao Li, Xinwen Cheng et al.
Machine unlearning (MU) aims to eliminate information that has been learned from specific training data, namely forgetting data, from a pre-trained model. Currently, the mainstream of existing MU methods involves modifying the forgetting data with incorrect labels and subsequently fine-tuning the model. While learning such incorrect information can indeed remove knowledge, the process is quite unnatural as the unlearning process undesirably reinforces the incorrect information and leads to over-forgetting. Towards more \textit{natural} machine unlearning, we inject correct information from the remaining data to the forgetting samples when changing their labels. Through pairing these adjusted samples with their labels, the model will tend to use the injected correct information and naturally suppress the information meant to be forgotten. Albeit straightforward, such a first step towards natural machine unlearning can significantly outperform current state-of-the-art approaches. In particular, our method substantially reduces the over-forgetting and leads to strong robustness to hyperparameters, making it a promising candidate for practical machine unlearning.
Decentralized Kernel Ridge Regression Based on Data-Dependent Random FeatureRuikai Yang, Fan He, Mingzhen He et al.
Random feature (RF) has been widely used for node consistency in decentralized kernel ridge regression (KRR). Currently, the consistency is guaranteed by imposing constraints on coefficients of features, necessitating that the random features on different nodes are identical. However, in many applications, data on different nodes varies significantly on the number or distribution, which calls for adaptive and data-dependent methods that generate different RFs. To tackle the essential difficulty, we propose a new decentralized KRR algorithm that pursues consensus on decision functions, which allows great flexibility and well adapts data on nodes. The convergence is rigorously given and the effectiveness is numerically verified: by capturing the characteristics of the data on each node, while maintaining the same communication costs as other methods, we achieved an average regression accuracy improvement of 25.5\% across six real-world data sets.
Kernel PCA for Out-of-Distribution Detection: Non-Linear Kernel Selections and ApproximationsKun Fang, Qinghua Tao, Mingzhen He et al.
Out-of-Distribution (OoD) detection is vital for the reliability of deep neural networks, the key of which lies in effectively characterizing the disparities between OoD and In-Distribution (InD) data. In this work, such disparities are exploited through a fresh perspective of non-linear feature subspace. That is, a discriminative non-linear subspace is learned from InD features to capture representative patterns of InD, while informative patterns of OoD features cannot be well captured in such a subspace due to their different distribution. Grounded on this perspective, we exploit the deviations of InD and OoD features in such a non-linear subspace for effective OoD detection. To be specific, we leverage the framework of Kernel Principal Component Analysis (KPCA) to attain the discriminative non-linear subspace and deploy the reconstruction error on such subspace to distinguish InD and OoD data. Two challenges emerge: (i) the learning of an effective non-linear subspace, i.e., the selection of kernel function in KPCA, and (ii) the computation of the kernel matrix with large-scale InD data. For the former, we reveal two vital non-linear patterns that closely relate to the InD-OoD disparity, leading to the establishment of a Cosine-Gaussian kernel for constructing the subspace. For the latter, we introduce two techniques to approximate the Cosine-Gaussian kernel with significantly cheap computations. In particular, our approximation is further tailored by incorporating the InD data confidence, which is demonstrated to promote the learning of discriminative subspaces for OoD data. Our study presents new insights into the non-linear feature subspace for OoD detection and contributes practical explorations on the associated kernel design and efficient computations, yielding a KPCA detection method with distinctively improved efficacy and efficiency.
13.0LGMay 21, 2025
A Unified Gradient-based Framework for Task-agnostic Continual Learning-UnlearningZhehao Huang, Xinwen Cheng, Jie Zhang et al.
Recent advancements in deep models have highlighted the need for intelligent systems that combine continual learning (CL) for knowledge acquisition with machine unlearning (MU) for data removal, forming the Continual Learning-Unlearning (CLU) paradigm. While existing work treats CL and MU as separate processes, we reveal their intrinsic connection through a unified optimization framework based on Kullback-Leibler divergence minimization. This framework decomposes gradient updates for approximate CLU into four components: learning new knowledge, unlearning targeted data, preserving existing knowledge, and modulation via weight saliency. A critical challenge lies in balancing knowledge update and retention during sequential learning-unlearning cycles. To resolve this stability-plasticity dilemma, we introduce a remain-preserved manifold constraint to induce a remaining Hessian compensation for CLU iterations. A fast-slow weight adaptation mechanism is designed to efficiently approximate the second-order optimization direction, combined with adaptive weighting coefficients and a balanced weight saliency mask, proposing a unified implementation framework for gradient-based CLU. Furthermore, we pioneer task-agnostic CLU scenarios that support fine-grained unlearning at the cross-task category and random sample levels beyond the traditional task-aware setups. Experiments demonstrate that the proposed UG-CLU framework effectively coordinates incremental learning, precise unlearning, and knowledge stability across multiple datasets and model architectures, providing a theoretical foundation and methodological support for dynamic, compliant intelligent systems.
Data Imputation by Pursuing Better Classification: A Supervised Kernel-Based MethodRuikai Yang, Fan He, Mingzhen He et al.
Data imputation, the process of filling in missing feature elements for incomplete data sets, plays a crucial role in data-driven learning. A fundamental belief is that data imputation is helpful for learning performance, and it follows that the pursuit of better classification can guide the data imputation process. While some works consider using label information to assist in this task, their simplistic utilization of labels lacks flexibility and may rely on strict assumptions. In this paper, we propose a new framework that effectively leverages supervision information to complete missing data in a manner conducive to classification. Specifically, this framework operates in two stages. Firstly, it leverages labels to supervise the optimization of similarity relationships among data, represented by the kernel matrix, with the goal of enhancing classification accuracy. To mitigate overfitting that may occur during this process, a perturbation variable is introduced to improve the robustness of the framework. Secondly, the learned kernel matrix serves as additional supervision information to guide data imputation through regression, utilizing the block coordinate descent method. The superiority of the proposed method is evaluated on four real-world data sets by comparing it with state-of-the-art imputation methods. Remarkably, our algorithm significantly outperforms other methods when the data is missing more than 60\% of the features
6.2CVMay 22, 2025
T2I-ConBench: Text-to-Image Benchmark for Continual Post-trainingZhehao Huang, Yuhang Liu, Yixin Lou et al.
Continual post-training adapts a single text-to-image diffusion model to learn new tasks without incurring the cost of separate models, but naive post-training causes forgetting of pretrained knowledge and undermines zero-shot compositionality. We observe that the absence of a standardized evaluation protocol hampers related research for continual post-training. To address this, we introduce T2I-ConBench, a unified benchmark for continual post-training of text-to-image models. T2I-ConBench focuses on two practical scenarios, item customization and domain enhancement, and analyzes four dimensions: (1) retention of generality, (2) target-task performance, (3) catastrophic forgetting, and (4) cross-task generalization. It combines automated metrics, human-preference modeling, and vision-language QA for comprehensive assessment. We benchmark ten representative methods across three realistic task sequences and find that no approach excels on all fronts. Even joint "oracle" training does not succeed for every task, and cross-task generalization remains unsolved. We release all datasets, code, and evaluation tools to accelerate research in continual post-training for text-to-image models.
Learning Analysis of Kernel Ridgeless Regression with Asymmetric Kernel LearningFan He, Mingzhen He, Lei Shi et al.
Ridgeless regression has garnered attention among researchers, particularly in light of the ``Benign Overfitting'' phenomenon, where models interpolating noisy samples demonstrate robust generalization. However, kernel ridgeless regression does not always perform well due to the lack of flexibility. This paper enhances kernel ridgeless regression with Locally-Adaptive-Bandwidths (LAB) RBF kernels, incorporating kernel learning techniques to improve performance in both experiments and theory. For the first time, we demonstrate that functions learned from LAB RBF kernels belong to an integral space of Reproducible Kernel Hilbert Spaces (RKHSs). Despite the absence of explicit regularization in the proposed model, its optimization is equivalent to solving an $\ell_0$-regularized problem in the integral space of RKHSs, elucidating the origin of its generalization ability. Taking an approximation analysis viewpoint, we introduce an $l_q$-norm analysis technique (with $0<q<1$) to derive the learning rate for the proposed model under mild conditions. This result deepens our theoretical understanding, explaining that our algorithm's robust approximation ability arises from the large capacity of the integral space of RKHSs, while its generalization ability is ensured by sparsity, controlled by the number of support vectors. Experimental results on both synthetic and real datasets validate our theoretical conclusions.
2.6LGMar 6, 2024
Inverse-Free Fast Natural Gradient Descent Method for Deep LearningXinwei Ou, Ce Zhu, Xiaolin Huang et al.
Second-order optimization techniques have the potential to achieve faster convergence rates compared to first-order methods through the incorporation of second-order derivatives or statistics. However, their utilization in deep learning is limited due to their computational inefficiency. Various approaches have been proposed to address this issue, primarily centered on minimizing the size of the matrix to be inverted. Nevertheless, the necessity of performing the inverse operation iteratively persists. In this work, we present a fast natural gradient descent (FNGD) method that only requires inversion during the first epoch. Specifically, it is revealed that natural gradient descent (NGD) is essentially a weighted sum of per-sample gradients. Our novel approach further proposes to share these weighted coefficients across epochs without affecting empirical performance. Consequently, FNGD exhibits similarities to the average sum in first-order methods, leading to the computational complexity of FNGD being comparable to that of first-order methods. Extensive experiments on image classification and machine translation tasks demonstrate the efficiency of the proposed FNGD. For training ResNet-18 on CIFAR-100, FNGD can achieve a speedup of 2.07$\times$ compared with KFAC. For training Transformer on Multi30K, FNGD outperforms AdamW by 24 BLEU score while requiring almost the same training time.
Learn What You Need in Personalized Federated LearningKexin Lv, Rui Ye, Xiaolin Huang et al.
Personalized federated learning aims to address data heterogeneity across local clients in federated learning. However, current methods blindly incorporate either full model parameters or predefined partial parameters in personalized federated learning. They fail to customize the collaboration manner according to each local client's data characteristics, causing unpleasant aggregation results. To address this essential issue, we propose $\textit{Learn2pFed}$, a novel algorithm-unrolling-based personalized federated learning framework, enabling each client to adaptively select which part of its local model parameters should participate in collaborative training. The key novelty of the proposed $\textit{Learn2pFed}$ is to optimize each local model parameter's degree of participant in collaboration as learnable parameters via algorithm unrolling methods. This approach brings two benefits: 1) mathmatically determining the participation degree of local model parameters in the federated collaboration, and 2) obtaining more stable and improved solutions. Extensive experiments on various tasks, including regression, forecasting, and image classification, demonstrate that $\textit{Learn2pFed}$ significantly outperforms previous personalized federated learning methods.
10.4LGFeb 3, 2022
Learning with Asymmetric Kernels: Least Squares and Feature InterpretationMingzhen He, Fan He, Lei Shi et al.
Asymmetric kernels naturally exist in real life, e.g., for conditional probability and directed graphs. However, most of the existing kernel-based learning methods require kernels to be symmetric, which prevents the use of asymmetric kernels. This paper addresses the asymmetric kernel-based learning in the framework of the least squares support vector machine named AsK-LS, resulting in the first classification method that can utilize asymmetric kernels directly. We will show that AsK-LS can learn with asymmetric features, namely source and target features, while the kernel trick remains applicable, i.e., the source and target features exist but are not necessarily known. Besides, the computational burden of AsK-LS is as cheap as dealing with symmetric kernels. Experimental results on the Corel database, directed graphs, and the UCI database will show that in the case asymmetric information is crucial, the proposed AsK-LS can learn with asymmetric kernels and performs much better than the existing kernel methods that have to do symmetrization to accommodate asymmetric kernels.
5.5LGSep 30, 2021
Semi-tensor Product-based TensorDecomposition for Neural Network CompressionHengling Zhao, Yipeng Liu, Xiaolin Huang et al.
The existing tensor networks adopt conventional matrix product for connection. The classical matrix product requires strict dimensionality consistency between factors, which can result in redundancy in data representation. In this paper, the semi-tensor product is used to generalize classical matrix product-based mode product to semi-tensor mode product. As it permits the connection of two factors with different dimensionality, more flexible and compact tensor decompositions can be obtained with smaller sizes of factors. Tucker decomposition, Tensor Train (TT) and Tensor Ring (TR) are common decomposition for low rank compression of deep neural networks. The semi-tensor product is applied to these tensor decompositions to obtained their generalized versions, i.e., semi-tensor Tucker decomposition (STTu), semi-tensor train(STT) and semi-tensor ring (STR). Experimental results show the STTu, STT and STR achieve higher compression factors than the conventional tensor decompositions with the same accuracy but less training times in ResNet and WideResNetcompression. With 2% accuracy degradation, the TT-RN (rank = 14) and the TR-WRN (rank = 16) only obtain 3 times and99t times compression factors while the STT-RN (rank = 14) and the STR-WRN (rank = 16) achieve 9 times and 179 times compression factors, respectively.
4.7CVMay 31, 2021
Adaptive Feature Alignment for Adversarial TrainingTao Wang, Ruixin Zhang, Xingyu Chen et al.
Recent studies reveal that Convolutional Neural Networks (CNNs) are typically vulnerable to adversarial attacks, which pose a threat to security-sensitive applications. Many adversarial defense methods improve robustness at the cost of accuracy, raising the contradiction between standard and adversarial accuracies. In this paper, we observe an interesting phenomenon that feature statistics change monotonically and smoothly w.r.t the rising of attacking strength. Based on this observation, we propose the adaptive feature alignment (AFA) to generate features of arbitrary attacking strengths. Our method is trained to automatically align features of arbitrary attacking strength. This is done by predicting a fusing weight in a dual-BN architecture. Unlike previous works that need to either retrain the model or manually tune a hyper-parameters for different attacking strengths, our method can deal with arbitrary attacking strengths with a single model without introducing any hyper-parameter. Importantly, our method improves the model robustness against adversarial samples without incurring much loss in standard accuracy. Experiments on CIFAR-10, SVHN, and tiny-ImageNet datasets demonstrate that our method outperforms the state-of-the-art under a wide range of attacking strengths.
3.1LGMay 31, 2021
Dominant Patterns: Critical Features Hidden in Deep Neural NetworksZhixing Ye, Shaofei Qin, Sizhe Chen et al.
In this paper, we find the existence of critical features hidden in Deep NeuralNetworks (DNNs), which are imperceptible but can actually dominate the outputof DNNs. We call these features dominant patterns. As the name suggests, for a natural image, if we add the dominant pattern of a DNN to it, the output of this DNN is determined by the dominant pattern instead of the original image, i.e., DNN's prediction is the same with the dominant pattern's. We design an algorithm to find such patterns by pursuing the insensitivity in the feature space. A direct application of the dominant patterns is the Universal Adversarial Perturbations(UAPs). Numerical experiments show that the found dominant patterns defeat state-of-the-art UAP methods, especially in label-free settings. In addition, dominant patterns are proved to have the potential to attack downstream tasks in which DNNs share the same backbone. We claim that DNN-specific dominant patterns reveal some essential properties of a DNN and are of great importance for its feature analysis and robustness enhancement.
Towards Unbiased Random Features with Lower Variance For Stationary Indefinite KernelsQin Luo, Kun Fang, Jie Yang et al.
Random Fourier Features (RFF) demonstrate wellappreciated performance in kernel approximation for largescale situations but restrict kernels to be stationary and positive definite. And for non-stationary kernels, the corresponding RFF could be converted to that for stationary indefinite kernels when the inputs are restricted to the unit sphere. Numerous methods provide accessible ways to approximate stationary but indefinite kernels. However, they are either biased or possess large variance. In this article, we propose the generalized orthogonal random features, an unbiased estimation with lower variance.Experimental results on various datasets and kernels verify that our algorithm achieves lower variance and approximation error compared with the existing kernel approximation methods. With better approximation to the originally selected kernels, improved classification accuracy and regression ability is obtained with our approximation algorithm in the framework of support vector machine and regression.
Weighted Neural Tangent Kernel: A Generalized and Improved Network-Induced KernelLei Tan, Shutong Wu, Xiaolin Huang
The Neural Tangent Kernel (NTK) has recently attracted intense study, as it describes the evolution of an over-parameterized Neural Network (NN) trained by gradient descent. However, it is now well-known that gradient descent is not always a good optimizer for NNs, which can partially explain the unsatisfactory practical performance of the NTK regression estimator. In this paper, we introduce the Weighted Neural Tangent Kernel (WNTK), a generalized and improved tool, which can capture an over-parameterized NN's training dynamics under different optimizers. Theoretically, in the infinite-width limit, we prove: i) the stability of the WNTK at initialization and during training, and ii) the equivalence between the WNTK regression estimator and the corresponding NN estimator with different learning rates on different parameters. With the proposed weight update algorithm, both empirical and analytical WNTKs outperform the corresponding NTKs in numerical experiments.
Low Dimensional Landscape Hypothesis is True: DNNs can be Trained in Tiny SubspacesTao Li, Lei Tan, Qinghua Tao et al.
Deep neural networks (DNNs) usually contain massive parameters, but there is redundancy such that it is guessed that the DNNs could be trained in low-dimensional subspaces. In this paper, we propose a Dynamic Linear Dimensionality Reduction (DLDR) based on low-dimensional properties of the training trajectory. The reduction is efficient, which is supported by comprehensive experiments: optimization in 40 dimensional spaces can achieve comparable performance as regular training over thousands or even millions of parameters. Since there are only a few optimization variables, we develop a quasi-Newton-based algorithm and also obtain robustness against label noises, which are two follow-up experiments to show the advantages of finding low-dimensional subspaces.
4.4LGFeb 20, 2021
Measuring the Transferability of $\ell_\infty$ Attacks by the $\ell_2$ NormSizhe Chen, Qinghua Tao, Zhixing Ye et al.
Deep neural networks could be fooled by adversarial examples with trivial differences to original samples. To keep the difference imperceptible in human eyes, researchers bound the adversarial perturbations by the $\ell_\infty$ norm, which is now commonly served as the standard to align the strength of different attacks for a fair comparison. However, we propose that using the $\ell_\infty$ norm alone is not sufficient in measuring the attack strength, because even with a fixed $\ell_\infty$ distance, the $\ell_2$ distance also greatly affects the attack transferability between models. Through the discovery, we reach more in-depth understandings towards the attack mechanism, i.e., several existing methods attack black-box models better partly because they craft perturbations with 70% to 130% larger $\ell_2$ distances. Since larger perturbations naturally lead to better transferability, we thereby advocate that the strength of attacks should be simultaneously measured by both the $\ell_\infty$ and $\ell_2$ norm. Our proposal is firmly supported by extensive experiments on ImageNet dataset from 7 attacks, 4 white-box models, and 9 black-box models.
18.7IVDec 10, 2020
Learning Tubule-Sensitive CNNs for Pulmonary Airway and Artery-Vein Segmentation in CTYulei Qin, Hao Zheng, Yun Gu et al.
Training convolutional neural networks (CNNs) for segmentation of pulmonary airway, artery, and vein is challenging due to sparse supervisory signals caused by the severe class imbalance between tubular targets and background. We present a CNNs-based method for accurate airway and artery-vein segmentation in non-contrast computed tomography. It enjoys superior sensitivity to tenuous peripheral bronchioles, arterioles, and venules. The method first uses a feature recalibration module to make the best use of features learned from the neural networks. Spatial information of features is properly integrated to retain relative priority of activated regions, which benefits the subsequent channel-wise recalibration. Then, attention distillation module is introduced to reinforce representation learning of tubular objects. Fine-grained details in high-resolution attention maps are passing down from one layer to its previous layer recursively to enrich context. Anatomy prior of lung context map and distance transform map is designed and incorporated for better artery-vein differentiation capacity. Extensive experiments demonstrated considerable performance gains brought by these components. Compared with state-of-the-art methods, our method extracted much more branches while maintaining competitive overall segmentation performance. Codes and models are available at http://www.pami.sjtu.edu.cn/News/56