CVMar 23, 2023
Explore the Power of Synthetic Data on Few-shot Object DetectionShaobo Lin, Kun Wang, Xingyu Zeng et al.
Few-shot object detection (FSOD) aims to expand an object detector for novel categories given only a few instances for training. The few training samples restrict the performance of FSOD model. Recent text-to-image generation models have shown promising results in generating high-quality images. How applicable these synthetic images are for FSOD tasks remains under-explored. This work extensively studies how synthetic images generated from state-of-the-art text-to-image generators benefit FSOD tasks. We focus on two perspectives: (1) How to use synthetic data for FSOD? (2) How to find representative samples from the large-scale synthetic dataset? We design a copy-paste-based pipeline for using synthetic data. Specifically, saliency object detection is applied to the original generated image, and the minimum enclosing box is used for cropping the main object based on the saliency map. After that, the cropped object is randomly pasted on the image, which comes from the base dataset. We also study the influence of the input text of text-to-image generator and the number of synthetic images used. To construct a representative synthetic training dataset, we maximize the diversity of the selected images via a sample-based and cluster-based method. However, the severe problem of high false positives (FP) ratio of novel categories in FSOD can not be solved by using synthetic data. We propose integrating CLIP, a zero-shot recognition model, into the FSOD pipeline, which can filter 90% of FP by defining a threshold for the similarity score between the detected object and the text of the predicted category. Extensive experiments on PASCAL VOC and MS COCO validate the effectiveness of our method, in which performance gain is up to 21.9% compared to the few-shot baseline.
CVFeb 28, 2023
An Effective Crop-Paste Pipeline for Few-shot Object DetectionShaobo Lin, Kun Wang, Xingyu Zeng et al.
Few-shot object detection (FSOD) aims to expand an object detector for novel categories given only a few instances for training. However, detecting novel categories with only a few samples usually leads to the problem of misclassification. In FSOD, we notice the false positive (FP) of novel categories is prominent, in which the base categories are often recognized as novel ones. To address this issue, a novel data augmentation pipeline that Crops the Novel instances and Pastes them on the selected Base images, called CNPB, is proposed. There are two key questions to be answered: (1) How to select useful base images? and (2) How to combine novel and base data? We design a multi-step selection strategy to find useful base data. Specifically, we first discover the base images which contain the FP of novel categories and select a certain amount of samples from them for the base and novel categories balance. Then the bad cases, such as the base images that have unlabeled ground truth or easily confused base instances, are removed by using CLIP. Finally, the same category strategy is adopted, in which a novel instance with category n is pasted on the base image with the FP of n. During combination, a novel instance is cropped and randomly down-sized, and thus pasted at the assigned optimal location from the randomly generated candidates in a selected base image. Our method is simple yet effective and can be easy to plug into existing FSOD methods, demonstrating significant potential for use. Extensive experiments on PASCAL VOC and MS COCO validate the effectiveness of our method.
CVOct 12, 2022
A Unified Framework with Meta-dropout for Few-shot LearningShaobo Lin, Xingyu Zeng, Rui Zhao
Conventional training of deep neural networks usually requires a substantial amount of data with expensive human annotations. In this paper, we utilize the idea of meta-learning to explain two very different streams of few-shot learning, i.e., the episodic meta-learning-based and pre-train finetune-based few-shot learning, and form a unified meta-learning framework. In order to improve the generalization power of our framework, we propose a simple yet effective strategy named meta-dropout, which is applied to the transferable knowledge generalized from base categories to novel categories. The proposed strategy can effectively prevent neural units from co-adapting excessively in the meta-training stage. Extensive experiments on the few-shot object detection and few-shot image classification datasets, i.e., Pascal VOC, MS COCO, CUB, and mini-ImageNet, validate the effectiveness of our method.
CVJan 26, 2023
Explore the Power of Dropout on Few-shot LearningShaobo Lin, Xingyu Zeng, Rui Zhao
The generalization power of the pre-trained model is the key for few-shot deep learning. Dropout is a regularization technique used in traditional deep learning methods. In this paper, we explore the power of dropout on few-shot learning and provide some insights about how to use it. Extensive experiments on the few-shot object detection and few-shot image classification datasets, i.e., Pascal VOC, MS COCO, CUB, and mini-ImageNet, validate the effectiveness of our method.
LGMar 22, 2018
Learning through deterministic assignment of hidden parametersJian Fang, Shaobo Lin, Zongben Xu
Supervised learning frequently boils down to determining hidden and bright parameters in a parameterized hypothesis space based on finite input-output samples. The hidden parameters determine the attributions of hidden predictors or the nonlinear mechanism of an estimator, while the bright parameters characterize how hidden predictors are linearly combined or the linear mechanism. In traditional learning paradigm, hidden and bright parameters are not distinguished and trained simultaneously in one learning process. Such an one-stage learning (OSL) brings a benefit of theoretical analysis but suffers from the high computational burden. To overcome this difficulty, a two-stage learning (TSL) scheme, featured by learning through deterministic assignment of hidden parameters (LtDaHP) was proposed, which suggests to deterministically generate the hidden parameters by using minimal Riesz energy points on a sphere and equally spaced points in an interval. We theoretically show that with such deterministic assignment of hidden parameters, LtDaHP with a neural network realization almost shares the same generalization performance with that of OSL. We also present a series of simulations and application examples to support the outperformance of LtDaHP
OCMar 1, 2018
Global Convergence of Block Coordinate Descent in Deep LearningJinshan Zeng, Tim Tsz-Kit Lau, Shaobo Lin et al.
Deep learning has aroused extensive attention due to its great empirical success. The efficiency of the block coordinate descent (BCD) methods has been recently demonstrated in deep neural network (DNN) training. However, theoretical studies on their convergence properties are limited due to the highly nonconvex nature of DNN training. In this paper, we aim at providing a general methodology for provable convergence guarantees for this type of methods. In particular, for most of the commonly used DNN training models involving both two- and three-splitting schemes, we establish the global convergence to a critical point at a rate of ${\cal O}(1/k)$, where $k$ is the number of iterations. The results extend to general loss functions which have Lipschitz continuous gradients and deep residual networks (ResNets). Our key development adds several new elements to the Kurdyka-Łojasiewicz inequality framework that enables us to carry out the global convergence analysis of BCD in the general scenario of deep learning.
LGApr 30, 2016
Constructive neural network learningShaobo Lin, Jinshan Zeng, Xiaoqin Zhang
In this paper, we aim at developing scalable neural network-type learning systems. Motivated by the idea of "constructive neural networks" in approximation theory, we focus on "constructing" rather than "training" feed-forward neural networks (FNNs) for learning, and propose a novel FNNs learning system called the constructive feed-forward neural network (CFN). Theoretically, we prove that the proposed method not only overcomes the classical saturation problem for FNN approximation, but also reaches the optimal learning rate when the regression function is smooth, while the state-of-the-art learning rates established for traditional FNNs are only near optimal (up to a logarithmic factor). A series of numerical simulations are provided to show the efficiency and feasibility of CFN via comparing with the well-known regularized least squares (RLS) with Gaussian kernel and extreme learning machine (ELM).
LGApr 20, 2016
Greedy Criterion in Orthogonal Greedy LearningLin Xu, Shaobo Lin, Jinshan Zeng et al.
Orthogonal greedy learning (OGL) is a stepwise learning scheme that starts with selecting a new atom from a specified dictionary via the steepest gradient descent (SGD) and then builds the estimator through orthogonal projection. In this paper, we find that SGD is not the unique greedy criterion and introduce a new greedy criterion, called "$δ$-greedy threshold" for learning. Based on the new greedy criterion, we derive an adaptive termination rule for OGL. Our theoretical study shows that the new learning scheme can achieve the existing (almost) optimal learning rate of OGL. Plenty of numerical experiments are provided to support that the new scheme can achieve almost optimal generalization performance, while requiring less computation than OGL.
LGJan 23, 2016
Divide and Conquer Local Average RegressionXiangyu Chang, Shaobo Lin, Yao Wang
The divide and conquer strategy, which breaks a massive data set into a se- ries of manageable data blocks, and then combines the independent results of data blocks to obtain a final decision, has been recognized as a state-of-the-art method to overcome challenges of massive data analysis. In this paper, we merge the divide and conquer strategy with local average regression methods to infer the regressive relationship of input-output pairs from a massive data set. After theoretically analyzing the pros and cons, we find that although the divide and conquer local average regression can reach the optimal learning rate, the restric- tion to the number of data blocks is a bit strong, which makes it only feasible for small number of data blocks. We then propose two variants to lessen (or remove) this restriction. Our results show that these variants can achieve the optimal learning rate with much milder restriction (or without such restriction). Extensive experimental studies are carried out to verify our theoretical assertions.
LGMay 17, 2015
Shrinkage degree in $L_2$-re-scale boosting for regressionLin Xu, Shaobo Lin, Yao Wang et al.
Re-scale boosting (RBoosting) is a variant of boosting which can essentially improve the generalization performance of boosting learning. The key feature of RBoosting lies in introducing a shrinkage degree to re-scale the ensemble estimate in each gradient-descent step. Thus, the shrinkage degree determines the performance of RBoosting. The aim of this paper is to develop a concrete analysis concerning how to determine the shrinkage degree in $L_2$-RBoosting. We propose two feasible ways to select the shrinkage degree. The first one is to parameterize the shrinkage degree and the other one is to develope a data-driven approach of it. After rigorously analyzing the importance of the shrinkage degree in $L_2$-RBoosting learning, we compare the pros and cons of the proposed methods. We find that although these approaches can reach the same learning rates, the structure of the final estimate of the parameterized approach is better, which sometimes yields a better generalization capability when the number of sample is finite. With this, we recommend to parameterize the shrinkage degree of $L_2$-RBoosting. To this end, we present an adaptive parameter-selection strategy for shrinkage degree and verify its feasibility through both theoretical analysis and numerical verification. The obtained results enhance the understanding of RBoosting and further give guidance on how to use $L_2$-RBoosting for regression tasks.
LGMay 6, 2015
Re-scale boosting for regression and classificationShaobo Lin, Yao Wang, Lin Xu
Boosting is a learning scheme that combines weak prediction rules to produce a strong composite estimator, with the underlying intuition that one can obtain accurate prediction rules by combining "rough" ones. Although boosting is proved to be consistent and overfitting-resistant, its numerical convergence rate is relatively slow. The aim of this paper is to develop a new boosting strategy, called the re-scale boosting (RBoosting), to accelerate the numerical convergence rate and, consequently, improve the learning performance of boosting. Our studies show that RBoosting possesses the almost optimal numerical convergence rate in the sense that, up to a logarithmic factor, it can reach the minimax nonlinear approximation rate. We then use RBoosting to tackle both the classification and regression problems, and deduce a tight generalization error estimate. The theoretical and experimental results show that RBoosting outperforms boosting in terms of generalization.
NAJul 12, 2015
A Gauss-Seidel Iterative Thresholding Algorithm for lq Regularized Least Squares RegressionJinshan Zeng, Zhimin Peng, Shaobo Lin
In recent studies on sparse modeling, $l_q$ ($0<q<1$) regularized least squares regression ($l_q$LS) has received considerable attention due to its superiorities on sparsity-inducing and bias-reduction over the convex counterparts. In this paper, we propose a Gauss-Seidel iterative thresholding algorithm (called GAITA) for solution to this problem. Different from the classical iterative thresholding algorithms using the Jacobi updating rule, GAITA takes advantage of the Gauss-Seidel rule to update the coordinate coefficients. Under a mild condition, we can justify that the support set and sign of an arbitrary sequence generated by GAITA will converge within finite iterations. This convergence property together with the Kurdyka-Łojasiewicz property of ($l_q$LS) naturally yields the strong convergence of GAITA under the same condition as above, which is generally weaker than the condition for the convergence of the classical iterative thresholding algorithms. Furthermore, we demonstrate that GAITA converges to a local minimizer under certain additional conditions. A set of numerical experiments are provided to show the effectiveness, particularly, much faster convergence of GAITA as compared with the classical iterative thresholding algorithms.
LGMar 7, 2015
Model selection of polynomial kernel regressionShaobo Lin, Xingping Sun, Zongben Xu et al.
Polynomial kernel regression is one of the standard and state-of-the-art learning strategies. However, as is well known, the choices of the degree of polynomial kernel and the regularization parameter are still open in the realm of model selection. The first aim of this paper is to develop a strategy to select these parameters. On one hand, based on the worst-case learning rate analysis, we show that the regularization term in polynomial kernel regression is not necessary. In other words, the regularization parameter can decrease arbitrarily fast when the degree of the polynomial kernel is suitable tuned. On the other hand,taking account of the implementation of the algorithm, the regularization term is required. Summarily, the effect of the regularization term in polynomial kernel regression is only to circumvent the " ill-condition" of the kernel matrix. Based on this, the second purpose of this paper is to propose a new model selection strategy, and then design an efficient learning algorithm. Both theoretical and experimental analysis show that the new strategy outperforms the previous one. Theoretically, we prove that the new learning strategy is almost optimal if the regression function is smooth. Experimentally, it is shown that the new strategy can significantly reduce the computational burden without loss of generalization capability.
LGFeb 14, 2015
Nonparametric regression using needlet kernels for spherical dataShaobo Lin
Needlets have been recognized as state-of-the-art tools to tackle spherical data, due to their excellent localization properties in both spacial and frequency domains. This paper considers developing kernel methods associated with the needlet kernel for nonparametric regression problems whose predictor variables are defined on a sphere. Due to the localization property in the frequency domain, we prove that the regularization parameter of the kernel ridge regression associated with the needlet kernel can decrease arbitrarily fast. A natural consequence is that the regularization term for the kernel ridge regression is not necessary in the sense of rate optimality. Based on the excellent localization property in the spacial domain further, we also prove that all the $l^{q}$ $(01\leq q < \infty)$ kernel regularization estimates associated with the needlet kernel, including the kernel lasso estimate and the kernel bridge estimate, possess almost the same generalization capability for a large range of regularization parameters in the sense of rate optimality. This finding tentatively reveals that, if the needlet kernel is utilized, then the choice of $q$ might not have a strong impact in terms of the generalization capability in some modeling contexts. From this perspective, $q$ can be arbitrarily specified, or specified merely by other no generalization criteria like smoothness, computational complexity, sparsity, etc..
LGNov 13, 2014
Greedy metrics in orthogonal greedy learningLin Xu, Shaobo Lin, Jinshan Zeng et al.
Orthogonal greedy learning (OGL) is a stepwise learning scheme that adds a new atom from a dictionary via the steepest gradient descent and build the estimator via orthogonal projecting the target function to the space spanned by the selected atoms in each greedy step. Here, "greed" means choosing a new atom according to the steepest gradient descent principle. OGL then avoids the overfitting/underfitting by selecting an appropriate iteration number. In this paper, we point out that the overfitting/underfitting can also be avoided via redefining "greed" in OGL. To this end, we introduce a new greedy metric, called $δ$-greedy thresholds, to refine "greed" and theoretically verifies its feasibility. Furthermore, we reveals that such a greedy metric can bring an adaptive termination rule on the premise of maintaining the prominent learning performance of OGL. Our results show that the steepest gradient descent is not the unique greedy metric of OGL and some other more suitable metric may lessen the hassle of model-selection of OGL.
LGSep 18, 2014
Learning and approximation capability of orthogonal super greedy algorithmJian Fang, Shaobo Lin, Zongben Xu
We consider the approximation capability of orthogonal super greedy algorithms (OSGA) and its applications in supervised learning. OSGA is concerned with selecting more than one atoms in each iteration step, which, of course, greatly reduces the computational burden when compared with the conventional orthogonal greedy algorithm (OGA). We prove that even for function classes that are not the convex hull of the dictionary, OSGA does not degrade the approximation capability of OGA provided the dictionary is incoherent. Based on this, we deduce a tight generalization error bound for OSGA learning. Our results show that in the realm of supervised learning, OSGA provides a possibility to further reduce the computational burden of OGA in the premise of maintaining its prominent generalization capability.
LGJan 24, 2014
Is Extreme Learning Machine Feasible? A Theoretical Assessment (Part II)Shaobo Lin, Xia Liu, Jian Fang et al.
An extreme learning machine (ELM) can be regarded as a two stage feed-forward neural network (FNN) learning system which randomly assigns the connections with and within hidden neurons in the first stage and tunes the connections with output neurons in the second stage. Therefore, ELM training is essentially a linear learning problem, which significantly reduces the computational burden. Numerous applications show that such a computation burden reduction does not degrade the generalization capability. It has, however, been open that whether this is true in theory. The aim of our work is to study the theoretical feasibility of ELM by analyzing the pros and cons of ELM. In the previous part on this topic, we pointed out that via appropriate selection of the activation function, ELM does not degrade the generalization capability in the expectation sense. In this paper, we launch the study in a different direction and show that the randomness of ELM also leads to certain negative consequences. On one hand, we find that the randomness causes an additional uncertainty problem of ELM, both in approximation and learning. On the other hand, we theoretically justify that there also exists an activation function such that the corresponding ELM degrades the generalization capability. In particular, we prove that the generalization capability of ELM with Gaussian kernel is essentially worse than that of FNN with Gaussian kernel. To facilitate the use of ELM, we also provide a remedy to such a degradation. We find that the well-developed coefficient regularization technique can essentially improve the generalization capability. The obtained results reveal the essential characteristic of ELM and give theoretical guidance concerning how to use ELM.
LGDec 19, 2013
Learning rates of $l^q$ coefficient regularization learning with Gaussian kernelShaobo Lin, Jinshan Zeng, Jian Fang et al.
Regularization is a well recognized powerful strategy to improve the performance of a learning machine and $l^q$ regularization schemes with $0<q<\infty$ are central in use. It is known that different $q$ leads to different properties of the deduced estimators, say, $l^2$ regularization leads to smooth estimators while $l^1$ regularization leads to sparse estimators. Then, how does the generalization capabilities of $l^q$ regularization learning vary with $q$? In this paper, we study this problem in the framework of statistical learning theory and show that implementing $l^q$ coefficient regularization schemes in the sample dependent hypothesis space associated with Gaussian kernel can attain the same almost optimal learning rates for all $0<q<\infty$. That is, the upper and lower bounds of learning rates for $l^q$ regularization learning are asymptotically identical for all $0<q<\infty$. Our finding tentatively reveals that, in some modeling contexts, the choice of $q$ might not have a strong impact with respect to the generalization capability. From this perspective, $q$ can be arbitrarily specified, or specified merely by other no generalization criteria like smoothness, computational complexity, sparsity, etc..
LGJul 25, 2013
Does generalization performance of $l^q$ regularization learning depend on $q$? A negative exampleShaobo Lin, Chen Xu, Jingshan Zeng et al.
$l^q$-regularization has been demonstrated to be an attractive technique in machine learning and statistical modeling. It attempts to improve the generalization (prediction) capability of a machine (model) through appropriately shrinking its coefficients. The shape of a $l^q$ estimator differs in varying choices of the regularization order $q$. In particular, $l^1$ leads to the LASSO estimate, while $l^{2}$ corresponds to the smooth ridge regression. This makes the order $q$ a potential tuning parameter in applications. To facilitate the use of $l^{q}$-regularization, we intend to seek for a modeling strategy where an elaborative selection on $q$ is avoidable. In this spirit, we place our investigation within a general framework of $l^{q}$-regularized kernel learning under a sample dependent hypothesis space (SDHS). For a designated class of kernel functions, we show that all $l^{q}$ estimators for $0< q < \infty$ attain similar generalization error bounds. These estimated bounds are almost optimal in the sense that up to a logarithmic factor, the upper and lower bounds are asymptotically identical. This finding tentatively reveals that, in some modeling contexts, the choice of $q$ might not have a strong impact in terms of the generalization capability. From this perspective, $q$ can be arbitrarily specified, or specified merely by other no generalization criteria like smoothness, computational complexity, sparsity, etc..