MLJul 30, 2020Code
Instrument variable detection with graph learning : an application to high dimensional GIS-census data for house pricingNing Xu, Timothy C. G. Fisher, Jian Hong
Endogeneity bias and instrument variable validation have always been important topics in statistics and econometrics. In the era of big data, such issues typically combine with dimensionality issues and, hence, require even more attention. In this paper, we merge two well-known tools from machine learning and biostatistics---variable selection algorithms and probablistic graphs---to estimate house prices and the corresponding causal structure using 2010 data on Sydney. The estimation uses a 200-gigabyte ultrahigh dimensional database consisting of local school data, GIS information, census data, house characteristics and other socio-economic records. Using "big data", we show that it is possible to perform a data-driven instrument selection efficiently and purge out the invalid instruments. Our approach improves the sparsity of variable selection, stability and robustness in the presence of high dimensionality, complicated causal structures and the consequent multicollinearity, and recovers a sparse and intuitive causal structure. The approach also reveals an efficiency and effectiveness in endogeneity detection, instrument validation, weak instrument pruning and the selection of valid instruments. From the perspective of machine learning, the estimation results both align with and confirms the facts of Sydney house market, the classical economic theories and the previous findings of simultaneous equations modeling. Moreover, the estimation results are consistent with and supported by classical econometric tools such as two-stage least square regression and different instrument tests. All the code may be found at \url{https://github.com/isaac2math/solar_graph_learning}.
MLJul 30, 2020Code
Accuracy and stability of solar variable selection comparison under complicated dependence structuresNing Xu, Timothy C. G. Fisher, Jian Hong
In this paper we focus on the empirical variable-selection peformance of subsample-ordered least angle regression (Solar) -- a novel ultrahigh dimensional redesign of lasso -- on the empirical data with complicated dependence structures and, hence, severe multicollinearity and grouping effect issues. Previous researches show that Solar largely alleviates several known high-dimensional issues with least-angle regression and $\mathcal{L}_1$ shrinkage. Also, With the same computation load, solar yields substantiali mprovements over two lasso solvers (least-angle regression for lasso and coordinate-descent) in terms of the sparsity (37-64\% reduction in the average number of selected variables), stability and accuracy of variable selection. Simulations also demonstrate that solar enhances the robustness of variable selection to different settings of the irrepresentable condition and to variations in the dependence structures assumed in regression analysis. To confirm that the improvements are also available for empirical researches, we choose the prostate cancer data and the Sydney house price data and apply two lasso solvers, elastic net and Solar on them for comparison. The results shows that (i) lasso is affected by the grouping effect and randomly drop variables with high correlations, resulting unreliable and uninterpretable results; (ii) elastic net is more robust to grouping effect; however, it completely lose variable-selection sparsity when the dependence structure of the data is complicated; (iii) solar demonstrates its superior robustness to complicated dependence structures and grouping effect, returning variable-selection results with better stability and sparsity. The code can be found at https://github.com/isaac2math/solar_application
MLJul 30, 2020
Solar: $L_0$ solution path averaging for fast and accurate variable selection in high-dimensional dataNing Xu, Timothy C. G. Fisher
We propose a new variable selection algorithm, subsample-ordered least-angle regression (solar), and its coordinate descent generalization, solar-cd. Solar re-constructs lasso paths using the $L_0$ norm and averages the resulting solution paths across subsamples. Path averaging retains the ranking information of the informative variables while averaging out sensitivity to high dimensionality, improving variable selection stability, efficiency, and accuracy. We prove that: (i) with a high probability, path averaging perfectly separates informative variables from redundant variables on the average $L_0$ path; (ii) solar variable selection is consistent and accurate; and (iii) the probability that solar omits weak signals is controllable for finite sample size. We also demonstrate that: (i) solar yields, with less than $1/3$ of the lasso computation load, substantial improvements over lasso in terms of the sparsity (64-84\% reduction in redundant variable selection) and accuracy of variable selection; (ii) compared with the lasso safe/strong rule and variable screening, solar largely avoids selection of redundant variables and rejection of informative variables in the presence of complicated dependence structures; (iii) the sparsity and stability of solar conserves residual degrees of freedom for data-splitting hypothesis testing, improving the accuracy of post-selection inference on weak signals with limited $n$; (iv) replacing lasso with solar in bootstrap selection (e.g., bolasso or stability selection) produces a multi-layer variable ranking scheme that improves selection sparsity and ranking accuracy with the computation load of only one lasso realization; and (v) given the computation resources, solar bootstrap selection is substantially faster (98\% lower computation time) than the theoretical maximum speedup for parallelized bootstrap lasso (confirmed by Amdahl's law).
MLJul 30, 2020
Rademacher upper bounds for cross-validation errors with an application to the lassoNing Xu, Timothy C. G. Fisher, Jian Hong
We establish a general upper bound for $K$-fold cross-validation ($K$-CV) errors that can be adapted to many $K$-CV-based estimators and learning algorithms. Based on Rademacher complexity of the model and the Orlicz-$Ψ_ν$ norm of the error process, the CV error upper bound applies to both light-tail and heavy-tail error distributions. We also extend the CV error upper bound to $β$-mixing data using the technique of independent blocking. We provide a Python package (\texttt{CVbound}, \url{https://github.com/isaac2math}) for computing the CV error upper bound in $K$-CV-based algorithms. Using the lasso as an example, we demonstrate in simulations that the upper bounds are tight and stable across different parameter settings and random seeds. As well as accurately bounding the CV errors for the lasso, the minimizer of the new upper bounds can be used as a criterion for variable selection. Compared with the CV-error minimizer, simulations show that tuning the lasso penalty parameter according to the minimizer of the upper bound yields a more sparse and more stable model that retains all of the relevant variables.
MLMay 20, 2017
$\left( β, \varpi \right)$-stability for cross-validation and the choice of the number of foldsNing Xu, Jian Hong, Timothy C. G. Fisher
In this paper, we introduce a new concept of stability for cross-validation, called the $\left( β, \varpi \right)$-stability, and use it as a new perspective to build the general theory for cross-validation. The $\left( β, \varpi \right)$-stability mathematically connects the generalization ability and the stability of the cross-validated model via the Rademacher complexity. Our result reveals mathematically the effect of cross-validation from two sides: on one hand, cross-validation picks the model with the best empirical generalization ability by validating all the alternatives on test sets; on the other hand, cross-validation may compromise the stability of the model selection by causing subsampling error. Moreover, the difference between training and test errors in q\textsuperscript{th} round, sometimes referred to as the generalization error, might be autocorrelated on q. Guided by the ideas above, the $\left( β, \varpi \right)$-stability help us derivd a new class of Rademacher bounds, referred to as the one-round/convoluted Rademacher bounds, for the stability of cross-validation in both the i.i.d.\ and non-i.i.d.\ cases. For both light-tail and heavy-tail losses, the new bounds quantify the stability of the one-round/average test error of the cross-validated model in terms of its one-round/average training error, the sample sizes $n$, number of folds $K$, the tail property of the loss (encoded as Orlicz-$Ψ_ν$ norms) and the Rademacher complexity of the model class $Λ$. The new class of bounds not only quantitatively reveals the stability of the generalization ability of the cross-validated model, it also shows empirically the optimal choice for number of folds $K$, at which the upper bound of the one-round/average test error is lowest, or, to put it in another way, where the test error is most stable.
MLOct 18, 2016
Generalization error minimization: a new approach to model evaluation and selection with an application to penalized regressionNing Xu, Jian Hong, Timothy C. G. Fisher
We study model evaluation and model selection from the perspective of generalization ability (GA): the ability of a model to predict outcomes in new samples from the same population. We believe that GA is one way formally to address concerns about the external validity of a model. The GA of a model estimated on a sample can be measured by its empirical out-of-sample errors, called the generalization errors (GE). We derive upper bounds for the GE, which depend on sample sizes, model complexity and the distribution of the loss function. The upper bounds can be used to evaluate the GA of a model, ex ante. We propose using generalization error minimization (GEM) as a framework for model selection. Using GEM, we are able to unify a big class of penalized regression estimators, including lasso, ridge and bridge, under the same set of assumptions. We establish finite-sample and asymptotic properties (including $\mathcal{L}_2$-consistency) of the GEM estimator for both the $n \geqslant p$ and the $n < p$ cases. We also derive the $\mathcal{L}_2$-distance between the penalized and corresponding unpenalized regression estimates. In practice, GEM can be implemented by validation or cross-validation. We show that the GE bounds can be used for selecting the optimal number of folds in $K$-fold cross-validation. We propose a variant of $R^2$, the $GR^2$, as a measure of GA, which considers both both in-sample and out-of-sample goodness of fit. Simulations are used to demonstrate our key results.
MLSep 12, 2016
Finite-sample and asymptotic analysis of generalization ability with an application to penalized regressionNing Xu, Jian Hong, Timothy C. G. Fisher
In this paper, we study the performance of extremum estimators from the perspective of generalization ability (GA): the ability of a model to predict outcomes in new samples from the same population. By adapting the classical concentration inequalities, we derive upper bounds on the empirical out-of-sample prediction errors as a function of the in-sample errors, in-sample data size, heaviness in the tails of the error distribution, and model complexity. We show that the error bounds may be used for tuning key estimation hyper-parameters, such as the number of folds $K$ in cross-validation. We also show how $K$ affects the bias-variance trade-off for cross-validation. We demonstrate that the $\mathcal{L}_2$-norm difference between penalized and the corresponding un-penalized regression estimates is directly explained by the GA of the estimates and the GA of empirical moment conditions. Lastly, we prove that all penalized regression estimates are $L_2$-consistent for both the $n \geqslant p$ and the $n < p$ cases. Simulations are used to demonstrate key results. Keywords: generalization ability, upper bound of generalization error, penalized regression, cross-validation, bias-variance trade-off, $\mathcal{L}_2$ difference between penalized and unpenalized regression, lasso, high-dimensional data.
MLJun 1, 2016
Model selection consistency from the perspective of generalization ability and VC theory with an application to LassoNing Xu, Jian Hong, Timothy C. G. Fisher
Model selection is difficult to analyse yet theoretically and empirically important, especially for high-dimensional data analysis. Recently the least absolute shrinkage and selection operator (Lasso) has been applied in the statistical and econometric literature. Consis- tency of Lasso has been established under various conditions, some of which are difficult to verify in practice. In this paper, we study model selection from the perspective of generalization ability, under the framework of structural risk minimization (SRM) and Vapnik-Chervonenkis (VC) theory. The approach emphasizes the balance between the in-sample and out-of-sample fit, which can be achieved by using cross-validation to select a penalty on model complexity. We show that an exact relationship exists between the generalization ability of a model and model selection consistency. By implementing SRM and the VC inequality, we show that Lasso is L2-consistent for model selection under assumptions similar to those imposed on OLS. Furthermore, we derive a probabilistic bound for the distance between the penalized extremum estimator and the extremum estimator without penalty, which is dominated by overfitting. We also propose a new measurement of overfitting, GR2, based on generalization ability, that converges to zero if model selection is consistent. Using simulations, we demonstrate that the proposed CV-Lasso algorithm performs well in terms of model selection and overfitting control.