Wenjing Liao

h-index: 18

30 papers

1,387 citations

Novelty: 62%

AI Score: 57

Ranked #5,216 of 194,257 authors (top 3%) · #1,428 in LG (top 4%)

30 Papers

13.8 · ML · Jun 9, 2022

Benefits of Overparameterized Convolutional Residual Networks: Function Approximation under Smoothness Constraint

Hao Liu, Minshuo Chen, Siawpeng Er et al. · gatech

Overparameterized neural networks enjoy great representation power on complex data, and more importantly yield sufficiently smooth output, which is crucial to their generalization and robustness. Most existing function approximation theories suggest that with sufficiently many parameters, neural networks can well approximate certain classes of functions in terms of the function value. The neural networks themselves, however, can be highly nonsmooth. To bridge this gap, we take convolutional residual networks (ConvResNets) as an example, and prove that large ConvResNets can not only approximate a target function in terms of function value, but also exhibit sufficient first-order smoothness. Moreover, we extend our theory to approximating functions supported on a low-dimensional manifold. Our theory partially justifies the benefits of using deep and wide networks in practice. Numerical experiments on adversarial robust image classification are provided to support our theory.

14.2 · ML · Mar 17, 2023

Deep Nonparametric Estimation of Intrinsic Data Structures by Chart Autoencoders: Generalization Error and Robustness

Hao Liu, Alex Havrilla, Rongjie Lai et al. · gatech

Autoencoders have demonstrated remarkable success in learning low-dimensional latent features of high-dimensional data across various applications. Assuming that data are sampled near a low-dimensional manifold, we employ chart autoencoders, which encode data into low-dimensional latent features on a collection of charts, preserving the topology and geometry of the data manifold. Our paper establishes statistical guarantees on the generalization error of chart autoencoders, and we demonstrate their denoising capabilities by considering $n$ noisy training samples, along with their noise-free counterparts, on a $d$-dimensional manifold. By training autoencoders, we show that chart autoencoders can effectively denoise the input data with normal noise. We prove that, under proper network architectures, chart autoencoders achieve a squared generalization error in the order of $\displaystyle n^{-\frac{2}{d+2}}\log^4 n$, which depends on the intrinsic dimension of the manifold and only weakly depends on the ambient dimension and noise level. We further extend our theory on data with noise containing both normal and tangential components, where chart autoencoders still exhibit a denoising effect for the normal component. As a special case, our theory also applies to classical autoencoders, as long as the data manifold has a global parametrization. Our results provide a solid theoretical foundation for the effectiveness of autoencoders, which is further validated through several numerical experiments.
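To make the stated rate concrete, here is a short worked calculation of the sample size it implies. It drops the $\log^4 n$ factor and all constants, so it is an order-of-magnitude sketch and not a statement taken from the paper; the target error $\epsilon$ and the values $d = 2$ and $D = 100$ are illustrative assumptions.

$$
n^{-\frac{2}{d+2}} \le \epsilon
\quad\Longleftrightarrow\quad
n \ge \epsilon^{-\frac{d+2}{2}},
\qquad
d = 2,\ \epsilon = 10^{-2}
\ \Longrightarrow\
n \ge \left(10^{-2}\right)^{-2} = 10^{4}.
$$

For comparison, if the exponent were governed by a hypothetical ambient dimension $D = 100$ in place of $d$, the same calculation would give $n \ge \epsilon^{-51} = 10^{102}$, which is why a rate that depends on the intrinsic dimension matters.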

5.1 · IT · Jul 6

Optimality of Gradient-MUSIC for spectral estimation

Albert Fannjiang, Weilin Li, Wenjing Liao

We introduce the Gradient-MUSIC algorithm for estimating the unknown frequencies and amplitudes of a nonharmonic signal from noisy time samples. While the classical MUSIC algorithm performs a computationally expensive search over a fine grid, Gradient-MUSIC is significantly more efficient and eliminates the need for discretization over a fine grid by using optimization techniques. It coarsely scans the 1D landscape to find initializations simultaneously for all frequencies, followed by parallelizable local refinement via gradient descent. We also analyze its performance when the noise level is sufficiently small and the signal frequencies are separated by at least $8π/m$, where $π/m$ is the standard resolution of this problem. Even though the 1D landscape is nonconvex, we prove a global convergence result for Gradient-MUSIC: coarse scanning provably finds suitable initialization and gradient descent converges at a linear rate. In addition to convergence results, we also upper bound the error between the true signal frequencies and amplitudes and those found by Gradient-MUSIC. For example, if the noise has $\ell^\infty$ norm at most $\varepsilon$, then the frequencies and amplitudes are recovered up to error at most $C\varepsilon/m$ and $C\varepsilon$ respectively for a universal $C>0$, which are minimax optimal in $m$, $\varepsilon$, and number of frequencies. Our theory can also handle stochastic noise with performance guarantees under nonstationary independent Gaussian noise. Our main approach is a comprehensive geometric analysis of the landscape, a perspective that has not been explored before.

3.3 · IT · Sep 21, 2014

MUSIC for Single-Snapshot Spectral Estimation: Stability and Super-resolution

Wenjing Liao, Albert Fannjiang

This paper studies the problem of line spectral estimation in the continuum of a bounded interval with one snapshot of array measurement. The single-snapshot measurement data is turned into a Hankel data matrix which admits the Vandermonde decomposition and is suitable for the MUSIC algorithm. The MUSIC algorithm amounts to finding the null space (the noise space) of the Hankel matrix, forming the noise-space correlation function and identifying the $s$ smallest local minima of the noise-space correlation as the frequency set. In the noise-free case, exact reconstruction is guaranteed for any arbitrary set of frequencies as long as the number of measurements is at least twice the number of distinct frequencies to be recovered. In the presence of noise, the stability analysis shows that the perturbation of the noise-space correlation is proportional to the spectral norm of the noise matrix as long as the latter is smaller than the smallest (nonzero) singular value of the noiseless Hankel data matrix. Under the assumption that frequencies are separated by at least twice the Rayleigh Length (RL), the stability of the noise-space correlation is proved by means of novel discrete Ingham inequalities which provide bounds on nonzero singular values of the noiseless Hankel data matrix. The numerical performance of MUSIC is tested in comparison with other algorithms such as BLO-OMP and SDP (TV-min). While BLO-OMP is the most stable algorithm for frequencies separated above 4 RL, MUSIC becomes the best performing one for frequencies separated between 2 RL and 3 RL. Also, MUSIC is more efficient than other methods. MUSIC truly shines when the frequency separation drops to 1 RL or below, when all other methods fail. Indeed, the resolution length of MUSIC decreases to zero as noise decreases to zero as a power law with an exponent much smaller than an upper bound established by Donoho.
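The three steps named in this abstract (Hankel matrix, noise space, smallest local minima of the noise-space correlation) can be sketched in NumPy as follows. This is a minimal illustration and not the authors' code: the function name `music_single_snapshot`, the sign convention $y_k = \sum_j x_j e^{2\pi i k \omega_j}$ with frequencies in $[0, 1)$, the default Hankel size `L = M // 2`, and the uniform search grid are all assumptions made for the example.

```python
import numpy as np

def music_single_snapshot(y, s, L=None, grid_size=4096):
    """Estimate s frequencies in [0, 1) from one snapshot y[k], k = 0..M-1.

    Hypothetical helper for illustration; assumes the model
    y[k] = sum_j x_j * exp(2j*pi*k*omega_j) (plus small noise).
    """
    y = np.asarray(y, dtype=complex)
    M = y.size
    if L is None:
        L = M // 2  # assumed default: roughly square Hankel matrix

    # Step 1: Hankel data matrix H[i, j] = y[i + j], shape (L+1, M-L).
    rows = np.arange(L + 1)[:, None]
    cols = np.arange(M - L)[None, :]
    H = y[rows + cols]

    # Step 2: noise space = left singular vectors beyond the first s.
    U, _, _ = np.linalg.svd(H, full_matrices=True)
    noise = U[:, s:]

    # Step 3: noise-space correlation R(w) = ||noise^H phi(w)|| / ||phi(w)||,
    # evaluated on a uniform grid, with phi(w)[k] = exp(2j*pi*k*w).
    grid = np.arange(grid_size) / grid_size
    steering = np.exp(2j * np.pi * np.outer(np.arange(L + 1), grid))
    R = np.linalg.norm(noise.conj().T @ steering, axis=0) / np.sqrt(L + 1)

    # Keep the s smallest local minima (the grid is treated as periodic).
    is_min = (R <= np.roll(R, 1)) & (R <= np.roll(R, -1))
    candidates = np.flatnonzero(is_min)
    best = candidates[np.argsort(R[candidates])[:s]]
    return np.sort(grid[best])

# Noise-free demo with three illustrative frequencies and amplitudes.
true_freqs = np.array([0.1, 0.25, 0.6])
amps = np.array([1.0, 0.8, 1.2])
k = np.arange(32)
y_demo = (amps * np.exp(2j * np.pi * np.outer(k, true_freqs))).sum(axis=1)
est_demo = music_single_snapshot(y_demo, s=3)
```

In this noise-free demo the estimates can only be as accurate as the grid spacing, `1 / grid_size`; the grid search is exactly the cost that a gradient-based refinement aims to avoid.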

11.3 · IT · Jun 25, 2011

Coherence-Pattern Guided Compressive Sensing with Unresolved Grids

A. Fannjiang, W. Liao

Highly coherent sensing matrices arise in discretization of continuum imaging problems such as radar and medical imaging when the grid spacing is below the Rayleigh threshold. Algorithms based on techniques of band exclusion (BE) and local optimization (LO) are proposed to deal with such coherent sensing matrices. These techniques are embedded in the existing compressed sensing algorithms such as Orthogonal Matching Pursuit (OMP), Subspace Pursuit (SP), Iterative Hard Thresholding (IHT), Basis Pursuit (BP) and Lasso, and result in the modified algorithms BLOOMP, BLOSP, BLOIHT, BP-BLOT and Lasso-BLOT, respectively. Under appropriate conditions, it is proved that BLOOMP can reconstruct sparse, widely separated objects up to one Rayleigh length in the Bottleneck distance independent of the grid spacing. One of the most distinguishing attributes of BLOOMP is its capability of dealing with large dynamic ranges. The BLO-based algorithms are systematically tested with respect to four performance metrics: dynamic range, noise stability, sparsity and resolution. With respect to dynamic range and noise stability, BLOOMP is the best performer. With respect to sparsity, BLOOMP is the best performer for high dynamic range while for dynamic range near unity BP-BLOT and Lasso-BLOT with the optimized regularization parameter have the best performance. In the noiseless case, BP-BLOT has the highest resolving power up to certain dynamic range. The algorithms BLOSP and BLOIHT are good alternatives to BLOOMP and BP/Lasso-BLOT: they are faster than both BLOOMP and BP/Lasso-BLOT and share, to a lesser degree, BLOOMP's amazing attribute with respect to dynamic range. Detailed comparisons with existing algorithms such as Spectral Iterative Hard Thresholding (SIHT) and the frame-adapted BP are given.

3.3 · LG · Dec 1, 2022

High Dimensional Binary Classification under Label Shift: Phase Transition and Regularization

Jiahui Cheng, Minshuo Chen, Hao Liu et al. · gatech

Label shift has been widely believed to be harmful to the generalization performance of machine learning models. Researchers have proposed many approaches to mitigate the impact of label shift, e.g., balancing the training data. However, these methods often consider the underparametrized regime, where the sample size is much larger than the data dimension. The research under the overparametrized regime is very limited. To bridge this gap, we propose a new asymptotic analysis of the Fisher Linear Discriminant classifier for binary classification with label shift. Specifically, we prove that there exists a phase transition phenomenon: under a certain overparametrized regime, the classifier trained using imbalanced data outperforms the counterpart with reduced balanced data. Moreover, we investigate the impact of regularization on label shift: the aforementioned phase transition vanishes as the regularization becomes strong.

10.9 · CV · May 11 · Code

GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

Yuecheng Liu, Junda Cheng, Longliang Liu, Wenjing Liao et al.

Video depth estimation extends monocular prediction into the temporal domain to ensure coherence. However, existing methods often suffer from spatial blurring in fine-detail regions and temporal inconsistencies. We argue that current approaches, which primarily rely on temporal smoothing via Transformers, struggle to maintain strict 3D geometric consistency, particularly under rotations or drastic view changes. To address this, we propose GemDepth, a framework built on the insight that an explicit awareness of camera motion and global 3D structure is a prerequisite for 3D consistency. Distinctively, GemDepth introduces a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings. This injection of motion priors equips the network with intrinsic 3D perception and alignment capabilities. Guided by these geometric cues, our Alternating Spatio-Temporal Transformer (ASTT) captures latent point-level correspondences to simultaneously enhance spatial precision for sharp details and enforce rigorous temporal consistency. Furthermore, GemDepth employs a data-efficient training strategy, effectively bridging the gap between high efficiency and robust geometric consistency. As shown in Fig. 2, comprehensive evaluations demonstrate that GemDepth achieves state-of-the-art performance across multiple datasets, particularly in complex dynamic scenarios. The code is publicly available at: https://github.com/Yuecheng919/GemDepth

10.3 · NA · Nov 6, 2022 · Code

WeakIdent: Weak formulation for Identifying Differential Equations using Narrow-fit and Trimming

Mengyi Tang, Wenjing Liao, Rachel Kuske et al.

Data-driven identification of differential equations is an interesting but challenging problem, especially when the given data are corrupted by noise. When the governing differential equation is a linear combination of various differential terms, the identification problem can be formulated as solving a linear system, with the feature matrix consisting of linear and nonlinear terms multiplied by a coefficient vector. This product is equal to the time derivative term, and thus generates dynamical behaviors. The goal is to identify the correct terms that form the equation to capture the dynamics of the given data. We propose a general and robust framework to recover differential equations using a weak formulation, for both ordinary and partial differential equations (ODEs and PDEs). The weak formulation facilitates an efficient and robust way to handle noise. For a robust recovery against noise and the choice of hyper-parameters, we introduce two new mechanisms, narrow-fit and trimming, for the coefficient support and value recovery, respectively. For each sparsity level, Subspace Pursuit is utilized to find an initial set of support from the large dictionary. Then, we focus on highly dynamic regions (rows of the feature matrix), and error-normalize the feature matrix in the narrow-fit step. The support is further updated via trimming of the terms that contribute the least. Finally, the support set of features with the smallest Cross-Validation error is chosen as the result. A comprehensive set of numerical experiments is presented for both systems of ODEs and PDEs with various noise levels. The proposed method gives a robust recovery of the coefficients, and a significant denoising effect which can handle up to $100\%$ noise-to-signal ratio for some equations. We compare the proposed method with several state-of-the-art algorithms for the recovery of differential equations.

5.3 · LG · Jun 26, 2023

Effective Minkowski Dimension of Deep Nonparametric Regression: Function Approximation and Statistical Theories

Zixuan Zhang, Minshuo Chen, Mengdi Wang et al.

Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to the intrinsic data structures. In real-world applications, such an assumption of data lying exactly on a low-dimensional manifold is stringent. This paper introduces a relaxed assumption that the input data are concentrated around a subset of $\mathbb{R}^d$ denoted by $\mathcal{S}$, and the intrinsic dimension of $\mathcal{S}$ can be characterized by a new complexity notion, the effective Minkowski dimension. We prove that the sample complexity of deep nonparametric regression only depends on the effective Minkowski dimension of $\mathcal{S}$ denoted by $p$. We further illustrate our theoretical findings by considering nonparametric regression with an anisotropic Gaussian random design $N(0,Σ)$, where $Σ$ is full rank. When the eigenvalues of $Σ$ have an exponential or polynomial decay, the effective Minkowski dimension of such a Gaussian random design is $p=\mathcal{O}(\sqrt{\log n})$ or $p=\mathcal{O}(n^γ)$, respectively, where $n$ is the sample size and $γ\in(0,1)$ is a small constant depending on the polynomial decay rate. Our theory shows that, when the manifold assumption does not hold, deep neural networks can still adapt to the effective Minkowski dimension of the data, and circumvent the curse of the ambient dimensionality for moderate sample sizes.

11.6 · ML · May 4, 2022

A Manifold Two-Sample Test Study: Integral Probability Metric with Neural Networks

Jie Wang, Minshuo Chen, Tuo Zhao et al.

Two-sample tests aim to determine whether two collections of observations follow the same distribution or not. We propose two-sample tests based on integral probability metric (IPM) for high-dimensional samples supported on a low-dimensional manifold. We characterize the properties of the proposed tests with respect to the number of samples $n$ and the structure of the manifold with intrinsic dimension $d$. When an atlas is given, we propose a two-step test to identify the difference between general distributions, which achieves the type-II risk in the order of $n^{-1/\max\{d,2\}}$. When an atlas is not given, we propose a Hölder IPM test that applies to data distributions with $(s,β)$-Hölder densities, which achieves the type-II risk in the order of $n^{-(s+β)/d}$. To mitigate the heavy computation burden of evaluating the Hölder IPM, we approximate the Hölder function class using neural networks. Based on the approximation theory of neural networks, we show that the neural network IPM test has the type-II risk in the order of $n^{-(s+β)/d}$, which is in the same order of the type-II risk as the Hölder IPM test. Our proposed tests are adaptive to low-dimensional geometric structure because their performance crucially depends on the intrinsic dimension instead of the data dimension.

15.6 · ML · Feb 25, 2023

On Deep Generative Models for Approximation and Estimation of Distributions on Manifolds

Biraj Dahal, Alex Havrilla, Minshuo Chen et al.

Generative networks have experienced great empirical successes in distribution learning. Many existing experiments have demonstrated that generative networks can generate high-dimensional complex data from a low-dimensional easy-to-sample distribution. However, this phenomenon cannot be justified by existing theories. The widely held manifold hypothesis speculates that real-world data sets, such as natural images and signals, exhibit low-dimensional geometric structures. In this paper, we take such low-dimensional data structures into consideration by assuming that data distributions are supported on a low-dimensional manifold. We prove statistical guarantees of generative networks under the Wasserstein-1 loss. We show that the Wasserstein-1 loss converges to zero at a fast rate depending on the intrinsic dimension instead of the ambient data dimension. Our theory leverages the low-dimensional geometric structures in data sets and justifies the practical power of generative networks. We require no smoothness assumptions on the data distribution, which is desirable in practice.

8.2 · LG · May 26

Neural Scaling Laws of Deep ReLU and Deep Operator Network: A Theoretical Study

Hao Liu, Zecheng Zhang, Wenjing Liao et al.

Neural scaling laws play a pivotal role in the performance of deep neural networks and have been observed in a wide range of tasks. However, a complete theoretical framework for understanding these scaling laws remains underdeveloped. In this paper, we explore the neural scaling laws for deep operator networks, which involve learning mappings between function spaces, with a focus on the Chen and Chen style architecture. These approaches, which include the popular Deep Operator Network (DeepONet), approximate the output functions using a linear combination of learnable basis functions and coefficients that depend on the input functions. We establish a theoretical framework to quantify the neural scaling laws by analyzing the approximation and generalization errors of these networks. We articulate the relationship between the approximation and generalization errors of deep operator networks and key factors such as network model size and training data size. Moreover, we address cases where input functions exhibit low-dimensional structures, allowing us to derive tighter error bounds. These results also hold for deep ReLU networks and other similar structures. Our results offer a partial explanation of the neural scaling laws in operator learning and provide a theoretical foundation for their applications.

6.3 · LG · May 6

Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer

Alexander Hsu, Zhaiming Shen, Wenjing Liao et al.

Pre-trained transformers are able to learn from examples provided as part of the prompt without any weight updates, a remarkable ability known as in-context learning (ICL). Despite its demonstrated efficacy across various domains, the theoretical understanding of ICL is still developing. Whereas most existing theory has focused on linear models, we study ICL in the nonlinear regression setting. Through the interaction mechanism in attention, we explicitly construct transformer networks to realize nonlinear features, such as polynomial or spline bases, which span a wide class of functions. Based on this construction, we establish a framework to analyze end-to-end in-context nonlinear regression with the constructed features. Our theory provides finite-sample generalization error bounds in terms of context length and training set size. We numerically validate the theory on synthetic regression tasks.

5.8 · ML · May 9

Learning Theory of Transformers: Local-to-Global Approximation via Softmax Partition of Unity

Zhongjie Shi, Wenjing Liao

This paper investigates the learning theory of Transformer networks for regression tasks on the compact Euclidean domain $[0,1]^d$ and $d$-dimensional compact Riemannian manifolds. We propose a novel constructive approximation framework for Transformers that builds local approximations of the target function and aggregates them into a global approximation via softmax partition of unity. This approach leverages the attention mechanism to achieve spatial localization through affine transformations of the input. The softmax activation plays a crucial role in aggregating local approximations to a global output. From an approximation perspective, we prove that a dense Transformer equipped with only two encoder blocks and standard single-hidden-layer point-wise feed-forward networks can achieve a uniform $\varepsilon$-approximation error for $α$-Hölder continuous functions with $α\in (0,1]$ using $\mathcal{O}(\varepsilon^{-d/α})$ total parameters. Building upon this approximation guarantee, we establish a near minimax-optimal generalization error bound of order $\mathcal{O}\big(n^{-\frac{2α}{2α+d}} \log n\big)$ for the empirical risk minimizer, where $n$ is the training data size. The Transformer architecture studied in this paper is dense, shallow and wide, and employs softmax activation and sinusoidal positional encodings, closely reflecting practical implementations.
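The local-to-global idea can be illustrated in one dimension with a small NumPy sketch. It is a toy under stated assumptions and not the paper's construction: the centers, the sharpness `beta`, the piecewise-constant local approximations $f(c_i)$, and the names `softmax_weights` and `local_to_global` are all made up for the example. The one fact it relies on is that $-\beta (x - c_i)^2$ differs from the affine score $2\beta c_i x - \beta c_i^2$ only by $-\beta x^2$, which is the same for every $i$ and therefore cancels inside the softmax.

```python
import numpy as np

def softmax_weights(x, centers, beta):
    """Softmax partition of unity over `centers`, evaluated at points `x`.

    Scores are affine in x: 2*beta*c*x - beta*c**2. They give the same
    softmax as -beta*(x - c)**2 because the shared -beta*x**2 term cancels.
    """
    scores = 2.0 * beta * np.outer(x, centers) - beta * centers**2
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

def local_to_global(f, x, centers, beta):
    """Blend the local (constant) approximations f(c_i) with softmax weights."""
    return softmax_weights(x, centers, beta) @ f(centers)

# Illustrative setup: 65 equally spaced centers on [0, 1].
centers = np.linspace(0.0, 1.0, 65)
h = centers[1] - centers[0]
beta = 4.0 / h**2  # assumed sharpness; larger beta means more localized weights

def target(t):
    return np.sin(2.0 * np.pi * t)

x_demo = np.linspace(0.0, 1.0, 1001)
w_demo = softmax_weights(x_demo, centers, beta)
approx_demo = local_to_global(target, x_demo, centers, beta)
max_err = np.max(np.abs(approx_demo - target(x_demo)))
```

Each row of `w_demo` is nonnegative and sums to one, so the weights form a partition of unity; shrinking the spacing `h` (with `beta` scaled as above) makes the blend track the target more closely.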

5.2 · CV · Nov 30, 2023 · Code

DFU: scale-robust diffusion model for zero-shot super-resolution image generation

Alex Havrilla, Kevin Rojas, Wenjing Liao et al.

Diffusion generative models have achieved remarkable success in generating images with a fixed resolution. However, existing models have limited ability to generalize to different resolutions when training data at those resolutions are not available. Leveraging techniques from operator learning, we present a novel deep-learning architecture, Dual-FNO UNet (DFU), which approximates the score operator by combining both spatial and spectral information at multiple resolutions. Comparisons of DFU to baselines demonstrate its scalability: 1) simultaneously training on multiple resolutions improves FID over training at any single fixed resolution; 2) DFU generalizes beyond its training resolutions, allowing for coherent, high-fidelity generation at higher resolutions with the same model, i.e. zero-shot super-resolution image generation; 3) we propose a fine-tuning strategy to further enhance the zero-shot super-resolution image-generation capability of our model, leading to an FID of 11.3 at 1.66 times the maximum training resolution on FFHQ, which no other method can come close to achieving.

20.3 · LG · Nov 11, 2024 · Code

Understanding Scaling Laws with Statistical and Approximation Theory for Transformer Neural Networks on Intrinsically Low-dimensional Data

Alex Havrilla, Wenjing Liao

When training deep neural networks, a model's generalization error is often observed to follow a power scaling law dependent both on the model size and the data size. Perhaps the best known example of such scaling laws is for transformer-based large language models, where networks with billions of parameters are trained on trillions of tokens of text. Yet, despite sustained widespread interest, a rigorous understanding of why transformer scaling laws exist is still missing. To answer this question, we establish novel statistical estimation and mathematical approximation theories for transformers when the input data are concentrated on a low-dimensional manifold. Our theory predicts a power law between the generalization error and both the training data size and the network size for transformers, where the power depends on the intrinsic dimension $d$ of the training data. Notably, the constructed model architecture is shallow, requiring only logarithmic depth in $d$. By leveraging low-dimensional data structures under a manifold hypothesis, we are able to explain transformer scaling laws in a way which respects the data geometry. Moreover, we test our theory with empirical observation by training LLMs on natural language datasets. We find the observed empirical data scaling laws closely agree with our theoretical predictions. Taken together, these results rigorously show the intrinsic dimension of data to be a crucial quantity affecting transformer scaling laws in both theory and practice.

11.4 · LG · Mar 11, 2025

Coefficient-to-Basis Network: A Fine-Tunable Operator Learning Framework for Inverse Problems with Adaptive Discretizations and Theoretical Guarantees

Zecheng Zhang, Hao Liu, Wenjing Liao et al.

We propose a Coefficient-to-Basis Network (C2BNet), a novel framework for solving inverse problems within the operator learning paradigm. C2BNet efficiently adapts to different discretizations through fine-tuning, using a pre-trained model to significantly reduce computational cost while maintaining high accuracy. Unlike traditional approaches that require retraining from scratch for new discretizations, our method enables seamless adaptation without sacrificing predictive performance. Furthermore, we establish theoretical approximation and generalization error bounds for C2BNet by exploiting low-dimensional structures in the underlying datasets. Our analysis demonstrates that C2BNet adapts to low-dimensional structures without relying on explicit encoding mechanisms, highlighting its robustness and efficiency. To validate our theoretical findings, we conducted extensive numerical experiments that showcase the superior performance of C2BNet on several inverse problems. The results confirm that C2BNet effectively balances computational efficiency and accuracy, making it a promising tool to solve inverse problems in scientific computing and engineering applications.

11.4 · LG · May 6, 2025

Transformers for Learning on Noisy and Task-Level Manifolds: Approximation and Generalization Insights

Zhaiming Shen, Alex Havrilla, Rongjie Lai et al.

Transformers serve as the foundational architecture for large language and video generation models, such as GPT, BERT, SORA and their successors. Empirical studies have demonstrated that real-world data and learning tasks exhibit low-dimensional structures, along with some noise or measurement error. The performance of transformers tends to depend on the intrinsic dimension of the data/tasks, though a theoretical understanding of this dependence remains largely undeveloped for transformers. This work establishes a theoretical foundation by analyzing the performance of transformers for regression tasks involving noisy input data on a manifold. Specifically, the input data are in a tubular neighborhood of a manifold, while the ground truth function depends on the projection of the noisy data onto the manifold. We prove approximation and generalization error bounds which crucially depend on the intrinsic dimension of the manifold. Our results demonstrate that transformers can leverage low-complexity structures in learning tasks even when the input data are perturbed by high-dimensional noise. Our novel proof technique constructs representations of basic arithmetic operations by transformers, which may hold independent interest.

4.1 · LG · May 14, 2025

Single-shot prediction of parametric partial differential equations

Khalid Rafiq, Wenjing Liao, Aditya G. Nair

We introduce Flexi-VAE, a data-driven framework for efficient single-shot forecasting of nonlinear parametric partial differential equations (PDEs), eliminating the need for iterative time-stepping while maintaining high accuracy and stability. Flexi-VAE incorporates a neural propagator that advances latent representations forward in time, aligning latent evolution with physical state reconstruction in a variational autoencoder setting. We evaluate two propagation strategies, the Direct Concatenation Propagator (DCP) and the Positional Encoding Propagator (PEP), and demonstrate, through representation-theoretic analysis, that DCP offers superior long-term generalization by fostering disentangled and physically meaningful latent spaces. Geometric diagnostics, including Jacobian spectral analysis, reveal that propagated latent states reside in regions of lower decoder sensitivity and more stable local geometry than those derived via direct encoding, enhancing robustness for long-horizon predictions. We validate Flexi-VAE on canonical PDE benchmarks, the 1D viscous Burgers equation and the 2D advection-diffusion equation, achieving accurate forecasts across wide parametric ranges. The model delivers over 50x CPU and 90x GPU speedups compared to autoencoder-LSTM baselines for large temporal shifts. These results position Flexi-VAE as a scalable and interpretable surrogate modeling tool for accelerating high-fidelity simulations in computational fluid dynamics (CFD) and other parametric PDE-driven applications, with extensibility to higher-dimensional and more complex systems.

9.2 · ML · Jun 8, 2024

Deep Neural Networks are Adaptive to Function Regularity and Data Distribution in Approximation and Estimation

Hao Liu, Jiahui Cheng, Wenjing Liao

Deep learning has exhibited remarkable results across diverse areas. To understand its success, substantial research has been directed towards its theoretical foundations. Nevertheless, the majority of these studies examine how well deep neural networks can model functions with uniform regularity. In this paper, we explore a different angle: how deep neural networks can adapt to different regularity in functions across different locations and scales and nonuniform data distributions. More precisely, we focus on a broad class of functions defined by nonlinear tree-based approximation. This class encompasses a range of function types, such as functions with uniform regularity and discontinuous functions. We develop nonparametric approximation and estimation theories for this function class using deep ReLU networks. Our results show that deep neural networks are adaptive to different regularity of functions and nonuniform data distributions at different locations and scales. We apply our results to several function classes, and derive the corresponding approximation and generalization errors. The validity of our results is demonstrated through numerical experiments.

14.2 · LG · Jan 19, 2024

Generalization Error Guaranteed Auto-Encoder-Based Nonlinear Model Reduction for Operator Learning

Hao Liu, Biraj Dahal, Rongjie Lai et al.

Many physical processes in science and engineering are naturally represented by operators between infinite-dimensional function spaces. The problem of operator learning, in this context, seeks to extract these physical processes from empirical data, which is challenging due to the infinite or high dimensionality of data. An integral component in addressing this challenge is model reduction, which reduces both the data dimensionality and problem size. In this paper, we utilize low-dimensional nonlinear structures in model reduction by investigating Auto-Encoder-based Neural Network (AENet). AENet first learns the latent variables of the input data and then learns the transformation from these latent variables to corresponding output data. Our numerical experiments validate the ability of AENet to accurately learn the solution operator of nonlinear partial differential equations. Furthermore, we establish a mathematical and statistical estimation theory that analyzes the generalization error of AENet. Our theoretical framework shows that the sample complexity of training AENet is intricately tied to the intrinsic dimension of the modeled process, while also demonstrating the remarkable resilience of AENet to noise.

23.4 · ML · Jan 1, 2022

Deep Nonparametric Estimation of Operators between Infinite Dimensional Spaces

Hao Liu, Haizhao Yang, Minshuo Chen et al.

Learning operators between infinite-dimensional spaces is an important learning task arising in a wide range of applications, including machine learning, imaging science, and mathematical modeling and simulation. This paper studies the nonparametric estimation of Lipschitz operators using deep neural networks. Non-asymptotic upper bounds are derived for the generalization error of the empirical risk minimizer over a properly chosen network class. Under the assumption that the target operator exhibits a low-dimensional structure, our error bounds decay as the training sample size increases, with an attractive fast rate depending on the intrinsic dimension in our estimation. Our assumptions cover most scenarios in real applications and our results give rise to fast rates by exploiting low-dimensional structures of data in operator estimation. We also investigate the influence of network structures (e.g., network width, depth, and sparsity) on the generalization error of the neural network estimator and propose a general suggestion on the choice of network structures to maximize the learning efficiency quantitatively.

19.0 · ML · Sep 7, 2021

Besov Function Approximation and Binary Classification on Low-Dimensional Manifolds Using Convolutional Residual Networks

Hao Liu, Minshuo Chen, Tuo Zhao et al.

Most existing statistical theories on deep neural networks have sample complexities cursed by the data dimension and therefore cannot adequately explain the empirical success of deep learning on high-dimensional data. To bridge this gap, we propose to exploit low-dimensional geometric structures of real-world data sets. We establish theoretical guarantees of convolutional residual networks (ConvResNets) in terms of function approximation and statistical estimation for binary classification. Specifically, given data lying on a $d$-dimensional manifold isometrically embedded in $\mathbb{R}^D$, we prove that if the network architecture is properly chosen, ConvResNets can (1) approximate Besov functions on manifolds with arbitrary accuracy, and (2) learn a classifier by minimizing the empirical logistic risk, which gives an excess risk in the order of $n^{-\frac{s}{2s+2(s\vee d)}}$, where $s$ is a smoothness parameter. This implies that the sample complexity depends on the intrinsic dimension $d$, instead of the data dimension $D$. Our results demonstrate that ConvResNets are adaptive to low-dimensional structures of data sets.
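
As a rough illustration of the excess-risk rate quoted in this abstract, the sketch below evaluates the exponent $\frac{s}{2s+2(s\vee d)}$ for given $s$ and $d$. The function names are hypothetical, and constants and any logarithmic factors are ignored, so it only shows how the stated order behaves, not how the bound is derived.

```python
def excess_risk_exponent(s: float, d: int) -> float:
    """Exponent a in the stated order n^{-a}, with a = s / (2s + 2*max(s, d))."""
    return s / (2 * s + 2 * max(s, d))


def excess_risk_order(n: int, s: float, d: int) -> float:
    """Stated order of the excess risk, n^{-a}, with all constants dropped."""
    return n ** (-excess_risk_exponent(s, d))


# Hypothetical values chosen only for illustration.
for s, d in [(1.0, 3), (2.0, 2), (4.0, 2)]:
    print(f"s={s}, d={d}: exponent={excess_risk_exponent(s, d):.4f}")
```

Under the formula as quoted, the exponent equals 1/4 whenever $s \ge d$, and the ambient dimension $D$ does not appear at all.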

10.2 · ML · Jan 13, 2021

Multiscale regression on unknown manifolds

Wenjing Liao, Mauro Maggioni, Stefano Vigogna

We consider the regression problem of estimating functions on $\mathbb{R}^D$ but supported on a $d$-dimensional manifold $\mathcal{M} \subset \mathbb{R}^D$ with $d \ll D$. Drawing ideas from multi-resolution analysis and nonlinear approximation, we construct low-dimensional coordinates on $\mathcal{M}$ at multiple scales, and perform multiscale regression by local polynomial fitting. We propose a data-driven wavelet thresholding scheme that automatically adapts to the unknown regularity of the function, allowing for efficient estimation of functions exhibiting nonuniform regularity at different locations and scales. We analyze the generalization error of our method by proving finite sample bounds with high probability on rich classes of priors. Our estimator attains optimal learning rates (up to logarithmic factors) as if the function were defined on a known Euclidean domain of dimension $d$, instead of an unknown manifold embedded in $\mathbb{R}^D$. The implemented algorithm has quasilinear complexity in the sample size, with constants linear in $D$ and exponential in $d$. Our work therefore establishes a new framework for regression on low-dimensional sets embedded in high dimensions, with fast implementation and strong theoretical guarantees.

9.0 · LG · Nov 3, 2020

Doubly Robust Off-Policy Learning on Low-Dimensional Manifolds by Deep Neural Networks

Minshuo Chen, Hao Liu, Wenjing Liao et al.

Causal inference explores the causation between actions and the consequent rewards on a covariate set. Recently, deep learning has achieved remarkable performance in causal inference, but existing statistical theories cannot adequately explain this empirical success, especially when the covariates are high-dimensional. Most theoretical results in causal inference are asymptotic, suffer from the curse of dimensionality, and only work for the finite-action scenario. To bridge this gap between theory and practice, this paper studies doubly robust off-policy learning by deep neural networks. When the covariates lie on a low-dimensional manifold, we prove nonasymptotic regret bounds, which converge at a fast rate depending on the intrinsic dimension of the manifold. Our results cover both the finite- and continuous-action scenarios. Our theory shows that deep neural networks are adaptive to the low-dimensional geometric structures of the covariates, and partially explains the success of deep learning for causal inference.

18.9 · LG · Feb 10, 2020

Distribution Approximation and Statistical Estimation Guarantees of Generative Adversarial Networks

Minshuo Chen, Wenjing Liao, Hongyuan Zha et al.

Generative Adversarial Networks (GANs) have achieved great success in unsupervised learning. Despite their remarkable empirical performance, there are limited theoretical studies on the statistical properties of GANs. This paper provides approximation and statistical guarantees of GANs for the estimation of data distributions that have densities in a Hölder space. Our main result shows that, if the generator and discriminator network architectures are properly chosen, GANs are consistent estimators of data distributions under strong discrepancy metrics, such as the Wasserstein-1 distance. Furthermore, when the data distribution exhibits low-dimensional structures, we show that GANs are capable of capturing the unknown low-dimensional structures in data and enjoy a fast statistical convergence, which is free of the curse of the ambient dimensionality. Our analysis for low-dimensional data builds upon a universal approximation theory of neural networks with Lipschitz continuity guarantees, which may be of independent interest.

3.3 · ST · Jan 22, 2020

Learning functions varying along a central subspace

Hao Liu, Wenjing Liao

Many functions of interest are in a high-dimensional space but exhibit low-dimensional structures. This paper studies regression of an $s$-Hölder function $f$ in $\mathbb{R}^D$ which varies along a central subspace of dimension $d$ with $d\ll D$. A direct approximation of $f$ in $\mathbb{R}^D$ with accuracy $\varepsilon$ requires a number of samples $n$ in the order of $\varepsilon^{-(2s+D)/s}$. In this paper, we analyze the Generalized Contour Regression (GCR) algorithm for the estimation of the central subspace and use piecewise polynomials for function approximation. GCR is among the best estimators for the central subspace, but its sample complexity is an open question. We prove that GCR leads to a mean squared estimation error of $O(n^{-1})$ for the central subspace, if a variance quantity is exactly known. The estimation error of this variance quantity is also given in this paper. The mean squared regression error of $f$ is proved to be in the order of $\left(n/\log n\right)^{-\frac{2s}{2s+d}}$ where the exponent depends on the dimension of the central subspace $d$ instead of the ambient space $D$. This result demonstrates that GCR is effective in learning the low-dimensional central subspace. We also propose a modified GCR with improved efficiency. The convergence rate is validated through several numerical experiments.
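
To make the two orders in this abstract concrete, here is a minimal sketch that evaluates the direct sample requirement $\varepsilon^{-(2s+D)/s}$ and the regression error order $\left(n/\log n\right)^{-\frac{2s}{2s+d}}$. The function names and the numeric values are assumptions made for illustration, and all constants are dropped, so the numbers indicate scaling only.

```python
import math


def direct_sample_order(eps: float, s: float, D: int) -> float:
    """Order of samples for direct approximation in R^D: eps^{-(2s + D)/s}."""
    return eps ** (-(2 * s + D) / s)


def central_subspace_error_order(n: int, s: float, d: int) -> float:
    """Order of the mean squared regression error: (n / log n)^{-2s/(2s + d)}."""
    return (n / math.log(n)) ** (-2 * s / (2 * s + d))


# Hypothetical setting: smoothness s = 2, ambient dimension D = 100,
# central subspace dimension d = 2.
s, D, d = 2.0, 100, 2
print("direct samples for eps = 0.1:", direct_sample_order(0.1, s, D))
for n in (10**3, 10**4, 10**5):
    print(n, central_subspace_error_order(n, s, d))
```

The first quantity grows exponentially in $D$, while the second depends only on $d$, which is the contrast the abstract draws.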

22.0 · LG · Aug 5, 2019

Nonparametric Regression on Low-Dimensional Manifolds using Deep ReLU Networks : Function Approximation and Statistical Recovery

Minshuo Chen, Haoming Jiang, Wenjing Liao et al.

Real-world data often exhibit low-dimensional geometric structures, and can be viewed as samples near a low-dimensional manifold. This paper studies nonparametric regression of Hölder functions on low-dimensional manifolds using deep ReLU networks. Suppose $n$ training data are sampled from a Hölder function in $\mathcal{H}^{s,\alpha}$ supported on a $d$-dimensional Riemannian manifold isometrically embedded in $\mathbb{R}^D$, with sub-gaussian noise. A deep ReLU network architecture is designed to estimate the underlying function from the training data. The mean squared error of the empirical estimator is proved to converge in the order of $n^{-\frac{2(s+\alpha)}{2(s+\alpha) + d}}\log^3 n$. This result shows that deep ReLU networks give rise to a fast convergence rate depending on the data intrinsic dimension $d$, which is usually much smaller than the ambient dimension $D$. It therefore demonstrates the adaptivity of deep ReLU networks to low-dimensional geometric structures of data, and partially explains the power of deep ReLU networks in tackling high-dimensional data with low-dimensional geometric structures.

3.3 · NA · Apr 6, 2019

IDENT: Identifying Differential Equations with Numerical Time evolution

Sung Ha Kang, Wenjing Liao, Yingjie Liu

Identifying unknown differential equations from a given set of discrete time-dependent data is a challenging problem. A small amount of noise can make the recovery unstable, and nonlinearity and differential equations with varying coefficients add complexity to the problem. We assume that the governing partial differential equation (PDE) is a linear combination of a subset of a prescribed dictionary containing different differential terms, and the objective of this paper is to find the correct coefficients. We propose a new direction based on the fundamental idea of convergence analysis of numerical PDE schemes. We utilize Lasso for efficiency, and a performance guarantee is established based on an incoherence property. The main contribution is to validate and correct the results by Time Evolution Error (TEE). The new algorithm, called Identifying Differential Equations with Numerical Time evolution (IDENT), is explored for data with non-periodic boundary conditions, noisy data and PDEs with varying coefficients. From the recovery analysis of Lasso, we propose a new definition of Noise-to-Signal ratio, which better represents the level of noise in the case of PDE identification. We systematically analyze the effects of data generation and downsampling, and propose an order-preserving denoising method called Least-Squares Moving Average (LSMA), to preprocess the given data. For the identification of PDEs with varying coefficients, we propose to add Base Element Expansion (BEE) to aid the computation. Various numerical experiments from basic tests to noisy data, downsampling effects and varying coefficients are presented.
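
The dictionary-plus-Lasso step described in this abstract can be sketched as follows. This is not the IDENT implementation: it omits the Time Evolution Error, LSMA, and BEE steps, uses a generic coordinate-descent Lasso solver, and works on a hypothetical noise-free transport example $u_t = -c\,u_x$ whose dictionary columns ($u$, $u_x$, $u_{xx}$) are evaluated analytically rather than by finite differences. All function names, the test solution, and the penalty value are assumptions.

```python
import math


def soft_threshold(z: float, lam: float) -> float:
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(columns, y, lam, sweeps=20):
    """Minimize 0.5 * ||y - sum_j w[j] * columns[j]||^2 + lam * ||w||_1."""
    w = [0.0] * len(columns)
    residual = list(y)  # equals y - X w, and w starts at zero
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            norm_sq = sum(c * c for c in col)
            # Correlate column j with the residual that excludes its own term.
            rho = sum(c * (r + w[j] * c) for c, r in zip(col, residual))
            new_wj = soft_threshold(rho, lam) / norm_sq
            delta = new_wj - w[j]
            residual = [r - delta * c for c, r in zip(col, residual)]
            w[j] = new_wj
    return w


def synthetic_transport_data(c=1.5, n_x=64, times=(0.0, 0.1, 0.2, 0.3, 0.4)):
    """Sample u = sin(x - ct) + 0.5 sin(2(x - ct)) on a periodic grid (assumed example)."""
    u, u_x, u_xx, u_t = [], [], [], []
    for t in times:
        for i in range(n_x):
            theta = 2 * math.pi * i / n_x - c * t
            u.append(math.sin(theta) + 0.5 * math.sin(2 * theta))
            u_x.append(math.cos(theta) + math.cos(2 * theta))
            u_xx.append(-math.sin(theta) - 2 * math.sin(2 * theta))
            u_t.append(-c * (math.cos(theta) + math.cos(2 * theta)))
    return [u, u_x, u_xx], u_t


# Regress u_t onto the dictionary [u, u_x, u_xx]; only u_x should be selected.
dictionary, u_t = synthetic_transport_data()
coeffs = lasso_coordinate_descent(dictionary, u_t, lam=0.1)
print(dict(zip(["u", "u_x", "u_xx"], coeffs)))
```

In this idealized setting the coefficient on $u_x$ comes out close to $-c$, slightly shrunk by the $\ell_1$ penalty, while the other two terms are set to zero. With noisy data and numerical derivatives the selection is less clean, which is the situation the validation step in the abstract is meant to address.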

13.6 · ML · Nov 3, 2016

Adaptive Geometric Multiscale Approximations for Intrinsically Low-dimensional Data

Wenjing Liao, Mauro Maggioni

We consider the problem of efficiently approximating and encoding high-dimensional data sampled from a probability distribution $\rho$ in $\mathbb{R}^D$ that is nearly supported on a $d$-dimensional set $\mathcal{M}$, for example a $d$-dimensional Riemannian manifold. Geometric Multi-Resolution Analysis (GMRA) provides a robust and computationally efficient procedure to construct low-dimensional geometric approximations of $\mathcal{M}$ at varying resolutions. We introduce a thresholding algorithm on the geometric wavelet coefficients, leading to what we call adaptive GMRA approximations. We show that these data-driven, empirical approximations perform well, when the threshold is chosen as a suitable universal function of the number of samples $n$, on a wide variety of measures $\rho$ that are allowed to exhibit different regularity at different scales and locations, thereby efficiently encoding data from more complex measures than those supported on manifolds. These approximations yield a data-driven dictionary, together with a fast transform mapping data to coefficients, and an inverse of such a map. The algorithms for both the dictionary construction and the transforms have complexity $C n \log n$ with the constant linear in $D$ and exponential in $d$. Our work therefore establishes adaptive GMRA as a fast dictionary learning algorithm with approximation guarantees. We include several numerical experiments on both synthetic and real data, confirming our theoretical results and demonstrating the effectiveness of adaptive GMRA.