NASep 2, 2008
Symmetric tensors and symmetric tensor rankPierre Comon, Gene Golub, Lek-Heng Lim et al.
A symmetric tensor is a higher order generalization of a symmetric matrix. In this paper, we study various properties of symmetric tensors in relation to a decomposition into a sum of symmetric outer product of vectors. A rank-1 order-k tensor is the outer product of k non-zero vectors. Any symmetric tensor can be decomposed into a linear combination of rank-1 tensors, each of them being symmetric or not. The rank of a symmetric tensor is the minimal number of rank-1 tensors that is necessary to reconstruct it. The symmetric rank is obtained when the constituting rank-1 tensors are imposed to be themselves symmetric. It is shown that rank and symmetric rank are equal in a number of cases, and that they always exist in an algebraically closed field. We will discuss the notion of the generic symmetric rank, which, due to the work of Alexander and Hirschowitz, is now known for any values of dimension and order. We will also show that the set of symmetric tensors of symmetric rank at most r is not closed, unless r = 1.
NAMay 24, 2013
Multiarray Signal Processing: Tensor decomposition meets compressed sensingLek-Heng Lim, Pierre Comon
We discuss how recently discovered techniques and tools from compressed sensing can be used in tensor decompositions, with a view towards modeling signals from multiple arrays of multiple sensors. We show that with appropriate bounds on a measure of separation between radiating sources called coherence, one could always guarantee the existence and uniqueness of a best rank-r approximation of the tensor representing the signal. We also deduce a computationally feasible variant of Kruskal's uniqueness condition, where the coherence appears as a proxy for k-rank. Problems of sparsest recovery with an infinite continuous dictionary, lowest-rank tensor representation, and blind source separation are treated in a uniform fashion. The decomposition of the measurement tensor leads to simultaneous localization and extraction of radiating sources, in an entirely deterministic manner.
NAFeb 23, 2011
Rank Aggregation via Nuclear Norm MinimizationDavid F. Gleich, Lek-Heng Lim
The process of rank aggregation is intimately intertwined with the structure of skew-symmetric matrices. We apply recent advances in the theory and algorithms of matrix completion to skew-symmetric matrices. This combination of ideas produces a new method for ranking a set of items. The essence of our idea is that a rank aggregation describes a partially filled skew-symmetric matrix. We extend an algorithm for matrix completion to handle skew-symmetric data and use that to extract ranks for each item. Our algorithm applies to both pairwise comparison and rating data. Because it is based on matrix completion, it is robust to both noise and incomplete data. We show a formal recovery result for the noiseless case and present a detailed study of the algorithm on synthetic data and Netflix ratings.
NASep 4, 2014
Multilinear PageRankDavid F. Gleich, Lek-Heng Lim, Yongyang Yu
In this paper, we first extend the celebrated PageRank modification to a higher-order Markov chain. Although this system has attractive theoretical properties, it is computationally intractable for many interesting problems. We next study a computationally tractable approximation to the higher-order PageRank vector that involves a system of polynomial equations called multilinear PageRank, which is a type of tensor PageRank vector. It is motivated by a novel "spacey random surfer" model, where the surfer remembers bits and pieces of history and is influenced by this information. The underlying stochastic process is an instance of a vertex-reinforced random walk. We develop convergence theory for a simple fixed-point method, a shifted fixed-point method, and a Newton iteration in a particular parameter regime. In marked contrast to the case of the PageRank vector of a Markov chain where the solution is always unique and easy to compute, there are parameter regimes of multilinear PageRank where solutions are not unique and simple algorithms do not converge. We provide a repository of these non-convergent cases that we encountered through exhaustive enumeration and randomly sampling that we believe is useful for future study of the problem.
NAFeb 15, 2016
Uniqueness of Nonnegative Tensor ApproximationsYang Qi, Pierre Comon, Lek-Heng Lim
We show that for a nonnegative tensor, a best nonnegative rank-r approximation is almost always unique, its best rank-one approximation may always be chosen to be a best nonnegative rank-one approximation, and that the set of nonnegative tensors with non-unique best rank-one approximations form an algebraic hypersurface. We show that the last part holds true more generally for real tensors and thereby determine a polynomial equation so that a real or nonnegative tensor which does not satisfy this equation is guaranteed to have a unique best rank-one approximation. We also establish an analogue for real or nonnegative symmetric tensors. In addition, we prove a singular vector variant of the Perron--Frobenius Theorem for positive tensors and apply it to show that a best nonnegative rank-r approximation of a positive tensor can never be obtained by deflation. As an aside, we verify that the Euclidean distance (ED) discriminants of the Segre variety and the Veronese variety are hypersurfaces and give defining equations of these ED discriminants.
NADec 26, 2016
The Spacey Random Walk: a Stochastic Process for Higher-order DataAustin R. Benson, David F. Gleich, Lek-Heng Lim
Random walks are a fundamental model in applied mathematics and are a common example of a Markov chain. The limiting stationary distribution of the Markov chain represents the fraction of the time spent in each state during the stochastic process. A standard way to compute this distribution for a random walk on a finite set of states is to compute the Perron vector of the associated transition matrix. There are algebraic analogues of this Perron vector in terms of transition probability tensors of higher-order Markov chains. These vectors are nonnegative, have dimension equal to the dimension of the state space, and sum to one and are derived by making an algebraic substitution in the equation for the joint-stationary distribution of a higher-order Markov chains. Here, we present the spacey random walk, a non-Markovian stochastic process whose stationary distribution is given by the tensor eigenvector. The process itself is a vertex-reinforced random walk, and its discrete dynamics are related to a continuous dynamical system. We analyze the convergence properties of these dynamics and discuss numerical methods for computing the stationary distribution. Finally, we provide several applications of the spacey random walk model in population genetics, ranking, and clustering data, and we use the process to analyze taxi trajectory data in New York. This example shows definite non-Markovian structure.
NAJun 16, 2016
Schubert varieties and distances between subspaces of different dimensionsKe Ye, Lek-Heng Lim
We resolve a basic problem on subspace distances that often arises in applications: How can the usual Grassmann distance between equidimensional subspaces be extended to subspaces of different dimensions? We show that a natural solution is given by the distance of a point to a Schubert variety within the Grassmannian. This distance reduces to the Grassmann distance when the subspaces are equidimensional and does not depend on any embedding into a larger ambient space. Furthermore, it has a concrete expression involving principal angles, and is efficiently computable in numerically stable ways. Our results are largely independent of the Grassmann distance --- if desired, it may be substituted by any other common distances between subspaces. Our approach depends on a concrete algebraic geometric view of the Grassmannian that parallels the differential geometric perspective that is well-established in applied and computational mathematics.
NAJun 4, 2018
Geometric distance between positive definite matrices of different dimensionsLek-Heng Lim, Rodolphe Sepulchre, Ke Ye
We show how the Riemannian distance on $\mathbb{S}^n_{++}$, the cone of $n\times n$ real symmetric or complex Hermitian positive definite matrices, may be used to naturally define a distance between two such matrices of different dimensions. Given that $\mathbb{S}^n_{++}$ also parameterizes $n$-dimensional ellipsoids, and inner products on $\mathbb{R}^n$, $n \times n$ covariance matrices of nondegenerate probability distributions, this gives us a natural way to define a geometric distance between a pair of such objects of different dimensions.
NAFeb 9, 2019
Tensor network ranksKe Ye, Lek-Heng Lim
In problems involving approximation, completion, denoising, dimension reduction, estimation, interpolation, modeling, order reduction, regression, etc, we argue that the near-universal practice of assuming that a function, matrix, or tensor (which we will see are all the same object in this context) has \emph{low rank} may be ill-justified. There are many natural instances where the object in question has high rank with respect to the classical notions of rank: matrix rank, tensor rank, multilinear rank --- the latter two being the most straightforward generalizations of the former. To remedy this, we show that one may vastly expand these classical notions of ranks: Given any undirected graph $G$, there is a notion of $G$-rank associated with $G$, which provides us with as many different kinds of ranks as there are undirected graphs. In particular, the popular tensor network states in physics (e.g., \textsc{mps}, \textsc{ttns}, \textsc{peps}) may be regarded as functions of a specific $G$-rank for various choices of $G$. Among other things, we will see that a function, matrix, or tensor may have very high matrix, tensor, or multilinear rank and yet very low $G$-rank for some $G$. In fact the difference is in the orders of magnitudes and the gaps between $G$-ranks and these classical ranks are arbitrarily large for some important objects in computer science, mathematics, and physics. Furthermore, we show that there is a $G$ such that almost every tensor has $G$-rank exponentially lower than its rank or the dimension of its ambient space.
AGApr 22, 2018
Topology of tensor ranksPierre Comon, Lek-Heng Lim, Yang Qi et al.
We study path-connectedness and homotopy groups of sets of tensors defined by tensor rank, border rank, multilinear rank, as well as their symmetric counterparts for symmetric tensors. We show that over $\mathbb{C}$, the set of rank-$r$ tensors and the set of symmetric rank-$r$ symmetric tensors are both path-connected if $r$ is not more than the complex generic rank; these results also extend to border rank and symmetric border rank over $\mathbb{C}$. Over $\mathbb{R}$, the set of rank-$r$ tensors is path-connected if it has the expected dimension but the corresponding result for symmetric rank-$r$ symmetric $d$-tensors depends on the order $d$: connected when $d$ is odd but not when $d$ is even. Border rank and symmetric border rank over $\mathbb{R}$ have essentially the same path-connectedness properties as rank and symmetric rank over $\mathbb{R}$. When $r$ is greater than the complex generic rank, we are unable to discern any general pattern: For example, we show that border-rank-three tensors in $\mathbb{R}^2 \otimes \mathbb{R}^2 \otimes \mathbb{R}^2$ fall into four connected components. For multilinear rank, the manifold of $d$-tensors of multilinear rank $(r_1,\dots,r_d)$ in $\mathbb{C}^{n_1} \otimes \cdots \otimes \mathbb{C}^{n_d}$ is always path-connected, and the same is true in $\mathbb{R}^{n_1} \otimes \cdots \otimes \mathbb{R}^{n_d}$ unless $n_i = r_i = \prod_{j \ne i} r_j$ for some $i\in\{1, \dots, d\}$. Beyond path-connectedness, we determine, over both $\mathbb{R}$ and $\mathbb{C}$, the fundamental and higher homotopy groups of the set of tensors of a fixed small rank, and, taking advantage of Bott periodicity, those of the manifold of tensors of a fixed multilinear rank. We also obtain analogues of these results for symmetric tensors of a fixed symmetric rank or a fixed symmetric multilinear rank.
NAOct 9, 2017
Fast randomized iteration: diffusion Monte Carlo through the lens of numerical linear algebraLek-Heng Lim, Jonathan Weare
We review the basic outline of the highly successful diffusion Monte Carlo technique commonly used in contexts ranging from electronic structure calculations to rare event simulation and data assimilation, and propose a new class of randomized iterative algorithms based on similar principles to address a variety of common tasks in numerical linear algebra. From the point of view of numerical linear algebra, the main novelty of the Fast Randomized Iteration schemes described in this article is that they work in either linear or constant cost per iteration (and in total, under appropriate conditions) and are rather versatile: we will show how they apply to solution of linear systems, eigenvalue problems, and matrix exponentiation, in dimensions far beyond the present limits of numerical linear algebra. While traditional iterative methods in numerical linear algebra were created in part to deal with instances where a matrix (of size $\mathcal{O}(n^2)$) is too big to store, the algorithms that we propose are effective even in instances where the solution vector itself (of size $\mathcal{O}(n)$) may be too big to store or manipulate. In fact, our work is motivated by recent DMC based quantum Monte Carlo schemes that have been applied to matrices as large as $10^{108} \times 10^{108}$. We provide basic convergence results, discuss the dependence of these results on the dimension of the system, and demonstrate dramatic cost savings on a range of test problems.
LGMay 15, 2022
What is an equivariant neural network?Lek-Heng Lim, Bradley J. Nelson
We explain equivariant neural networks, a notion underlying breakthroughs in machine learning from deep convolutional neural networks for computer vision to AlphaFold 2 for protein structure prediction, without assuming knowledge of equivariance or neural networks. The basic mathematical ideas are simple but are often obscured by engineering complications that come with practical realizations. We extract and focus on the mathematical aspects, and limit ourselves to a cursory treatment of the engineering issues at the end.
NAJun 9, 2016
Fast structured matrix computations: tensor rank and Cohn--Umans methodKe Ye, Lek-Heng Lim
We discuss a generalization of the Cohn-Umans method, a potent technique developed for studying the bilinear complexity of matrix multiplication by embedding matrices into an appropriate group algebra. We investigate how the Cohn-Umans method may be used for bilinear operations other than matrix multiplication, with algebras other than group algebras, and we relate it to Strassen's tensor rank approach, the traditional framework for investigating bilinear complexity. To demonstrate the utility of the generalized method, we apply it to find the fastest algorithms for forming structured matrix-vector product, the basic operation underlying iterative algorithms for structured matrices. The structures we study include Toeplitz, Hankel, circulant, symmetric, skew-symmetric, f-circulant, block-Toeplitz-Toeplitz-block, triangular Toeplitz matrices, Toeplitz-plus-Hankel, sparse/banded/triangular. Except for the case of skew-symmetric matrices, for which we have only upper bounds, the algorithms derived using the generalized Cohn-Umans method in all other instances are the fastest possible in the sense of having minimum bilinear complexity. We also apply this framework to a few other bilinear operations including matrix-matrix, commutator, simultaneous matrix products, and briefly discuss the relation between tensor nuclear norm and numerical stability.
OCDec 18, 2018
Semi-Riemannian Manifold OptimizationTingran Gao, Lek-Heng Lim, Ke Ye
We introduce in this paper a manifold optimization framework that utilizes semi-Riemannian structures on the underlying smooth manifolds. Unlike in Riemannian geometry, where each tangent space is equipped with a positive definite inner product, a semi-Riemannian manifold allows the metric tensor to be indefinite on each tangent space, i.e., possessing both positive and negative definite subspaces; differential geometric objects such as geodesics and parallel-transport can be defined on non-degenerate semi-Riemannian manifolds as well, and can be carefully leveraged to adapt Riemannian optimization algorithms to the semi-Riemannian setting. In particular, we discuss the metric independence of manifold optimization algorithms, and illustrate that the weaker but more general semi-Riemannian geometry often suffices for the purpose of optimizing smooth functions on smooth manifolds in practice.
28.8NAMay 28
Generalized matrix nearness problems IIRongbiao Thomas Wang, Chi-Kwong Li, Lek-Heng Lim
Given a matrix $A$, a matrix nearness problem seeks an $X$ that most closely approximates $A$ in the sense of minimizing $\lVert A - X\rVert$ under a variety of constraints on $X$. A generalized matrix nearness problem seeks the same but with three given matrices $A,B,C$ and $\lVert A - BXC\rVert$ in place of $\lVert A - X\rVert$. We extend previous studies of the latter problem in three directions: incorporating an affine term, replacing matrix product by Kronecker product in various manners, and generalizing Frobenius norm to any orthogonally invariant norm. We will solve several of these in closed form. For the rest, we develop an iterative algorithm that works for any Schatten norm, proving that it converges to a global minimizer regardless of the initial point. In addition, the algorithm relies purely on numerical linear algebra, and notably does not compute any explicit gradients or subgradients. Along the way, we will also show that there is no Mirsky-type theorem for rank constrained generalized matrix nearness problems.
LGNov 25, 2022
LU decomposition and Toeplitz decomposition of a neural networkYucong Liu, Simiao Jiao, Lek-Heng Lim
It is well-known that any matrix $A$ has an LU decomposition. Less well-known is the fact that it has a 'Toeplitz decomposition' $A = T_1 T_2 \cdots T_r$ where $T_i$'s are Toeplitz matrices. We will prove that any continuous function $f : \mathbb{R}^n \to \mathbb{R}^m$ has an approximation to arbitrary accuracy by a neural network that takes the form $L_1 σ_1 U_1 σ_2 L_2 σ_3 U_2 \cdots L_r σ_{2r-1} U_r$, i.e., where the weight matrices alternate between lower and upper triangular matrices, $σ_i(x) := σ(x - b_i)$ for some bias vector $b_i$, and the activation $σ$ may be chosen to be essentially any uniformly continuous nonpolynomial function. The same result also holds with Toeplitz matrices, i.e., $f \approx T_1 σ_1 T_2 σ_2 \cdots σ_{r-1} T_r$ to arbitrary accuracy, and likewise for Hankel matrices. A consequence of our Toeplitz result is a fixed-width universal approximation theorem for convolutional neural networks, which so far have only arbitrary width versions. Since our results apply in particular to the case when $f$ is a general neural network, we may regard them as LU and Toeplitz decompositions of a neural network. The practical implication of our results is that one may vastly reduce the number of weight parameters in a neural network without sacrificing its power of universal approximation. We will present several experiments on real data sets to show that imposing such structures on the weight matrices sharply reduces the number of training parameters with almost no noticeable effect on test accuracy.
NAMar 22, 2016
Algorithms for structured matrix-vector product of optimal bilinear complexityKe Ye, Lek-Heng Lim
We present explicit algorithms for computing structured matrix-vector products that are optimal in the sense of Strassen, i.e., using a provably minimum number of multiplications. These structures include Toeplitz/Hankel/circulant, symmetric, Toeplitz-plus-Hankel, sparse, and multilevel structures. The last category include \textsc{bttb}, \textsc{bhhb}, \textsc{bccb} but also any arbitrarily complicated nested structures built out of other structures.
OCNov 28, 2022
Stochastic Steffensen methodMinda Zhao, Zehua Lai, Lek-Heng Lim
Is it possible for a first-order method, i.e., only first derivatives allowed, to be quadratically convergent? For univariate loss functions, the answer is yes -- the Steffensen method avoids second derivatives and is still quadratically convergent like Newton method. By incorporating an optimal step size we can even push its convergence order beyond quadratic to $1+\sqrt{2} \approx 2.414$. While such high convergence orders are a pointless overkill for a deterministic algorithm, they become rewarding when the algorithm is randomized for problems of massive sizes, as randomization invariably compromises convergence speed. We will introduce two adaptive learning rates inspired by the Steffensen method, intended for use in a stochastic optimization setting and requires no hyperparameter tuning aside from batch size. Extensive experiments show that they compare favorably with several existing first-order methods. When restricted to a quadratic objective, our stochastic Steffensen methods reduce to randomized Kaczmarz method -- note that this is not true for SGD or SLBFGS -- and thus we may also view our methods as a generalization of randomized Kaczmarz to arbitrary objectives.
NANov 3, 2017
Accurate Solutions of Polynomial Eigenvalue ProblemsYiling You, Jose Israel Rodriguez, Lek-Heng Lim
Quadratic eigenvalue problems (QEP) and more generally polynomial eigenvalue problems (PEP) are among the most common types of nonlinear eigenvalue problems. Both problems, especially the QEP, have extensive applications. A typical approach to solve QEP and PEP is to use a linearization method to reformulate the problem as a higher dimensional linear eigenvalue problem. In this article, we use homotopy continuation to solve these nonlinear eigenvalue problems without passing to higher dimensions. Our main contribution is to show that our method produces substantially more accurate results, and finds all eigenvalues with a certificate of correctness via Smale's $α$-theory. To explain the superior accuracy, we show that the nonlinear eigenvalue problem we solve is better conditioned than its reformulated linear eigenvalue problem, and our homotopy continuation algorithm is more stable than QZ algorithm - theoretical findings that are borne out by our numerical experiments. Our studies provide yet another illustration of the dictum in numerical analysis that, for reasons of conditioning and stability, it is sometimes better to solve a nonlinear problem directly even when it could be transformed into a linear problem with the same solution mathematically.
AIAug 19, 2024
Attention is a smoothed cubic splineZehua Lai, Lek-Heng Lim, Yucong Liu
We highlight a perhaps important but hitherto unobserved insight: The attention module in a transformer is a smoothed cubic spline. Viewed in this manner, this mysterious but critical component of a transformer becomes a natural development of an old notion deeply entrenched in classical approximation theory. More precisely, we show that with ReLU-activation, attention, masked attention, encoder-decoder attention are all cubic splines. As every component in a transformer is constructed out of compositions of various attention modules (= cubic splines) and feed forward neural networks (= linear splines), all its components -- encoder, decoder, and encoder-decoder blocks; multilayered encoders and decoders; the transformer itself -- are cubic or higher-order splines. If we assume the Pierce-Birkhoff conjecture, then the converse also holds, i.e., every spline is a ReLU-activated encoder. Since a spline is generally just $C^2$, one way to obtain a smoothed $C^\infty$-version is by replacing ReLU with a smooth activation; and if this activation is chosen to be SoftMax, we recover the original transformer as proposed by Vaswani et al. This insight sheds light on the nature of the transformer by casting it entirely in terms of splines, one of the best known and thoroughly understood objects in applied mathematics.
STMar 21, 2025
Glivenko-Cantelli for $f$-divergenceHaoming Wang, Lek-Heng Lim
We extend the celebrated Glivenko-Cantelli theorem, sometimes called the fundamental theorem of statistics, from its standard setting of total variation distance to all $f$-divergences. A key obstacle in this endeavor is to define $f$-divergence on a subcollection of a $σ$-algebra that forms a $π$-system but not a $σ$-subalgebra. This is a side contribution of our work. We will show that this notion of $f$-divergence on the $π$-system of rays preserves nearly all known properties of standard $f$-divergence, yields a novel integral representation of the Kolmogorov-Smirnov distance, and has a Glivenko-Cantelli theorem. We will also discuss the prospects of a Vapnik-Chervonenkis theory for $f$-divergence.
OCJun 2, 2020
Recht-Ré Noncommutative Arithmetic-Geometric Mean Conjecture is FalseZehua Lai, Lek-Heng Lim
Stochastic optimization algorithms have become indispensable in modern machine learning. An unresolved foundational question in this area is the difference between with-replacement sampling and without-replacement sampling -- does the latter have superior convergence rate compared to the former? A groundbreaking result of Recht and Ré reduces the problem to a noncommutative analogue of the arithmetic-geometric mean inequality where $n$ positive numbers are replaced by $n$ positive definite matrices. If this inequality holds for all $n$, then without-replacement sampling indeed outperforms with-replacement sampling. The conjectured Recht-Ré inequality has so far only been established for $n = 2$ and a special case of $n = 3$. We will show that the Recht-Ré conjecture is false for general $n$. Our approach relies on the noncommutative Positivstellensatz, which allows us to reduce the conjectured inequality to a semidefinite program and the validity of the conjecture to certain bounds for the optimum values, which we show are false as soon as $n = 5$.
LGApr 13, 2020
Topology of deep neural networksGregory Naitzat, Andrey Zhitnikov, Lek-Heng Lim
We study how the topology of a data set $M = M_a \cup M_b \subseteq \mathbb{R}^d$, representing two classes $a$ and $b$ in a binary classification problem, changes as it passes through the layers of a well-trained neural network, i.e., with perfect accuracy on training set and near-zero generalization error ($\approx 0.01\%$). The goal is to shed light on two mysteries in deep neural networks: (i) a nonsmooth activation function like ReLU outperforms a smooth one like hyperbolic tangent; (ii) successful neural network architectures rely on having many layers, even though a shallow network can approximate any function arbitrary well. We performed extensive experiments on the persistent homology of a wide range of point cloud data sets, both real and simulated. The results consistently demonstrate the following: (1) Neural networks operate by changing topology, transforming a topologically complicated data set into a topologically simple one as it passes through the layers. No matter how complicated the topology of $M$ we begin with, when passed through a well-trained neural network $f : \mathbb{R}^d \to \mathbb{R}^p$, there is a vast reduction in the Betti numbers of both components $M_a$ and $M_b$; in fact they nearly always reduce to their lowest possible values: $β_k\bigl(f(M_i)\bigr) = 0$ for $k \ge 1$ and $β_0\bigl(f(M_i)\bigr) = 1$, $i =a, b$. Furthermore, (2) the reduction in Betti numbers is significantly faster for ReLU activation than hyperbolic tangent activation as the former defines nonhomeomorphic maps that change topology, whereas the latter defines homeomorphic maps that preserve topology. Lastly, (3) shallow and deep networks transform data sets differently -- a shallow network operates mainly through changing geometry and changes topology only in its final layers, a deep one spreads topological changes more evenly across all layers.
LGJul 2, 2019
Best k-layer neural network approximationsLek-Heng Lim, Mateusz Michalek, Yang Qi
We show that the empirical risk minimization (ERM) problem for neural networks has no solution in general. Given a training set $s_1, \dots, s_n \in \mathbb{R}^p$ with corresponding responses $t_1,\dots,t_n \in \mathbb{R}^q$, fitting a $k$-layer neural network $ν_θ: \mathbb{R}^p \to \mathbb{R}^q$ involves estimation of the weights $θ\in \mathbb{R}^m$ via an ERM: \[ \inf_{θ\in \mathbb{R}^m} \; \sum_{i=1}^n \lVert t_i - ν_θ(s_i) \rVert_2^2. \] We show that even for $k = 2$, this infimum is not attainable in general for common activations like ReLU, hyperbolic tangent, and sigmoid functions. A high-level explanation is like that for the nonexistence of best rank-$r$ approximations of higher-order tensors --- the set of parameters is not a closed set --- but the geometry involved for best $k$-layer neural networks approximations is more subtle. In addition, we show that for smooth activations $σ(x)= 1/\bigl(1 + \exp(-x)\bigr)$ and $σ(x)=\tanh(x)$, such failure to attain an infimum can happen on a positive-measured subset of responses. For the ReLU activation $σ(x)=\max(0,x)$, we completely classifying cases where the ERM for a best two-layer neural network approximation attains its infimum. As an aside, we obtain a precise description of the geometry of the space of two-layer neural networks with $d$ neurons in the hidden layer: it is the join locus of a line and the $d$-secant locus of a cone.
NASep 6, 2018
Complex best $r$-term approximations almost always exist in finite dimensionsYang Qi, Mateusz Michałek, Lek-Heng Lim
We show that in finite-dimensional nonlinear approximations, the best $r$-term approximant of a function $f$ almost always exists over $\mathbb{C}$ but that the same is not true over $\mathbb{R}$, i.e., the infimum $\inf_{f_1,\dots,f_r \in Y} \lVert f - f_1 - \dots - f_r \rVert$ is almost always attainable by complex-valued functions $f_1,\dots, f_r$ in $Y$, a set of functions that have some desired structures. Our result extends to functions that possess special properties like symmetry or skew-symmetry under permutations of arguments. For the case where $Y$ is the set of separable functions, the problem becomes that of best rank-$r$ tensor approximations. We show that over $\mathbb{C}$, any tensor almost always has a unique best rank-$r$ approximation. This extends to other notions of tensor ranks such as symmetric rank and alternating rank, to best $r$-block-terms approximations, and to best approximations by tensor networks. When applied to sparse-plus-low-rank approximations, we obtain that for any given $r$ and $k$, a general tensor has a unique best approximation by a sum of a rank-$r$ tensor and a $k$-sparse tensor with a fixed sparsity pattern; this arises in, for example, estimation of covariance matrices of a Gaussian hidden variable model with $k$ observed variables conditionally independent given $r$ hidden variables. The existential (but not the uniqueness) part of our result also applies to best approximations by a sum of a rank-$r$ tensor and a $k$-sparse tensor with no fixed sparsity pattern, as well as to tensor completion problems.
LGMay 18, 2018
Tropical Geometry of Deep Neural NetworksLiwen Zhang, Gregory Naitzat, Lek-Heng Lim
We establish, for the first time, connections between feedforward neural networks with ReLU activation and tropical geometry --- we show that the family of such neural networks is equivalent to the family of tropical rational maps. Among other things, we deduce that feedforward ReLU neural networks with one hidden layer can be characterized by zonotopes, which serve as building blocks for deeper networks; we relate decision boundaries of such neural networks to tropical hypersurfaces, a major object of study in tropical geometry; and we prove that linear regions of such neural networks correspond to vertices of polytopes associated with tropical rational functions. An insight from our tropical formulation is that a deeper network is exponentially more expressive than a shallow network.
CVApr 5, 2016
Cohomology of Cryo-Electron MicroscopyKe Ye, Lek-Heng Lim
The goal of cryo-electron microscopy (EM) is to reconstruct the 3-dimensional structure of a molecule from a collection of its 2-dimensional projected images. In this article, we show that the basic premise of cryo-EM --- patching together 2-dimensional projections to reconstruct a 3-dimensional object --- is naturally one of Cech cohomology with SO(2)-coefficients. We deduce that every cryo-EM reconstruction problem corresponds to an oriented circle bundle on a simplicial complex, allowing us to classify cryo-EM problems via principal bundles. In practice, the 2-dimensional images are noisy and a main task in cryo-EM is to denoise them. We will see how the aforementioned insights can be used towards this end.
OCMar 31, 2012
PARNES: A rapidly convergent algorithm for accurate recovery of sparse and approximately sparse signalsMing Gu, Lek-Heng Lim, Cinna Julie Wu
In this article, we propose an algorithm, NESTA-LASSO, for the LASSO problem, i.e., an underdetermined linear least-squares problem with a 1-norm constraint on the solution. We prove under the assumption of the restricted isometry property (RIP) and a sparsity condition on the solution, that NESTA-LASSO is guaranteed to be almost always locally linearly convergent. As in the case of the algorithm NESTA proposed by Becker, Bobin, and Candes, we rely on Nesterov's accelerated proximal gradient method, which takes O(e^{-1/2}) iterations to come within e > 0 of the optimal value. We introduce a modification to Nesterov's method that regularly updates the prox-center in a provably optimal manner, and the aforementioned linear convergence is in part due to this modification. In the second part of this article, we attempt to solve the basis pursuit denoising BPDN problem (i.e., approximating the minimum 1-norm solution to an underdetermined least squares problem) by using NESTA-LASSO in conjunction with the Pareto root-finding method employed by van den Berg and Friedlander in their SPGL1 solver. The resulting algorithm is called PARNES. We provide numerical evidence to show that it is comparable to currently available solvers.
OCMay 30, 2010
Quasi-Newton methods on Grassmannians and multilinear approximations of tensorsBerkant Savas, Lek-Heng Lim
In this paper we proposed quasi-Newton and limited memory quasi-Newton methods for objective functions defined on Grassmannians or a product of Grassmannians. Specifically we defined BFGS and L-BFGS updates in local and global coordinates on Grassmannians or a product of these. We proved that, when local coordinates are used, our BFGS updates on Grassmannians share the same optimality property as the usual BFGS updates on Euclidean spaces. When applied to the best multilinear rank approximation problem for general and symmetric tensors, our approach yields fast, robust, and accurate algorithms that exploit the special Grassmannian structure of the respective problems, and which work on tensors of large dimensions and arbitrarily high order. Extensive numerical experiments are included to substantiate our claims.
NAApr 12, 2009
Nonnegative approximations of nonnegative tensorsLek-Heng Lim, Pierre Comon
We study the decomposition of a nonnegative tensor into a minimal sum of outer product of nonnegative vectors and the associated parsimonious naive Bayes probabilistic model. We show that the corresponding approximation problem, which is central to nonnegative PARAFAC, will always have optimal solutions. The result holds for any choice of norms and, under a mild assumption, even Bregman divergences.
NAApr 1, 2008
Tensor rank and the ill-posedness of the best low-rank approximation problemVin de Silva, Lek-Heng Lim
There has been continued interest in seeking a theorem describing optimal low-rank approximations to tensors of order 3 or higher, that parallels the Eckart-Young theorem for matrices. In this paper, we argue that the naive approach to this problem is doomed to failure because, unlike matrices, tensors of order 3 or higher can fail to have best rank-r approximations. The phenomenon is much more widespread than one might suspect: examples of this failure can be constructed over a wide range of dimensions, orders and ranks, regardless of the choice of norm (or even Bregman divergence). Moreover, we show that in many instances these counterexamples have positive volume: they cannot be regarded as isolated phenomena. In one extreme case, we exhibit a tensor space in which no rank-3 tensor has an optimal rank-2 approximation. The notable exceptions to this misbehavior are rank-1 tensors and order-2 tensors. In a more positive spirit, we propose a natural way of overcoming the ill-posedness of the low-rank approximation problem, by using weak solutions when true solutions do not exist. In our work we emphasize the importance of closely studying concrete low-dimensional examples as a first step towards more general results. To this end, we present a detailed analysis of equivalence classes of 2-by-2-by-2 tensors, and we develop methods for extending results upwards to higher orders and dimensions. Finally, we link our work to existing studies of tensors from an algebraic geometric point of view. The rank of a tensor can in theory be given a semialgebraic description; i.e., can be determined by a system of polynomial inequalities. In particular we make extensive use of the 2-by-2-by-2 hyperdeterminant.