Helmut Bölcskei

LG
h-index2
31papers
1,473citations
Novelty55%
AI Score43

31 Papers

AIJan 30
Complete Identification of Deep ReLU Neural Networks by Many-Valued Logic

Yani Zhang, Helmut Bölcskei

Deep ReLU neural networks admit nontrivial functional symmetries: vastly different architectures and parameters (weights and biases) can realize the same function. We address the complete identification problem -- given a function f, deriving the architecture and parameters of all feedforward ReLU networks giving rise to f. We translate ReLU networks into Lukasiewicz logic formulae, and effect functional equivalent network transformations through algebraic rewrites governed by the logic axioms. A compositional norm form is proposed to facilitate the mapping from Lukasiewicz logic formulae back to ReLU networks. Using Chang's completeness theorem, we show that for every functional equivalence class, all ReLU networks in that class are connected by a finite set of symmetries corresponding to the finite set of axioms of Lukasiewicz logic. This idea is reminiscent of Shannon's seminal work on switching circuit design, where the circuits are translated into Boolean formulae, and synthesis is effected by algebraic rewriting governed by Boolean logic axioms.

LGJul 1, 2024
Metric-Entropy Limits on the Approximation of Nonlinear Dynamical Systems

Yang Pan, Clemens Hutter, Helmut Bölcskei

This paper is concerned with fundamental limits on the approximation of nonlinear dynamical systems. Specifically, we show that recurrent neural networks (RNNs) can approximate nonlinear systems -- that satisfy a Lipschitz property and forget past inputs fast enough -- in metric-entropy-optimal manner. As the sets of sequence-to-sequence mappings realized by the dynamical systems we consider are significantly more massive than function classes generally analyzed in approximation theory, a refined metric-entropy characterization is needed, namely in terms of order, type, and generalized dimension. We compute these quantities for the classes of exponentially- and polynomially Lipschitz fading-memory systems and show that RNNs can achieve them.

AIApr 8, 2024
Cellular automata, many-valued logic, and deep neural networks

Yani Zhang, Helmut Bölcskei

We develop a theory characterizing the fundamental capability of deep neural networks to learn, from evolution traces, the logical rules governing the behavior of cellular automata (CA). This is accomplished by first establishing a novel connection between CA and Lukasiewicz propositional logic. While binary CA have been known for decades to essentially perform operations in Boolean logic, no such relationship exists for general CA. We demonstrate that many-valued (MV) logic, specifically Lukasiewicz propositional logic, constitutes a suitable language for characterizing general CA as logical machines. This is done by interpolating CA transition functions to continuous piecewise linear functions, which, by virtue of the McNaughton theorem, yield formulae in MV logic characterizing the CA. Recognizing that deep rectified linear unit (ReLU) networks realize continuous piecewise linear functions, it follows that these formulae are naturally extracted from CA evolution traces by deep ReLU networks. A corresponding algorithm together with a software implementation is provided. Finally, we show that the dynamical behavior of CA can be realized by recurrent neural networks.

LGDec 6, 2024
Generating Rectifiable Measures through Neural Networks

Erwin Riegler, Alex Bühler, Yang Pan et al.

We derive universal approximation results for the class of (countably) $m$-rectifiable measures. Specifically, we prove that $m$-rectifiable measures can be approximated as push-forwards of the one-dimensional Lebesgue measure on $[0,1]$ using ReLU neural networks with arbitrarily small approximation error in terms of Wasserstein distance. What is more, the weights in the networks under consideration are quantized and bounded and the number of ReLU neural networks required to achieve an approximation error of $\varepsilon$ is no larger than $2^{b(\varepsilon)}$ with $b(\varepsilon)=\mathcal{O}(\varepsilon^{-m}\log^2(\varepsilon))$. This result improves Lemma IX.4 in Perekrestenko et al. as it shows that the rate at which $b(\varepsilon)$ tends to infinity as $\varepsilon$ tends to zero equals the rectifiability parameter $m$, which can be much smaller than the ambient dimension. We extend this result to countably $m$-rectifiable measures and show that this rate still equals the rectifiability parameter $m$ provided that, among other technical assumptions, the measure decays exponentially on the individual components of the countably $m$-rectifiable support set.

MLMay 3, 2024
Three Quantization Regimes for ReLU Networks

Weigutian Ou, Philipp Schenkel, Helmut Bölcskei

We establish the fundamental limits in the approximation of Lipschitz functions by deep ReLU neural networks with finite-precision weights. Specifically, three regimes, namely under-, over-, and proper quantization, in terms of minimax approximation error behavior as a function of network weight precision, are identified. This is accomplished by deriving nonasymptotic tight lower and upper bounds on the minimax approximation error. Notably, in the proper-quantization regime, neural networks exhibit memory-optimality in the approximation of Lipschitz functions. Deep networks have an inherent advantage over shallow networks in achieving memory-optimality. We also develop the notion of depth-precision tradeoff, showing that networks with high-precision weights can be converted into functionally equivalent deeper networks with low-precision weights, while preserving memory-optimality. This idea is reminiscent of sigma-delta analog-to-digital conversion, where oversampling rate is traded for resolution in the quantization of signal samples. We improve upon the best-known ReLU network approximation results for Lipschitz functions and describe a refinement of the bit extraction technique which could be of independent general interest.

LGJan 22, 2024
Extracting Formulae in Many-Valued Logic from Deep Neural Networks

Yani Zhang, Helmut Bölcskei

We propose a new perspective on deep ReLU networks, namely as circuit counterparts of Lukasiewicz infinite-valued logic -- a many-valued (MV) generalization of Boolean logic. An algorithm for extracting formulae in MV logic from deep ReLU networks is presented. As the algorithm applies to networks with general, in particular also real-valued, weights, it can be used to extract logical formulae from deep ReLU networks trained on data.

LGJul 26, 2021
High-Dimensional Distribution Generation Through Deep Neural Networks

Dmytro Perekrestenko, Léandre Eberhard, Helmut Bölcskei

We show that every $d$-dimensional probability distribution of bounded support can be generated through deep ReLU networks out of a $1$-dimensional uniform input distribution. What is more, this is possible without incurring a cost - in terms of approximation error measured in Wasserstein-distance - relative to generating the $d$-dimensional target distribution from $d$ independent random variables. This is enabled by a vast generalization of the space-filling approach discovered in (Bailey & Telgarsky, 2018). The construction we propose elicits the importance of network depth in driving the Wasserstein distance between the target distribution and its neural network approximation to zero. Finally, we find that, for histogram target distributions, the number of bits needed to encode the corresponding generative network equals the fundamental limit for encoding probability distributions as dictated by quantization theory.

LGMay 6, 2021
Metric Entropy Limits on Recurrent Neural Network Learning of Linear Dynamical Systems

Clemens Hutter, Recep Gül, Helmut Bölcskei

One of the most influential results in neural network theory is the universal approximation theorem [1, 2, 3] which states that continuous functions can be approximated to within arbitrary accuracy by single-hidden-layer feedforward neural networks. The purpose of this paper is to establish a result in this spirit for the approximation of general discrete-time linear dynamical systems - including time-varying systems - by recurrent neural networks (RNNs). For the subclass of linear time-invariant (LTI) systems, we devise a quantitative version of this statement. Specifically, measuring the complexity of the considered class of LTI systems through metric entropy according to [4], we show that RNNs can optimally learn - or identify in system-theory parlance - stable LTI systems. For LTI systems whose input-output relation is characterized through a difference equation, this means that RNNs can learn the difference equation from input-output traces in a metric-entropy optimal manner.

LGJun 30, 2020
Constructive Universal High-Dimensional Distribution Generation through Deep ReLU Networks

Dmytro Perekrestenko, Stephan Müller, Helmut Bölcskei

We present an explicit deep neural network construction that transforms uniformly distributed one-dimensional noise into an arbitrarily close approximation of any two-dimensional Lipschitz-continuous target distribution. The key ingredient of our design is a generalization of the "space-filling" property of sawtooth functions discovered in (Bailey & Telgarsky, 2018). We elicit the importance of depth - in our neural network construction - in driving the Wasserstein distance between the target distribution and the approximation realized by the network to zero. An extension to output distributions of arbitrary dimension is outlined. Finally, we show that the proposed construction does not incur a cost - in terms of error measured in Wasserstein-distance - relative to generating $d$-dimensional target distributions from $d$ independent random variables.

ITJun 21, 2020
Affine symmetries and neural network identifiability

Verner Vlačić, Helmut Bölcskei

We address the following question of neural network identifiability: Suppose we are given a function $f:\mathbb{R}^m\to\mathbb{R}^n$ and a nonlinearity $ρ$. Can we specify the architecture, weights, and biases of all feed-forward neural networks with respect to $ρ$ giving rise to $f$? Existing literature on the subject suggests that the answer should be yes, provided we are only concerned with finding networks that satisfy certain "genericity conditions". Moreover, the identified networks are mutually related by symmetries of the nonlinearity. For instance, the $\tanh$ function is odd, and so flipping the signs of the incoming and outgoing weights of a neuron does not change the output map of the network. The results known hitherto, however, apply either to single-layer networks, or to networks satisfying specific structural assumptions (such as full connectivity), as well as to specific nonlinearities. In an effort to answer the identifiability question in greater generality, we consider arbitrary nonlinearities with potentially complicated affine symmetries, and we show that the symmetries can be used to find a rich set of networks giving rise to the same function $f$. The set obtained in this manner is, in fact, exhaustive (i.e., it contains all networks giving rise to $f$) unless there exists a network $\mathcal{A}$ "with no internal symmetries" giving rise to the identically zero function. This result can thus be interpreted as an analog of the rank-nullity theorem for linear operators. We furthermore exhibit a class of "$\tanh$-type" nonlinearities (including the tanh function itself) for which such a network $\mathcal{A}$ does not exist, thereby solving the identifiability question for these nonlinearities in full generality. Finally, we show that this class contains nonlinearities with arbitrarily complicated symmetries.

COJun 11, 2019
Neural network identifiability for a family of sigmoidal nonlinearities

Verner Vlačić, Helmut Bölcskei

This paper addresses the following question of neural network identifiability: Does the input-output map realized by a feed-forward neural network with respect to a given nonlinearity uniquely specify the network architecture, weights, and biases? Existing literature on the subject Sussman 1992, Albertini, Sontag et al. 1993, Fefferman 1994 suggests that the answer should be yes, up to certain symmetries induced by the nonlinearity, and provided the networks under consideration satisfy certain "genericity conditions". The results in Sussman 1992 and Albertini, Sontag et al. 1993 apply to networks with a single hidden layer and in Fefferman 1994 the networks need to be fully connected. In an effort to answer the identifiability question in greater generality, we derive necessary genericity conditions for the identifiability of neural networks of arbitrary depth and connectivity with an arbitrary nonlinearity. Moreover, we construct a family of nonlinearities for which these genericity conditions are minimal, i.e., both necessary and sufficient. This family is large enough to approximate many commonly encountered nonlinearities to within arbitrary precision in the uniform norm.

LGJan 8, 2019
Deep Neural Network Approximation Theory

Dennis Elbrächter, Dmytro Perekrestenko, Philipp Grohs et al.

This paper develops fundamental limits of deep neural network learning by characterizing what is possible if no constraints are imposed on the learning algorithm and on the amount of training data. Concretely, we consider Kolmogorov-optimal approximation through deep neural networks with the guiding theme being a relation between the complexity of the function (class) to be approximated and the complexity of the approximating network in terms of connectivity and memory requirements for storing the network topology and the associated quantized weights. The theory we develop establishes that deep networks are Kolmogorov-optimal approximants for markedly different function classes, such as unit balls in Besov spaces and modulation spaces. In addition, deep networks provide exponential approximation accuracy - i.e., the approximation error decays exponentially in the number of nonzero weights in the network - of the multiplication operation, polynomials, sinusoidal functions, and certain smooth functions. Moreover, this holds true even for one-dimensional oscillatory textures and the Weierstrass function - a fractal function, neither of which has previously known methods achieving exponential approximation accuracy. We also show that in the approximation of sufficiently smooth functions finite-width deep networks require strictly smaller connectivity than finite-depth wide networks.

LGJun 5, 2018
The universal approximation power of finite-width deep ReLU networks

Dmytro Perekrestenko, Philipp Grohs, Dennis Elbrächter et al.

We show that finite-width deep ReLU neural networks yield rate-distortion optimal approximation (Bölcskei et al., 2018) of polynomials, windowed sinusoidal functions, one-dimensional oscillatory textures, and the Weierstrass function, a fractal function which is continuous but nowhere differentiable. Together with their recently established universal approximation property of affine function systems (Bölcskei et al., 2018), this shows that deep neural networks approximate vastly different signal structures generated by the affine group, the Weyl-Heisenberg group, or through warping, and even certain fractals, all with approximation error decaying exponentially in the number of neurons. We also prove that in the approximation of sufficiently smooth functions finite-width deep networks require strictly smaller connectivity than finite-depth wide networks.

MLJul 10, 2017
Topology Reduction in Deep Convolutional Feature Extraction Networks

Thomas Wiatowski, Philipp Grohs, Helmut Bölcskei

Deep convolutional neural networks (CNNs) used in practice employ potentially hundreds of layers and $10$,$000$s of nodes. Such network sizes entail significant computational complexity due to the large number of convolutions that need to be carried out; in addition, a large number of parameters needs to be learned and stored. Very deep and wide CNNs may therefore not be well suited to applications operating under severe resource constraints as is the case, e.g., in low-power embedded and mobile platforms. This paper aims at understanding the impact of CNN topology, specifically depth and width, on the network's feature extraction capabilities. We address this question for the class of scattering networks that employ either Weyl-Heisenberg filters or wavelets, the modulus non-linearity, and no pooling. The exponential feature map energy decay results in Wiatowski et al., 2017, are generalized to $\mathcal{O}(a^{-N})$, where an arbitrary decay factor $a>1$ can be realized through suitable choice of the Weyl-Heisenberg prototype function or the mother wavelet. We then show how networks of fixed (possibly small) depth $N$ can be designed to guarantee that $((1-\varepsilon)\cdot 100)\%$ of the input signal's energy are contained in the feature vector. Based on the notion of operationally significant nodes, we characterize, partly rigorously and partly heuristically, the topology-reducing effects of (effectively) band-limited input signals, band-limited filters, and feature map symmetries. Finally, for networks based on Weyl-Heisenberg filters, we determine the prototype function bandwidth that minimizes---for fixed network depth $N$---the average number of operationally significant nodes per layer.

ITAug 3, 2017
Vandermonde Matrices with Nodes in the Unit Disk and the Large Sieve

Céline Aubel, Helmut Bölcskei

We derive bounds on the extremal singular values and the condition number of NxK, with N>=K, Vandermonde matrices with nodes in the unit disk. The mathematical techniques we develop to prove our main results are inspired by a link---first established by by Selberg [1] and later extended by Moitra [2]---between the extremal singular values of Vandermonde matrices with nodes on the unit circle and large sieve inequalities. Our main conceptual contribution lies in establishing a connection between the extremal singular values of Vandermonde matrices with nodes in the unit disk and a novel large sieve inequality involving polynomials in z \in C with |z|<=1. Compared to Bazán's upper bound on the condition number [3], which, to the best of our knowledge, constitutes the only analytical result---available in the literature---on the condition number of Vandermonde matrices with nodes in the unit disk, our bound not only takes a much simpler form, but is also sharper for certain node configurations. Moreover, the bound we obtain can be evaluated consistently in a numerically stable fashion, whereas the evaluation of Bazán's bound requires the solution of a linear system of equations which has the same condition number as the Vandermonde matrix under consideration and can therefore lead to numerical instability in practice. As a byproduct, our result---when particularized to the case of nodes on the unit circle---slightly improves upon the Selberg-Moitra bound.

LGMay 4, 2017
Optimal Approximation with Sparsely Connected Deep Neural Networks

Helmut Bölcskei, Philipp Grohs, Gitta Kutyniok et al.

We derive fundamental lower bounds on the connectivity and the memory requirements of deep neural networks guaranteeing uniform approximation rates for arbitrary function classes in $L^2(\mathbb R^d)$. In other words, we establish a connection between the complexity of a function class and the complexity of deep neural networks approximating functions from this class to within a prescribed accuracy. Additionally, we prove that our lower bounds are achievable for a broad family of function classes. Specifically, all function classes that are optimally approximated by a general class of representation systems---so-called \emph{affine systems}---can be approximated by deep neural networks with minimal connectivity and memory requirements. Affine systems encompass a wealth of representation systems from applied harmonic analysis such as wavelets, ridgelets, curvelets, shearlets, $α$-shearlets, and more generally $α$-molecules. Our central result elucidates a remarkable universality property of neural networks and shows that they achieve the optimum approximation properties of all affine systems combined. As a specific example, we consider the class of $α^{-1}$-cartoon-like functions, which is approximated optimally by $α$-shearlets. We also explain how our results can be extended to the case of functions on low-dimensional immersed manifolds. Finally, we present numerical experiments demonstrating that the standard stochastic gradient descent algorithm generates deep neural networks providing close-to-optimal approximation rates. Moreover, these results indicate that stochastic gradient descent can actually learn approximations that are sparse in the representation systems optimally sparsifying the function class the network is trained on.

ITApr 12, 2017
Energy Propagation in Deep Convolutional Neural Networks

Thomas Wiatowski, Philipp Grohs, Helmut Bölcskei

Many practical machine learning tasks employ very deep convolutional neural networks. Such large depths pose formidable computational challenges in training and operating the network. It is therefore important to understand how fast the energy contained in the propagated signals (a.k.a. feature maps) decays across layers. In addition, it is desirable that the feature extractor generated by the network be informative in the sense of the only signal mapping to the all-zeros feature vector being the zero input signal. This "trivial null-set" property can be accomplished by asking for "energy conservation" in the sense of the energy in the feature vector being proportional to that of the corresponding input signal. This paper establishes conditions for energy conservation (and thus for a trivial null-set) for a wide class of deep convolutional neural network-based feature extractors and characterizes corresponding feature map energy decay rates. Specifically, we consider general scattering networks employing the modulus non-linearity and we find that under mild analyticity and high-pass conditions on the filters (which encompass, inter alia, various constructions of Weyl-Heisenberg filters, wavelets, ridgelets, ($α$)-curvelets, and shearlets) the feature map energy decays at least polynomially fast. For broad families of wavelets and Weyl-Heisenberg filters, the guaranteed decay rate is shown to be exponential. Moreover, we provide handy estimates of the number of layers needed to have at least $((1-\varepsilon)\cdot 100)\%$ of the input signal energy be contained in the feature vector.

LGDec 11, 2016
Noisy subspace clustering via matching pursuits

Michael Tschannen, Helmut Bölcskei

Sparsity-based subspace clustering algorithms have attracted significant attention thanks to their excellent performance in practical applications. A prominent example is the sparse subspace clustering (SSC) algorithm by Elhamifar and Vidal, which performs spectral clustering based on an adjacency matrix obtained by sparsely representing each data point in terms of all the other data points via the Lasso. When the number of data points is large or the dimension of the ambient space is high, the computational complexity of SSC quickly becomes prohibitive. Dyer et al. observed that SSC-OMP obtained by replacing the Lasso by the greedy orthogonal matching pursuit (OMP) algorithm results in significantly lower computational complexity, while often yielding comparable performance. The central goal of this paper is an analytical performance characterization of SSC-OMP for noisy data. Moreover, we introduce and analyze the SSC-MP algorithm, which employs matching pursuit (MP) in lieu of OMP. Both SSC-OMP and SSC-MP are proven to succeed even when the subspaces intersect and when the data points are contaminated by severe noise. The clustering conditions we obtain for SSC-OMP and SSC-MP are similar to those for SSC and for the thresholding-based subspace clustering (TSC) algorithm due to Heckel and Bölcskei. Analytical results in combination with numerical results indicate that both SSC-OMP and SSC-MP with a data-dependent stopping criterion automatically detect the dimensions of the subspaces underlying the data. Moreover, experiments on synthetic and on real data show that SSC-MP compares very favorably to SSC, SSC-OMP, TSC, and the nearest subspace neighbor algorithm, both in terms of clustering performance and running time. In addition, we find that, in contrast to SSC-OMP, the performance of SSC-MP is very robust with respect to the choice of parameters in the stopping criteria.

LGDec 4, 2016
Robust nonparametric nearest neighbor random process clustering

Michael Tschannen, Helmut Bölcskei

We consider the problem of clustering noisy finite-length observations of stationary ergodic random processes according to their generative models without prior knowledge of the model statistics and the number of generative models. Two algorithms, both using the $L^1$-distance between estimated power spectral densities (PSDs) as a measure of dissimilarity, are analyzed. The first one, termed nearest neighbor process clustering (NNPC), relies on partitioning the nearest neighbor graph of the observations via spectral clustering. The second algorithm, simply referred to as $k$-means (KM), consists of a single $k$-means iteration with farthest point initialization and was considered before in the literature, albeit with a different dissimilarity measure. We prove that both algorithms succeed with high probability in the presence of noise and missing entries, and even when the generative process PSDs overlap significantly, all provided that the observation length is sufficiently large. Our results quantify the tradeoff between the overlap of the generative process PSDs, the observation length, the fraction of missing entries, and the noise variance. Finally, we provide extensive numerical results for synthetic and real data and find that NNPC outperforms state-of-the-art algorithms in human motion sequence clustering.

LGMay 26, 2016
Discrete Deep Feature Extraction: A Theory and New Architectures

Thomas Wiatowski, Michael Tschannen, Aleksandar Stanić et al.

First steps towards a mathematical theory of deep convolutional neural networks for feature extraction were made---for the continuous-time case---in Mallat, 2012, and Wiatowski and Bölcskei, 2015. This paper considers the discrete case, introduces new convolutional neural network architectures, and proposes a mathematical framework for their analysis. Specifically, we establish deformation and translation sensitivity results of local and global nature, and we investigate how certain structural properties of the input signal are reflected in the corresponding feature vectors. Our theory applies to general filters and general Lipschitz-continuous non-linearities and pooling operators. Experiments on handwritten digit classification and facial landmark detection---including feature importance evaluation---complement the theoretical findings.

LGApr 29, 2016
Deep Convolutional Neural Networks on Cartoon Functions

Philipp Grohs, Thomas Wiatowski, Helmut Bölcskei

Wiatowski and Bölcskei, 2015, proved that deformation stability and vertical translation invariance of deep convolutional neural network-based feature extractors are guaranteed by the network structure per se rather than the specific convolution kernels and non-linearities. While the translation invariance result applies to square-integrable functions, the deformation stability bound holds for band-limited functions only. Many signals of practical relevance (such as natural images) exhibit, however, sharp and curved discontinuities and are, hence, not band-limited. The main contribution of this paper is a deformation stability result that takes these structural properties into account. Specifically, we establish deformation stability bounds for the class of cartoon functions introduced by Donoho, 2001.

ITDec 19, 2015
A Mathematical Theory of Deep Convolutional Neural Networks for Feature Extraction

Thomas Wiatowski, Helmut Bölcskei

Deep convolutional neural networks have led to breakthrough results in numerous practical machine learning tasks such as classification of images in the ImageNet data set, control-policy-learning to play Atari games or the board game Go, and image captioning. Many of these applications first perform feature extraction and then feed the results thereof into a trainable classifier. The mathematical analysis of deep convolutional neural networks for feature extraction was initiated by Mallat, 2012. Specifically, Mallat considered so-called scattering networks based on a wavelet transform followed by the modulus non-linearity in each network layer, and proved translation invariance (asymptotically in the wavelet scale parameter) and deformation stability of the corresponding feature extractor. This paper complements Mallat's results by developing a theory that encompasses general convolutional transforms, or in more technical parlance, general semi-discrete frames (including Weyl-Heisenberg filters, curvelets, shearlets, ridgelets, wavelets, and learned filters), general Lipschitz-continuous non-linearities (e.g., rectified linear units, shifted logistic sigmoids, hyperbolic tangents, and modulus functions), and general Lipschitz-continuous pooling operators emulating, e.g., sub-sampling and averaging. In addition, all of these elements can be different in different network layers. For the resulting feature extractor we prove a translation invariance result of vertical nature in the sense of the features becoming progressively more translation-invariant with increasing network depth, and we establish deformation sensitivity bounds that apply to signal classes such as, e.g., band-limited functions, cartoon functions, and Lipschitz functions.

MLJul 25, 2015
Dimensionality-reduced subspace clustering

Reinhard Heckel, Michael Tschannen, Helmut Bölcskei

Subspace clustering refers to the problem of clustering unlabeled high-dimensional data points into a union of low-dimensional linear subspaces, whose number, orientations, and dimensions are all unknown. In practice one may have access to dimensionality-reduced observations of the data only, resulting, e.g., from undersampling due to complexity and speed constraints on the acquisition device or mechanism. More pertinently, even if the high-dimensional data set is available it is often desirable to first project the data points into a lower-dimensional space and to perform clustering there; this reduces storage requirements and computational cost. The purpose of this paper is to quantify the impact of dimensionality reduction through random projection on the performance of three subspace clustering algorithms, all of which are based on principles from sparse signal recovery. Specifically, we analyze the thresholding based subspace clustering (TSC) algorithm, the sparse subspace clustering (SSC) algorithm, and an orthogonal matching pursuit variant thereof (SSC-OMP). We find, for all three algorithms, that dimensionality reduction down to the order of the subspace dimensions is possible without incurring significant performance degradation. Moreover, these results are order-wise optimal in the sense that reducing the dimensionality further leads to a fundamentally ill-posed clustering problem. Our findings carry over to the noisy case as illustrated through analytical results for TSC and simulations for SSC and SSC-OMP. Extensive experiments on synthetic and real data complement our theoretical findings.

LGApr 21, 2015
Deep Convolutional Neural Networks Based on Semi-Discrete Frames

Thomas Wiatowski, Helmut Bölcskei

Deep convolutional neural networks have led to breakthrough results in practical feature extraction applications. The mathematical analysis of these networks was pioneered by Mallat, 2012. Specifically, Mallat considered so-called scattering networks based on identical semi-discrete wavelet frames in each network layer, and proved translation-invariance as well as deformation stability of the resulting feature extractor. The purpose of this paper is to develop Mallat's theory further by allowing for different and, most importantly, general semi-discrete frames (such as, e.g., Gabor frames, wavelets, curvelets, shearlets, ridgelets) in distinct network layers. This allows to extract wider classes of features than point singularities resolved by the wavelet transform. Our generalized feature extractor is proven to be translation-invariant, and we develop deformation stability results for a larger class of deformations than those considered by Mallat. For Mallat's wavelet-based feature extractor, we get rid of a number of technical conditions. The mathematical engine behind our results is continuous frame theory, which allows us to completely detach the invariance and deformation stability proofs from the particular algebraic structure of the underlying frames.

MLApr 20, 2015
Nonparametric Nearest Neighbor Random Process Clustering

Michael Tschannen, Helmut Bölcskei

We consider the problem of clustering noisy finite-length observations of stationary ergodic random processes according to their nonparametric generative models without prior knowledge of the model statistics and the number of generative models. Two algorithms, both using the L1-distance between estimated power spectral densities (PSDs) as a measure of dissimilarity, are analyzed. The first algorithm, termed nearest neighbor process clustering (NNPC), to the best of our knowledge, is new and relies on partitioning the nearest neighbor graph of the observations via spectral clustering. The second algorithm, simply referred to as k-means (KM), consists of a single k-means iteration with farthest point initialization and was considered before in the literature, albeit with a different measure of dissimilarity and with asymptotic performance results only. We show that both NNPC and KM succeed with high probability under noise and even when the generative process PSDs overlap significantly, all provided that the observation length is sufficiently large. Our results quantify the tradeoff between the overlap of the generative process PSDs, the noise variance, and the observation length. Finally, we present numerical performance results for synthetic and real data.

ITApr 27, 2014
Subspace clustering of dimensionality-reduced data

Reinhard Heckel, Michael Tschannen, Helmut Bölcskei

Subspace clustering refers to the problem of clustering unlabeled high-dimensional data points into a union of low-dimensional linear subspaces, assumed unknown. In practice one may have access to dimensionality-reduced observations of the data only, resulting, e.g., from "undersampling" due to complexity and speed constraints on the acquisition device. More pertinently, even if one has access to the high-dimensional data set it is often desirable to first project the data points into a lower-dimensional space and to perform the clustering task there; this reduces storage requirements and computational cost. The purpose of this paper is to quantify the impact of dimensionality-reduction through random projection on the performance of the sparse subspace clustering (SSC) and the thresholding based subspace clustering (TSC) algorithms. We find that for both algorithms dimensionality reduction down to the order of the subspace dimensions is possible without incurring significant performance degradation. The mathematical engine behind our theorems is a result quantifying how the affinities between subspaces change under random dimensionality reducing projections.

MLMar 13, 2014
Neighborhood Selection for Thresholding-based Subspace Clustering

Reinhard Heckel, Eirikur Agustsson, Helmut Bölcskei

Subspace clustering refers to the problem of clustering high-dimensional data points into a union of low-dimensional linear subspaces, where the number of subspaces, their dimensions and orientations are all unknown. In this paper, we propose a variation of the recently introduced thresholding-based subspace clustering (TSC) algorithm, which applies spectral clustering to an adjacency matrix constructed from the nearest neighbors of each data point with respect to the spherical distance measure. The new element resides in an individual and data-driven choice of the number of nearest neighbors. Previous performance results for TSC, as well as for other subspace clustering algorithms based on spectral clustering, come in terms of an intermediate performance measure, which does not address the clustering error directly. Our main analytical contribution is a performance analysis of the modified TSC algorithm (as well as the original TSC algorithm) in terms of the clustering error directly.

MLNov 13, 2013
Compressive Nonparametric Graphical Model Selection For Time Series

Alexander Jung, Reinhard Heckel, Helmut Bölcskei et al.

We propose a method for inferring the conditional indepen- dence graph (CIG) of a high-dimensional discrete-time Gaus- sian vector random process from finite-length observations. Our approach does not rely on a parametric model (such as, e.g., an autoregressive model) for the vector random process; rather, it only assumes certain spectral smoothness proper- ties. The proposed inference scheme is compressive in that it works for sample sizes that are (much) smaller than the number of scalar process components. We provide analytical conditions for our method to correctly identify the CIG with high probability.

MLJul 18, 2013
Robust Subspace Clustering via Thresholding

Reinhard Heckel, Helmut Bölcskei

The problem of clustering noisy and incompletely observed high-dimensional data points into a union of low-dimensional subspaces and a set of outliers is considered. The number of subspaces, their dimensions, and their orientations are assumed unknown. We propose a simple low-complexity subspace clustering algorithm, which applies spectral clustering to an adjacency matrix obtained by thresholding the correlations between data points. In other words, the adjacency matrix is constructed from the nearest neighbors of each data point in spherical distance. A statistical performance analysis shows that the algorithm exhibits robustness to additive noise and succeeds even when the subspaces intersect. Specifically, our results reveal an explicit tradeoff between the affinity of the subspaces and the tolerable noise level. We furthermore prove that the algorithm succeeds even when the data points are incompletely observed with the number of missing entries allowed to be (up to a log-factor) linear in the ambient dimension. We also propose a simple scheme that provably detects outliers, and we present numerical results on real and synthetic data.

ITMay 15, 2013
Noisy Subspace Clustering via Thresholding

Reinhard Heckel, Helmut Bölcskei

We consider the problem of clustering noisy high-dimensional data points into a union of low-dimensional subspaces and a set of outliers. The number of subspaces, their dimensions, and their orientations are unknown. A probabilistic performance analysis of the thresholding-based subspace clustering (TSC) algorithm introduced recently in [1] shows that TSC succeeds in the noisy case, even when the subspaces intersect. Our results reveal an explicit tradeoff between the allowed noise level and the affinity of the subspaces. We furthermore find that the simple outlier detection scheme introduced in [1] provably succeeds in the noisy case.

ITMar 15, 2013
Subspace Clustering via Thresholding and Spectral Clustering

Reinhard Heckel, Helmut Bölcskei

We consider the problem of clustering a set of high-dimensional data points into sets of low-dimensional linear subspaces. The number of subspaces, their dimensions, and their orientations are unknown. We propose a simple and low-complexity clustering algorithm based on thresholding the correlations between the data points followed by spectral clustering. A probabilistic performance analysis shows that this algorithm succeeds even when the subspaces intersect, and when the dimensions of the subspaces scale (up to a log-factor) linearly in the ambient dimension. Moreover, we prove that the algorithm also succeeds for data points that are subject to erasures with the number of erasures scaling (up to a log-factor) linearly in the ambient dimension. Finally, we propose a simple scheme that provably detects outliers.