Demba Ba

LG
h-index34
35papers
362citations
Novelty54%
AI Score52

35 Papers

SPSep 28, 2022
Unrolled Compressed Blind-Deconvolution

Bahareh Tolooshams, Satish Mulleti, Demba Ba et al.

The problem of sparse multichannel blind deconvolution (S-MBD) arises frequently in many engineering applications such as radar/sonar/ultrasound imaging. To reduce its computational and implementation cost, we propose a compression method that enables blind recovery from much fewer measurements with respect to the full received signal in time. The proposed compression measures the signal through a filter followed by a subsampling, allowing for a significant reduction in implementation cost. We derive theoretical guarantees for the identifiability and recovery of a sparse filter from compressed measurements. Our results allow for the design of a wide class of compression filters. We, then, propose a data-driven unrolled learning framework to learn the compression filter and solve the S-MBD problem. The encoder is a recurrent inference network that maps compressed measurements into an estimate of sparse filters. We demonstrate that our unrolled learning method is more robust to choices of source shapes and has better recovery performance compared to optimization-based methods. Finally, in data-limited applications (fewshot learning), we highlight the superior generalization capability of unrolled learning compared to conventional deep learning.

LGJun 5, 2023
Probabilistic Unrolling: Scalable, Inverse-Free Maximum Likelihood Estimation for Latent Gaussian Models

Alexander Lin, Bahareh Tolooshams, Yves Atchadé et al.

Latent Gaussian models have a rich history in statistics and machine learning, with applications ranging from factor analysis to compressed sensing to time series analysis. The classical method for maximizing the likelihood of these models is the expectation-maximization (EM) algorithm. For problems with high-dimensional latent variables and large datasets, EM scales poorly because it needs to invert as many large covariance matrices as the number of data points. We introduce probabilistic unrolling, a method that combines Monte Carlo sampling with iterative linear solvers to circumvent matrix inversion. Our theoretical analyses reveal that unrolling and backpropagation through the iterations of the solver can accelerate gradient estimation for maximum likelihood estimation. In experiments on simulated and real data, we demonstrate that probabilistic unrolling learns latent Gaussian models up to an order of magnitude faster than gradient EM, with minimal losses in model performance.

LGNov 5, 2025
Sparse, self-organizing ensembles of local kernels detect rare statistical anomalies

Gaia Grosso, Sai Sumedh R. Hindupur, Thomas Fel et al.

Modern artificial intelligence has revolutionized our ability to extract rich and versatile data representations across scientific disciplines. Yet, the statistical properties of these representations remain poorly controlled, causing misspecified anomaly detection (AD) methods to falter. Weak or rare signals can remain hidden within the apparent regularity of normal data, creating a gap in our ability to detect and interpret anomalies. We examine this gap and identify a set of structural desiderata for detection methods operating under minimal prior information: sparsity, to enforce parsimony; locality, to preserve geometric sensitivity; and competition, to promote efficient allocation of model capacity. These principles define a class of self-organizing local kernels that adaptively partition the representation space around regions of statistical imbalance. As an instantiation of these principles, we introduce SparKer, a sparse ensemble of Gaussian kernels trained within a semi-supervised Neyman--Pearson framework to locally model the likelihood ratio between a sample that may contain anomalies and a nominal, anomaly-free reference. We provide theoretical insights into the mechanisms that drive detection and self-organization in the proposed model, and demonstrate the effectiveness of this approach on realistic high-dimensional problems of scientific discovery, open-world novelty detection, intrusion detection, and generative-model validation. Our applications span both the natural- and computer-science domains. We demonstrate that ensembles containing only a handful of kernels can identify statistically significant anomalous locations within representation spaces of thousands of dimensions, underscoring both the interpretability, efficiency and scalability of the proposed approach.

AIFeb 22, 2023
Sparse, Geometric Autoencoder Models of V1

Jonathan Huml, Abiy Tasissa, Demba Ba

The classical sparse coding model represents visual stimuli as a linear combination of a handful of learned basis functions that are Gabor-like when trained on natural image data. However, the Gabor-like filters learned by classical sparse coding far overpredict well-tuned simple cell receptive field (SCRF) profiles. A number of subsequent models have either discarded the sparse dictionary learning framework entirely or have yet to take advantage of the surge in unrolled, neural dictionary learning architectures. A key missing theme of these updates is a stronger notion of \emph{structured sparsity}. We propose an autoencoder architecture whose latent representations are implicitly, locally organized for spectral clustering, which begets artificial neurons better matched to observed primate data. The weighted-$\ell_1$ (WL) constraint in the autoencoder objective function maintains core ideas of the sparse coding framework, yet also offers a promising path to describe the differentiation of receptive fields in terms of a discriminative hierarchy in future work.

LGNov 16, 2022
Learning unfolded networks with a cyclic group structure

Emmanouil Theodosis, Demba Ba

Deep neural networks lack straightforward ways to incorporate domain knowledge and are notoriously considered black boxes. Prior works attempted to inject domain knowledge into architectures implicitly through data augmentation. Building on recent advances on equivariant neural networks, we propose networks that explicitly encode domain knowledge, specifically equivariance with respect to rotations. By using unfolded architectures, a rich framework that originated from sparse coding and has theoretical guarantees, we present interpretable networks with sparse activations. The equivariant unfolded networks compete favorably with baselines, with only a fraction of their parameters, as showcased on (rotated) MNIST and CIFAR-10.

LGNov 3, 2025
Priors in Time: Missing Inductive Biases for Language Model Interpretability

Ekdeep Singh Lubana, Can Rager, Sai Sumedh R. Hindupur et al.

Recovering meaningful concepts from language model activations is a central aim of interpretability. While existing feature extraction methods aim to identify concepts that are independent directions, it is unclear if this assumption can capture the rich temporal structure of language. Specifically, via a Bayesian lens, we demonstrate that Sparse Autoencoders (SAEs) impose priors that assume independence of concepts across time, implying stationarity. Meanwhile, language model representations exhibit rich temporal dynamics, including systematic growth in conceptual dimensionality, context-dependent correlations, and pronounced non-stationarity, in direct conflict with the priors of SAEs. Taking inspiration from computational neuroscience, we introduce a new interpretability objective -- Temporal Feature Analysis -- which possesses a temporal inductive bias to decompose representations at a given time into two parts: a predictable component, which can be inferred from the context, and a residual component, which captures novel information unexplained by the context. Temporal Feature Analyzers correctly parse garden path sentences, identify event boundaries, and more broadly delineate abstract, slow-moving information from novel, fast-moving information, while existing SAEs show significant pitfalls in all the above tasks. Overall, our results underscore the need for inductive biases that match the data in designing robust interpretability tools.

CVDec 23, 2025
Block-Recurrent Dynamics in Vision Transformers

Mozes Jacobs, Thomas Fel, Richard Hakim et al.

As Vision Transformers (ViTs) become standard vision backbones, a mechanistic account of their computational phenomenology is essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest few contiguous phases. To determine whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs: Recurrent Approximations to Phase-structured TransfORmers (Raptor). In small-scale, we demonstrate that stochastic depth and training promote recurrent structure and subsequently correlate with our ability to accurately fit Raptor. We then provide an empirical existence proof for BRH by training a Raptor model to recover $96\%$ of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks at equivalent computational cost. Finally, we leverage our hypothesis to develop a program of Dynamical Interpretability. We find i) directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations, ii) token-specific dynamics, where cls executes sharp late reorientations while patch tokens exhibit strong late-stage coherence toward their mean direction, and iii) a collapse to low rank updates in late depth, consistent with convergence to low-dimensional attractors. Altogether, we find a compact recurrent program emerges along ViT depth, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.

NCNov 30, 2023
Clustering Inductive Biases with Unrolled Networks

Jonathan Huml, Abiy Tasissa, Demba Ba

The classical sparse coding (SC) model represents visual stimuli as a linear combination of a handful of learned basis functions that are Gabor-like when trained on natural image data. However, the Gabor-like filters learned by classical sparse coding far overpredict well-tuned simple cell receptive field profiles observed empirically. While neurons fire sparsely, neuronal populations are also organized in physical space by their sensitivity to certain features. In V1, this organization is a smooth progression of orientations along the cortical sheet. A number of subsequent models have either discarded the sparse dictionary learning framework entirely or whose updates have yet to take advantage of the surge in unrolled, neural dictionary learning architectures. A key missing theme of these updates is a stronger notion of \emph{structured sparsity}. We propose an autoencoder architecture (WLSC) whose latent representations are implicitly, locally organized for spectral clustering through a Laplacian quadratic form of a bipartite graph, which generates a diverse set of artificial receptive fields that match primate data in V1 as faithfully as recent contrastive frameworks like Local Low Dimensionality, or LLD \citep{lld} that discard sparse dictionary learning. By unifying sparse and smooth coding in models of the early visual cortex through our autoencoder, we also show that our regularization can be interpreted as early-stage specialization of receptive fields to certain classes of stimuli; that is, we induce a weak clustering bias for later stages of cortex where functional and spatial segregation (i.e. topography) are known to occur. The results show an imperative for \emph{spatial regularization} of both the receptive fields and firing rates to begin to describe feature disentanglement in V1 and beyond.

SPSep 30, 2023
An Efficient Algorithm for Clustered Multi-Task Compressive Sensing

Alexander Lin, Demba Ba

This paper considers clustered multi-task compressive sensing, a hierarchical model that solves multiple compressive sensing tasks by finding clusters of tasks that leverage shared information to mutually improve signal reconstruction. The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions. The main bottleneck involves repeated matrix inversion and log-determinant computation for multiple large covariance matrices. We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices. Our approach combines Monte Carlo sampling with iterative linear solvers. Our experiments reveal that compared to the existing baseline, our algorithm can be up to thousands of times faster and an order of magnitude more memory-efficient.

LGMar 3, 2025
Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry

Sai Sumedh R. Hindupur, Ekdeep Singh Lubana, Thomas Fel et al.

Sparse Autoencoders (SAEs) are widely used to interpret neural networks by identifying meaningful concepts from their representations. However, do SAEs truly uncover all concepts a model relies on, or are they inherently biased toward certain kinds of concepts? We introduce a unified framework that recasts SAEs as solutions to a bilevel optimization problem, revealing a fundamental challenge: each SAE imposes structural assumptions about how concepts are encoded in model representations, which in turn shapes what it can and cannot detect. This means different SAEs are not interchangeable -- switching architectures can expose entirely new concepts or obscure existing ones. To systematically probe this effect, we evaluate SAEs across a spectrum of settings: from controlled toy models that isolate key variables, to semi-synthetic experiments on real model activations and finally to large-scale, naturalistic datasets. Across this progression, we examine two fundamental properties that real-world concepts often exhibit: heterogeneity in intrinsic dimensionality (some concepts are inherently low-dimensional, others are not) and nonlinear separability. We show that SAEs fail to recover concepts when these properties are ignored, and we design a new SAE that explicitly incorporates both, enabling the discovery of previously hidden concepts and reinforcing our theoretical insights. Our findings challenge the idea of a universal SAE and underscores the need for architecture-specific choices in model interpretability. Overall, we argue an SAE does not just reveal concepts -- it determines what can be seen at all.

CVFeb 18, 2025
Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models

Thomas Fel, Ekdeep Singh Lubana, Jacob S. Prince et al. · harvard

Sparse Autoencoders (SAEs) have emerged as a powerful framework for machine learning interpretability, enabling the unsupervised decomposition of model representations into a dictionary of abstract, human-interpretable concepts. However, we reveal a fundamental limitation: existing SAEs exhibit severe instability, as identical models trained on similar datasets can produce sharply different dictionaries, undermining their reliability as an interpretability tool. To address this issue, we draw inspiration from the Archetypal Analysis framework introduced by Cutler & Breiman (1994) and present Archetypal SAEs (A-SAE), wherein dictionary atoms are constrained to the convex hull of data. This geometric anchoring significantly enhances the stability of inferred dictionaries, and their mildly relaxed variants RA-SAEs further match state-of-the-art reconstruction abilities. To rigorously assess dictionary quality learned by SAEs, we introduce two new benchmarks that test (i) plausibility, if dictionaries recover "true" classification directions and (ii) identifiability, if dictionaries disentangle synthetic concept mixtures. Across all evaluations, RA-SAEs consistently yield more structured representations while uncovering novel, semantically meaningful concepts in large-scale vision models.

LGJun 3, 2025
From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit

Valérie Costa, Thomas Fel, Ekdeep Singh Lubana et al.

Motivated by the hypothesis that neural network representations encode abstract, interpretable features as linearly accessible, approximately orthogonal directions, sparse autoencoders (SAEs) have become a popular tool in interpretability. However, recent work has demonstrated phenomenology of model representations that lies outside the scope of this hypothesis, showing signatures of hierarchical, nonlinear, and multi-dimensional features. This raises the question: do SAEs represent features that possess structure at odds with their motivating hypothesis? If not, does avoiding this mismatch help identify said features and gain further insights into neural network representations? To answer these questions, we take a construction-based approach and re-contextualize the popular matching pursuits (MP) algorithm from sparse coding to design MP-SAE -- an SAE that unrolls its encoder into a sequence of residual-guided steps, allowing it to capture hierarchical and nonlinearly accessible features. Comparing this architecture with existing SAEs on a mixture of synthetic and natural data settings, we show: (i) hierarchical concepts induce conditionally orthogonal features, which existing SAEs are unable to faithfully capture, and (ii) the nonlinear encoding step of MP-SAE recovers highly meaningful features, helping us unravel shared structure in the seemingly dichotomous representation spaces of different modalities in a vision-language model, hence demonstrating the assumption that useful features are solely linearly accessible is insufficient. We also show that the sequential encoder principle of MP-SAE affords an additional benefit of adaptive sparsity at inference time, which may be of independent interest. Overall, we argue our results provide credence to the idea that interpretability should begin with the phenomenology of representations, with methods emerging from assumptions that fit it.

CVOct 8, 2025
Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry

Thomas Fel, Binxu Wang, Michael A. Lepori et al. · harvard

DINOv2 is routinely deployed to recognize objects, scenes, and actions; yet the nature of what it perceives remains unknown. As a working baseline, we adopt the Linear Representation Hypothesis (LRH) and operationalize it using SAEs, producing a 32,000-unit dictionary that serves as the interpretability backbone of our study, which unfolds in three parts. In the first part, we analyze how different downstream tasks recruit concepts from our learned dictionary, revealing functional specialization: classification exploits "Elsewhere" concepts that fire everywhere except on target objects, implementing learned negations; segmentation relies on boundary detectors forming coherent subspaces; depth estimation draws on three distinct monocular depth cues matching visual neuroscience principles. Following these functional results, we analyze the geometry and statistics of the concepts learned by the SAE. We found that representations are partly dense rather than strictly sparse. The dictionary evolves toward greater coherence and departs from maximally orthogonal ideals (Grassmannian frames). Within an image, tokens occupy a low dimensional, locally connected set persisting after removing position. These signs suggest representations are organized beyond linear sparsity alone. Synthesizing these observations, we propose a refined view: tokens are formed by combining convex mixtures of archetypes (e.g., a rabbit among animals, brown among colors, fluffy among textures). This structure is grounded in Gardenfors' conceptual spaces and in the model's mechanism as multi-head attention produces sums of convex mixtures, defining regions bounded by archetypes. We introduce the Minkowski Representation Hypothesis (MRH) and examine its empirical signatures and implications for interpreting vision-transformer representations.

LGJun 5, 2025
Evaluating Sparse Autoencoders: From Shallow Design to Matching Pursuit

Valérie Costa, Thomas Fel, Ekdeep Singh Lubana et al.

Sparse autoencoders (SAEs) have recently become central tools for interpretability, leveraging dictionary learning principles to extract sparse, interpretable features from neural representations whose underlying structure is typically unknown. This paper evaluates SAEs in a controlled setting using MNIST, which reveals that current shallow architectures implicitly rely on a quasi-orthogonality assumption that limits the ability to extract correlated features. To move beyond this, we compare them with an iterative SAE that unrolls Matching Pursuit (MP-SAE), enabling the residual-guided extraction of correlated features that arise in hierarchical settings such as handwritten digit generation while guaranteeing monotonic improvement of the reconstruction as more atoms are selected.

CVFeb 9, 2025
Traveling Waves Integrate Spatial Information Through Time

Mozes Jacobs, Roberto C. Budzinski, Lyle Muller et al.

Traveling waves of neural activity are widely observed in the brain, but their precise computational function remains unclear. One prominent hypothesis is that they enable the transfer and integration of spatial information across neural populations. However, few computational models have explored how traveling waves might be harnessed to perform such integrative processing. Drawing inspiration from the famous "Can one hear the shape of a drum?" problem -- which highlights how normal modes of wave dynamics encode geometric information -- we investigate whether similar principles can be leveraged in artificial neural networks. Specifically, we introduce convolutional recurrent neural networks that learn to produce traveling waves in their hidden states in response to visual stimuli, enabling spatial integration. By then treating these wave-like activation sequences as visual representations themselves, we obtain a powerful representational space that outperforms local feed-forward networks on tasks requiring global spatial context. In particular, we observe that traveling waves effectively expand the receptive field of locally connected neurons, supporting long-range encoding and communication of information. We demonstrate that models equipped with this mechanism solve visual semantic segmentation tasks demanding global integration, significantly outperforming local feed-forward models and rivaling non-local U-Net models with fewer parameters. As a first step toward traveling-wave-based communication and visual representation in artificial networks, our findings suggest wave-dynamics may provide efficiency and training stability benefits, while simultaneously offering a new framework for connecting models to biological recordings of neural activity.

LGMay 29, 2023
Learning Linear Groups in Neural Networks

Emmanouil Theodosis, Karim Helwani, Demba Ba

Employing equivariance in neural networks leads to greater parameter efficiency and improved generalization performance through the encoding of domain knowledge in the architecture; however, the majority of existing approaches require an a priori specification of the desired symmetries. We present a neural network architecture, Linear Group Networks (LGNs), for learning linear groups acting on the weight space of neural networks. Linear groups are desirable due to their inherent interpretability, as they can be represented as finite matrices. LGNs learn groups without any supervision or knowledge of the hidden symmetries in the data and the groups can be mapped to well known operations in machine learning. We use LGNs to learn groups on multiple datasets while considering different downstream tasks; we demonstrate that the linear group structure depends on both the data distribution and the considered task.

SPFeb 25, 2022
High-Dimensional Sparse Bayesian Learning without Covariance Matrices

Alexander Lin, Andrew H. Song, Berkin Bilgic et al.

Sparse Bayesian learning (SBL) is a powerful framework for tackling the sparse coding problem. However, the most popular inference algorithms for SBL become too expensive for high-dimensional settings, due to the need to store and compute a large covariance matrix. We introduce a new inference scheme that avoids explicit construction of the covariance matrix by solving multiple linear systems in parallel to obtain the posterior moments for SBL. Our approach couples a little-known diagonal estimation result from numerical linear algebra with the conjugate gradient algorithm. On several simulations, our method scales better than existing approaches in computation time and memory, especially for structured dictionaries capable of fast matrix-vector multiplication.

LGOct 10, 2021
Mixture Model Auto-Encoders: Deep Clustering through Dictionary Learning

Alexander Lin, Andrew H. Song, Demba Ba

State-of-the-art approaches for clustering high-dimensional data utilize deep auto-encoder architectures. Many of these networks require a large number of parameters and suffer from a lack of interpretability, due to the black-box nature of the auto-encoders. We introduce Mixture Model Auto-Encoders (MixMate), a novel architecture that clusters data by performing inference on a generative model. Derived from the perspective of sparse dictionary learning and mixture models, MixMate comprises several auto-encoders, each tasked with reconstructing data in a distinct cluster, while enforcing sparsity in the latent space. Through experiments on various image datasets, we show that MixMate achieves competitive performance compared to state-of-the-art deep clustering algorithms, while using orders of magnitude fewer parameters.

LGMay 31, 2021
Stable and Interpretable Unrolled Dictionary Learning

Bahareh Tolooshams, Demba Ba

The dictionary learning problem, representing data as a combination of a few atoms, has long stood as a popular method for learning representations in statistics and signal processing. The most popular dictionary learning algorithm alternates between sparse coding and dictionary update steps, and a rich literature has studied its theoretical convergence. The success of dictionary learning relies on access to a "good" initial estimate of the dictionary and the ability of the sparse coding step to provide an unbiased estimate of the code. The growing popularity of unrolled sparse coding networks has led to the empirical finding that backpropagation through such networks performs dictionary learning. We offer the theoretical analysis of these empirical results through PUDLE, a Provable Unrolled Dictionary LEarning method. We provide conditions on the network initialization and data distribution sufficient to recover and preserve the support of the latent code. Additionally, we address two challenges; first, the vanilla unrolled sparse coding computes a biased code estimate, and second, gradients during backpropagated learning can become unstable. We show approaches to reduce the bias of the code estimate in the forward pass, and that of the dictionary estimate in the backward pass. We propose strategies to resolve the learning instability by tuning network parameters and modifying the loss function. Overall, we highlight the impact of loss, unrolling, and backpropagation on convergence. We complement our findings through synthetic and image denoising experiments. Finally, we demonstrate PUDLE's interpretability, a driving factor in designing deep networks based on iterative optimizations, by building a mathematical relation between network weights, its output, and the training set.

SPMay 21, 2021
Covariance-Free Sparse Bayesian Learning

Alexander Lin, Andrew H. Song, Berkin Bilgic et al.

Sparse Bayesian learning (SBL) is a powerful framework for tackling the sparse coding problem while also providing uncertainty quantification. The most popular inference algorithms for SBL exhibit prohibitively large computational costs for high-dimensional problems due to the need to maintain a large covariance matrix. To resolve this issue, we introduce a new method for accelerating SBL inference -- named covariance-free expectation maximization (CoFEM) -- that avoids explicit computation of the covariance matrix. CoFEM solves multiple linear systems to obtain unbiased estimates of the posterior statistics needed by SBL. This is accomplished by exploiting innovations from numerical linear algebra such as preconditioned conjugate gradient and a little-known diagonal estimation rule. For a large class of compressed sensing matrices, we provide theoretical justifications for why our method scales well in high-dimensional settings. Through simulations, we show that CoFEM can be up to thousands of times faster than existing baselines without sacrificing coding accuracy. Through applications to calcium imaging deconvolution and multi-contrast MRI reconstruction, we show that CoFEM enables SBL to tractably tackle high-dimensional sparse coding problems of practical interest.

SPApr 28, 2021
Weighed l1 on the simplex: Compressive sensing meets locality

Abiy Tasissa, Pranay Tankala, Demba Ba

Sparse manifold learning algorithms combine techniques in manifold learning and sparse optimization to learn features that could be utilized for downstream tasks. The standard setting of compressive sensing can not be immediately applied to this setup. Due to the intrinsic geometric structure of data, dictionary atoms might be redundant and do not satisfy the restricted isometry property or coherence condition. In addition, manifold learning emphasizes learning local geometry which is not reflected in a standard $\ell_1$ minimization problem. We propose weighted $\ell_0$ and weighted $\ell_1$ metrics that encourage representation via neighborhood atoms suited for dictionary based manifold learning. Assuming that the data is generated from Delaunay triangulation, we show the equivalence of weighted $\ell_0$ and weighted $\ell_1$. We discuss an optimization program that learns the dictionaries and sparse coefficients and demonstrate the utility of our regularization on synthetic and real datasets.

LGMar 28, 2021
Gaussian Process Convolutional Dictionary Learning

Andrew H. Song, Bahareh Tolooshams, Demba Ba

Convolutional dictionary learning (CDL), the problem of estimating shift-invariant templates from data, is typically conducted in the absence of a prior/structure on the templates. In data-scarce or low signal-to-noise ratio (SNR) regimes, learned templates overfit the data and lack smoothness, which can affect the predictive performance of downstream tasks. To address this limitation, we propose GPCDL, a convolutional dictionary learning framework that enforces priors on templates using Gaussian Processes (GPs). With the focus on smoothness, we show theoretically that imposing a GP prior is equivalent to Wiener filtering the learned templates, thereby suppressing high-frequency components and promoting smoothness. We show that the algorithm is a simple extension of the classical iteratively reweighted least squares algorithm, independent of the choice of GP kernels. This property allows one to experiment flexibly with different smoothness assumptions. Through simulation, we show that GPCDL learns smooth dictionaries with better accuracy than the unregularized alternative across a range of SNRs. Through an application to neural spiking data, we show that GPCDL learns a more accurate and visually-interpretable smooth dictionary, leading to superior predictive performance compared to non-regularized CDL, as well as parametric alternatives.

LGFeb 13, 2021
On the convergence of group-sparse autoencoders

Emmanouil Theodosis, Bahareh Tolooshams, Pranay Tankala et al.

Recent approaches in the theoretical analysis of model-based deep learning architectures have studied the convergence of gradient descent in shallow ReLU networks that arise from generative models whose hidden layers are sparse. Motivated by the success of architectures that impose structured forms of sparsity, we introduce and study a group-sparse autoencoder that accounts for a variety of generative models, and utilizes a group-sparse ReLU activation function to force the non-zero units at a given layer to occur in blocks. For clustering models, inputs that result in the same group of active units belong to the same cluster. We proceed to analyze the gradient dynamics of a shallow instance of the proposed autoencoder, trained with data adhering to a group-sparse generative model. In this setting, we theoretically prove the convergence of the network parameters to a neighborhood of the generating matrix. We validate our model through numerical analysis and highlight the superior performance of networks with a group-sparse ReLU compared to networks that utilize traditional ReLUs, both in sparse coding and in parameter recovery tasks. We also provide real data experiments to corroborate the simulated results, and emphasize the clustering capabilities of structured sparsity models.

LGDec 3, 2020
K-Deep Simplex: Deep Manifold Learning via Local Dictionaries

Pranay Tankala, Abiy Tasissa, James M. Murphy et al.

We propose K-Deep Simplex(KDS) which, given a set of data points, learns a dictionary comprising synthetic landmarks, along with representation coefficients supported on a simplex. KDS employs a local weighted $\ell_1$ penalty that encourages each data point to represent itself as a convex combination of nearby landmarks. We solve the proposed optimization program using alternating minimization and design an efficient, interpretable autoencoder using algorithm unrolling. We theoretically analyze the proposed program by relating the weighted $\ell_1$ penalty in KDS to a weighted $\ell_0$ program. Assuming that the data are generated from a Delaunay triangulation, we prove the equivalence of the weighted $\ell_1$ and weighted $\ell_0$ programs. We further show the stability of the representation coefficients under mild geometrical assumptions. If the representation coefficients are fixed, we prove that the sub-problem of minimizing over the dictionary yields a unique solution. Further, we show that low-dimensional representations can be efficiently obtained from the covariance of the coefficient matrix. Experiments show that the algorithm is highly efficient and performs competitively on synthetic and real data sets.

SPOct 22, 2020
Unfolding Neural Networks for Compressive Multichannel Blind Deconvolution

Bahareh Tolooshams, Satish Mulleti, Demba Ba et al.

We propose a learned-structured unfolding neural network for the problem of compressive sparse multichannel blind-deconvolution. In this problem, each channel's measurements are given as convolution of a common source signal and sparse filter. Unlike prior works where the compression is achieved either through random projections or by applying a fixed structured compression matrix, this paper proposes to learn the compression matrix from data. Given the full measurements, the proposed network is trained in an unsupervised fashion to learn the source and estimate sparse filters. Then, given the estimated source, we learn a structured compression operator while optimizing for signal reconstruction and sparse filter recovery. The efficient structure of the compression allows its practical hardware implementation. The proposed neural network is an autoencoder constructed based on an unfolding approach: upon training, the encoder maps the compressed measurements into an estimate of sparse filters using the compression operator and the source, and the linear convolutional decoder reconstructs the full measurements. We demonstrate that our method is superior to classical structured compressive sparse multichannel blind-deconvolution methods in terms of accuracy and speed of sparse filter recovery.

ITJun 16, 2020
Towards improving discriminative reconstruction via simultaneous dense and sparse coding

Abiy Tasissa, Emmanouil Theodosis, Bahareh Tolooshams et al.

Discriminative features extracted from the sparse coding model have been shown to perform well for classification. Recent deep learning architectures have further improved reconstruction in inverse problems by considering new dense priors learned from data. We propose a novel dense and sparse coding model that integrates both representation capability and discriminative features. The model studies the problem of recovering a dense vector $\mathbf{x}$ and a sparse vector $\mathbf{u}$ given measurements of the form $\mathbf{y} = \mathbf{A}\mathbf{x}+\mathbf{B}\mathbf{u}$. Our first analysis proposes a geometric condition based on the minimal angle between spanning subspaces corresponding to the matrices $\mathbf{A}$ and $\mathbf{B}$ that guarantees unique solution to the model. The second analysis shows that, under mild assumptions, a convex program recovers the dense and sparse components. We validate the effectiveness of the model on simulated data and propose a dense and sparse autoencoder (DenSaE) tailored to learning the dictionaries from the dense and sparse model. We demonstrate that (i) DenSaE denoises natural images better than architectures derived from the sparse coding model ($\mathbf{B}\mathbf{u}$), (ii) in the presence of noise, training the biases in the latter amounts to implicitly learning the $\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{u}$ model, (iii) $\mathbf{A}$ and $\mathbf{B}$ capture low- and high-frequency contents, respectively, and (iv) compared to the sparse coding model, DenSaE offers a balance between discriminative power and representation.

LGAug 25, 2019
RandNet: deep learning with compressed measurements of images

Thomas Chang, Bahareh Tolooshams, Demba Ba

Principal component analysis, dictionary learning, and auto-encoders are all unsupervised methods for learning representations from a large amount of training data. In all these methods, the higher the dimensions of the input data, the longer it takes to learn. We introduce a class of neural networks, termed RandNet, for learning representations using compressed random measurements of data of interest, such as images. RandNet extends the convolutional recurrent sparse auto-encoder architecture to dense networks and, more importantly, to the case when the input data are compressed random measurements of the original data. Compressing the input data makes it possible to fit a larger number of batches in memory during training. Moreover, in the case of sparse measurements,training is more efficient computationally. We demonstrate that, in unsupervised settings, RandNet performs dictionary learning using compressed data. In supervised settings, we show that RandNet can classify MNIST images with minimal loss in accuracy, despite being trained with random projections of the images that result in a 50% reduction in size. Overall, our results provide a general principled framework for training neural networks using compressed data.

LGJul 23, 2019
Convolutional Dictionary Learning in Hierarchical Networks

Javier Zazo, Bahareh Tolooshams, Demba Ba

Filter banks are a popular tool for the analysis of piecewise smooth signals such as natural images. Motivated by the empirically observed properties of scale and detail coefficients of images in the wavelet domain, we propose a hierarchical deep generative model of piecewise smooth signals that is a recursion across scales: the low pass scale coefficients at one layer are obtained by filtering the scale coefficients at the next layer, and adding a high pass detail innovation obtained by filtering a sparse vector. This recursion describes a linear dynamic system that is a non-Gaussian Markov process across scales and is closely related to multilayer-convolutional sparse coding (ML-CSC) generative model for deep networks, except that our model allows for deeper architectures, and combines sparse and non-sparse signal representations. We propose an alternating minimization algorithm for learning the filters in this hierarchical model given observations at layer zero, e.g., natural images. The algorithm alternates between a coefficient-estimation step and a filter update step. The coefficient update step performs sparse (detail) and smooth (scale) coding and, when unfolded, leads to a deep neural network. We use MNIST to demonstrate the representation capabilities of the model, and its derived features (coefficients) for classification.

SPJul 22, 2019
Fast Convolutional Dictionary Learning off the Grid

Andrew H. Song, Francisco J. Flores, Demba Ba

Given a continuous-time signal that can be modeled as the superposition of localized, time-shifted events from multiple sources, the goal of Convolutional Dictionary Learning (CDL) is to identify the location of the events--by Convolutional Sparse Coding (CSC)--and learn the template for each source--by Convolutional Dictionary Update (CDU). In practice, because we observe samples of the continuous-time signal on a uniformly-sampled grid in discrete time, classical CSC methods can only produce estimates of the times when the events occur on this grid, which degrades the performance of the CDU. We introduce a CDL framework that significantly reduces the errors arising from performing the estimation in discrete time. Specifically, we construct an expanded dictionary that comprises, not only discrete-time shifts of the templates, but also interpolated variants, obtained by bandlimited interpolation, that account for continuous-time shifts. For CSC, we develop a novel computationally efficient CSC algorithm, termed Convolutional Orthogonal Matching Pursuit with interpolated dictionary (COMP-INTERP). We benchmarked COMP-INTERP to Contiunuous Basis Pursuit (CBP), the state-of-the-art CSC algorithm for estimating off-the-grid events, and demonstrate, on simulated data, that 1) COMP-INTERP achieves a similar level of accuracy, and 2) is two orders of magnitude faster. For CDU, we derive a novel procedure to update the templates given sparse codes that can occur both on and off the discrete-time grid. We also show that 3) dictionary update with the overcomplete dictionary yields more accurate templates. Finally, we apply the algorithms to the spike sorting problem on electrophysiology recording and show their competitive performance.

LGJul 7, 2019
Convolutional dictionary learning based auto-encoders for natural exponential-family distributions

Bahareh Tolooshams, Andrew H. Song, Simona Temereanca et al.

We introduce a class of auto-encoder neural networks tailored to data from the natural exponential family (e.g., count data). The architectures are inspired by the problem of learning the filters in a convolutional generative model with sparsity constraints, often referred to as convolutional dictionary learning (CDL). Our work is the first to combine ideas from convolutional generative models and deep learning for data that are naturally modeled with a non-Gaussian distribution (e.g., binomial and Poisson). This perspective provides us with a scalable and flexible framework that can be re-purposed for a wide range of tasks and assumptions on the generative model. Specifically, the iterative optimization procedure for solving CDL, an unsupervised task, is mapped to an unfolded and constrained neural network, with iterative adjustments to the inputs to account for the generative distribution. We also show that the framework can easily be extended for discriminative training, appropriate for a supervised task. We demonstrate 1) that fitting the generative model to learn, in an unsupervised fashion, the latent stimulus that underlies neural spiking data leads to better goodness-of-fit compared to other baselines, 2) competitive performance compared to state-of-the-art algorithms for supervised Poisson image denoising, with significantly fewer parameters, and 3) gradient dynamics of shallow binomial auto-encoder.

LGApr 18, 2019
Deep Residual Autoencoders for Expectation Maximization-inspired Dictionary Learning

Bahareh Tolooshams, Sourav Dey, Demba Ba

We introduce a neural-network architecture, termed the constrained recurrent sparse autoencoder (CRsAE), that solves convolutional dictionary learning problems, thus establishing a link between dictionary learning and neural networks. Specifically, we leverage the interpretation of the alternating-minimization algorithm for dictionary learning as an approximate Expectation-Maximization algorithm to develop autoencoders that enable the simultaneous training of the dictionary and regularization parameter (ReLU bias). The forward pass of the encoder approximates the sufficient statistics of the E-step as the solution to a sparse coding problem, using an iterative proximal gradient algorithm called FISTA. The encoder can be interpreted either as a recurrent neural network or as a deep residual network, with two-sided ReLU non-linearities in both cases. The M-step is implemented via a two-stage back-propagation. The first stage relies on a linear decoder applied to the encoder and a norm-squared loss. It parallels the dictionary update step in dictionary learning. The second stage updates the regularization parameter by applying a loss function to the encoder that includes a prior on the parameter motivated by Bayesian statistics. We demonstrate in an image-denoising task that CRsAE learns Gabor-like filters, and that the EM-inspired approach for learning biases is superior to the conventional approach. In an application to recordings of electrical activity from the brain, we demonstrate that CRsAE learns realistic spike templates and speeds up the process of identifying spike times by 900x compared to algorithms based on convex optimization.

MLOct 23, 2018
Clustering Time Series with Nonlinear Dynamics: A Bayesian Non-Parametric and Particle-Based Approach

Alexander Lin, Yingzhuo Zhang, Jeremy Heng et al.

We propose a general statistical framework for clustering multiple time series that exhibit nonlinear dynamics into an a-priori-unknown number of sub-groups. Our motivation comes from neuroscience, where an important problem is to identify, within a large assembly of neurons, subsets that respond similarly to a stimulus or contingency. Upon modeling the multiple time series as the output of a Dirichlet process mixture of nonlinear state-space models, we derive a Metropolis-within-Gibbs algorithm for full Bayesian inference that alternates between sampling cluster assignments and sampling parameter values that form the basis of the clustering. The Metropolis step employs recent innovations in particle-based methods. We apply the framework to clustering time series acquired from the prefrontal cortex of mice in an experiment designed to characterize the neural underpinnings of fear.

LGJul 12, 2018
Scalable Convolutional Dictionary Learning with Constrained Recurrent Sparse Auto-encoders

Bahareh Tolooshams, Sourav Dey, Demba Ba

Given a convolutional dictionary underlying a set of observed signals, can a carefully designed auto-encoder recover the dictionary in the presence of noise? We introduce an auto-encoder architecture, termed constrained recurrent sparse auto-encoder (CRsAE), that answers this question in the affirmative. Given an input signal and an approximate dictionary, the encoder finds a sparse approximation using FISTA. The decoder reconstructs the signal by applying the dictionary to the output of the encoder. The encoder and decoder in CRsAE parallel the sparse-coding and dictionary update steps in optimization-based alternating-minimization schemes for dictionary learning. As such, the parameters of the encoder and decoder are not independent, a constraint which we enforce for the first time. We derive the back-propagation algorithm for CRsAE. CRsAE is a framework for blind source separation that, only knowing the number of sources (dictionary elements), and assuming sparsely-many can overlap, is able to separate them. We demonstrate its utility in the context of spike sorting, a source separation problem in computational neuroscience. We demonstrate the ability of CRsAE to recover the underlying dictionary and characterize its sensitivity as a function of SNR.

SPJul 5, 2018
Deeply-Sparse Signal rePresentations ($\text{D}\text{S}^2\text{P}$)

Demba Ba

A recent line of work shows that a deep neural network with ReLU nonlinearities arises from a finite sequence of cascaded sparse coding models, the outputs of which, except for the last element in the cascade, are sparse and unobservable. That is, intermediate outputs deep in the cascade are sparse, hence the title of this manuscript. We show here, using techniques from the dictionary learning literature that, if the measurement matrices in the cascaded sparse coding model (a) satisfy RIP and (b) all have sparse columns except for the last, they can be recovered with high probability. We propose two algorithms for this purpose: one that recovers the matrices in a forward sequence, and another that recovers them in a backward sequence. The method of choice in deep learning to solve this problem is by training an auto-encoder. Our algorithms provide a sound alternative, with theoretical guarantees, as well upper bounds on sample complexity. The theory shows that the learning complexity of the forward algorithm depends on the number of hidden units at the deepest layer and the number of active neurons at that layer (sparsity). In addition, the theory relates the number of hidden units in successive layers, thus giving a practical prescription for designing deep ReLU neural networks. Because it puts fewer restrictions on the architecture, the backward algorithm requires more data. We demonstrate the deep dictionary learning algorithm via simulations. Finally, we use a coupon-collection argument to conjecture a lower bound on sample complexity that gives some insight as to why deep networks require more data to train than shallow ones.

MLMay 18, 2018
Multitaper Spectral Estimation HDP-HMMs for EEG Sleep Inference

Leon Chlon, Andrew Song, Sandya Subramanian et al.

Electroencephalographic (EEG) monitoring of neural activity is widely used for sleep disorder diagnostics and research. The standard of care is to manually classify 30-second epochs of EEG time-domain traces into 5 discrete sleep stages. Unfortunately, this scoring process is subjective and time-consuming, and the defined stages do not capture the heterogeneous landscape of healthy and clinical neural dynamics. This motivates the search for a data-driven and principled way to identify the number and composition of salient, reoccurring brain states present during sleep. To this end, we propose a Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM), combined with wide-sense stationary (WSS) time series spectral estimation to construct a generative model for personalized subject sleep states. In addition, we employ multitaper spectral estimation to further reduce the large variance of the spectral estimates inherent to finite-length EEG measurements. By applying our method to both simulated and human sleep data, we arrive at three main results: 1) a Bayesian nonparametric automated algorithm that recovers general temporal dynamics of sleep, 2) identification of subject-specific "microstates" within canonical sleep stages, and 3) discovery of stage-dependent sub-oscillations with shared spectral signatures across subjects.