Boaz Nadler

ML
h-index43
37papers
1,197citations
Novelty62%
AI Score46

37 Papers

CLJul 3, 2024
Truth is Universal: Robust Detection of Lies in LLMs

Lennart Bürger, Fred A. Hamprecht, Boaz Nadler

Large Language Models (LLMs) have revolutionised natural language processing, exhibiting impressive human-like capabilities. In particular, LLMs are capable of "lying", knowingly outputting false statements. Hence, it is of interest and importance to develop methods to detect when LLMs lie. Indeed, several authors trained classifiers to detect LLM lies based on their internal model activations. However, other researchers showed that these classifiers may fail to generalise, for example to negated statements. In this work, we aim to develop a robust method to detect when an LLM is lying. To this end, we make the following key contributions: (i) We demonstrate the existence of a two-dimensional subspace, along which the activation vectors of true and false statements can be separated. Notably, this finding is universal and holds for various LLMs, including Gemma-7B, LLaMA2-13B, Mistral-7B and LLaMA3-8B. Our analysis explains the generalisation failures observed in previous studies and sets the stage for more robust lie detection; (ii) Building upon (i), we construct an accurate LLM lie detector. Empirically, our proposed classifier achieves state-of-the-art performance, attaining 94% accuracy in both distinguishing true from false factual statements and detecting lies generated in real-world scenarios.

NAMar 4, 2018
Newton correction methods for computing real eigenpairs of symmetric tensors

Ariel Jaffe, Roi Weiss, Boaz Nadler

Real eigenpairs of symmetric tensors play an important role in multiple applications. In this paper we propose and analyze a fast iterative Newton-based method to compute real eigenpairs of symmetric tensors. We derive sufficient conditions for a real eigenpair to be a stable fixed point for our method, and prove that given a sufficiently close initial guess, the convergence rate is quadratic. Empirically, our method converges to a significantly larger number of eigenpairs compared to previously proposed iterative methods, and with enough random initializations typically finds all real eigenpairs. In particular, for a generic symmetric tensor, the sufficient conditions for local convergence of our Newton-based method hold simultaneously for all its real eigenpairs.

NAApr 23, 2016
The discrete sign problem: uniqueness, recovery algorithms and phase retrieval applications

Ben Leshem, Oren Raz, Ariel Jaffe et al.

In this paper we consider the following real-valued and finite dimensional specific instance of the 1-D classical phase retrieval problem. Let ${\bf F}\in\mathbb{R}^N$ be an $N$-dimensional vector, whose discrete Fourier transform has a compact support. The sign problem is to recover ${\bf F}$ from its magnitude $|{\bf F}|$. First, in contrast to the classical 1-D phase problem which in general has multiple solutions, we prove that with sufficient over-sampling, the sign problem admits a unique solution. Next, we show that the sign problem can be viewed as a special case of a more general piecewise constant phase problem. Relying on this result, we derive a computationally efficient and robust to noise sign recovery algorithm. In the noise-free case and with a sufficiently high sampling rate, our algorithm is guaranteed to recover the true sign pattern. Finally, we present two phase retrieval applications of the sign problem: (i) vectorial phase retrieval with three measurement vectors; and (ii) recovery of two well separated 1-D objects.

MLJan 29, 2023
Imbalanced Mixed Linear Regression

Pini Zilber, Boaz Nadler

We consider the problem of mixed linear regression (MLR), where each observed sample belongs to one of $K$ unknown linear models. In practical applications, the proportions of the $K$ components are often imbalanced. Unfortunately, most MLR methods do not perform well in such settings. Motivated by this practical challenge, in this work we propose Mix-IRLS, a novel, simple and fast algorithm for MLR with excellent performance on both balanced and imbalanced mixtures. In contrast to popular approaches that recover the $K$ models simultaneously, Mix-IRLS does it sequentially using tools from robust regression. Empirically, Mix-IRLS succeeds in a broad range of settings where other methods fail. These include imbalanced mixtures, small sample sizes, presence of outliers, and an unknown number of models $K$. In addition, Mix-IRLS outperforms competing methods on several real-world datasets, in some cases by a large margin. We complement our empirical results by deriving a recovery guarantee for Mix-IRLS, which highlights its advantage on imbalanced mixtures.

MLSep 5, 2024
Semi-Supervised Sparse Gaussian Classification: Provable Benefits of Unlabeled Data

Eyar Azar, Boaz Nadler

The premise of semi-supervised learning (SSL) is that combining labeled and unlabeled data yields significantly more accurate models. Despite empirical successes, the theoretical understanding of SSL is still far from complete. In this work, we study SSL for high dimensional sparse Gaussian classification. To construct an accurate classifier a key task is feature selection, detecting the few variables that separate the two classes. % For this SSL setting, we analyze information theoretic lower bounds for accurate feature selection as well as computational lower bounds, assuming the low-degree likelihood hardness conjecture. % Our key contribution is the identification of a regime in the problem parameters (dimension, sparsity, number of labeled and unlabeled samples) where SSL is guaranteed to be advantageous for classification. Specifically, there is a regime where it is possible to construct in polynomial time an accurate SSL classifier. However, % any computationally efficient supervised or unsupervised learning schemes, that separately use only the labeled or unlabeled data would fail. Our work highlights the provable benefits of combining labeled and unlabeled data for {classification and} feature selection in high dimensions. We present simulations that complement our theoretical analysis.

LGJan 9, 2023
Distributed Sparse Linear Regression under Communication Constraints

Rodney Fonseca, Boaz Nadler

In multiple domains, statistical tasks are performed in distributed settings, with data split among several end machines that are connected to a fusion center. In various applications, the end machines have limited bandwidth and power, and thus a tight communication budget. In this work we focus on distributed learning of a sparse linear regression model, under severe communication constraints. We propose several two round distributed schemes, whose communication per machine is sublinear in the data dimension. In our schemes, individual machines compute debiased lasso estimators, but send to the fusion center only very few values. On the theoretical front, we analyze one of these schemes and prove that with high probability it achieves exact support recovery at low signal to noise ratios, where individual machines fail to recover the support. We show in simulations that our scheme works as well as, and in some cases better, than more communication intensive approaches.

MLApr 27, 2023
A Majorization-Minimization Gauss-Newton Method for 1-Bit Matrix Completion

Xiaoqian Liu, Xu Han, Eric C. Chi et al.

In 1-bit matrix completion, the aim is to estimate an underlying low-rank matrix from a partial set of binary observations. We propose a novel method for 1-bit matrix completion called Majorization-Minimization Gauss-Newton (MMGN). Our method is based on the majorization-minimization principle, which converts the original optimization problem into a sequence of standard low-rank matrix completion problems. We solve each of these sub-problems by a factorization approach that explicitly enforces the assumed low-rank structure and then apply a Gauss-Newton method. Using simulations and a real data example, we illustrate that in comparison to existing 1-bit matrix completion methods, MMGN outputs comparable if not more accurate estimates. In addition, it is often significantly faster, and less sensitive to the spikiness of the underlying matrix. In comparison with three standard generic optimization approaches that directly minimize the original objective, MMGN also exhibits a clear computational advantage, especially when the fraction of observed entries is small.

MLSep 15, 2022
Recovery Guarantees for Distributed-OMP

Chen Amiraz, Robert Krauthgamer, Boaz Nadler

We study distributed schemes for high-dimensional sparse linear regression, based on orthogonal matching pursuit (OMP). Such schemes are particularly suited for settings where a central fusion center is connected to end machines, that have both computation and communication limitations. We prove that under suitable assumptions, distributed-OMP schemes recover the support of the regression vector with communication per machine linear in its sparsity and logarithmic in the dimension. Remarkably, this holds even at low signal-to-noise-ratios, where individual machines are unable to detect the support. Our simulations show that distributed-OMP schemes are competitive with more computationally intensive methods, and in some cases even outperform them.

23.1PEMar 10
SDSR: A Spectral Divide-and-Conquer Approach for Species Tree Reconstruction

Ortal Reshef, Ofer Glassman, Or Zuk et al.

Recovering a tree that represents the evolutionary history of a group of species is a key task in phylogenetics. Performing this task using sequence data from multiple genetic markers poses two key challenges. The first is the discordance between the evolutionary history of individual genes and that of the species. The second challenge is computational, as contemporary studies involve thousands of species. Here we present SDSR, a scalable divide-and-conquer approach for species tree reconstruction based on spectral graph theory. The algorithm recursively partitions the species into subsets until their sizes are below a given threshold. The trees of these subsets are reconstructed by a user-chosen species tree algorithm. Finally, these subtrees are merged to form the full tree. On the theoretical front, we derive recovery guarantees for SDSR, under the multispecies coalescent (MSC) model. We also perform a runtime complexity analysis. We show that SDSR, when combined with a species tree reconstruction algorithm as a subroutine, yields substantial runtime savings as compared to applying the same algorithm on the full data. Empirically, we evaluate SDSR on synthetic benchmark datasets with incomplete lineage sorting and horizontal gene transfer. In accordance with our theoretical analysis, the simulations show that combining SDSR with common species tree methods, such as CA-ML or ASTRAL, yields up to 10-fold faster runtimes. In addition, SDSR achieves a comparable tree reconstruction accuracy to that obtained by applying these methods on the full data.

MLJan 4, 2018Code
SpectralNet: Spectral Clustering using Deep Neural Networks

Uri Shaham, Kelly Stanton, Henry Li et al.

Spectral clustering is a leading and popular technique in unsupervised data analysis. Two of its major limitations are scalability and generalization of the spectral embedding (i.e., out-of-sample-extension). In this paper we introduce a deep learning approach to spectral clustering that overcomes the above shortcomings. Our network, which we call SpectralNet, learns a map that embeds input data points into the eigenspace of their associated graph Laplacian matrix and subsequently clusters them. We train SpectralNet using a procedure that involves constrained stochastic optimization. Stochastic optimization allows it to scale to large datasets, while the constraints, which are implemented using a special-purpose output layer, allow us to keep the network output orthogonal. Moreover, the map learned by SpectralNet naturally generalizes the spectral embedding to unseen data points. To further improve the quality of the clustering, we replace the standard pairwise Gaussian affinities with affinities leaned from unlabeled data using a Siamese network. Additional improvement can be achieved by applying the network to code representations produced, e.g., by standard autoencoders. Our end-to-end learning procedure is fully unsupervised. In addition, we apply VC dimension theory to derive a lower bound on the size of SpectralNet. State-of-the-art clustering results are reported on the Reuters dataset. Our implementation is publicly available at https://github.com/kstant0725/SpectralNet .

LGMay 19, 2025
RGNMR: A Gauss-Newton method for robust matrix completion with theoretical guarantees

Eilon Vaknin Laufer, Boaz Nadler

Recovering a low rank matrix from a subset of its entries, some of which may be corrupted, is known as the robust matrix completion (RMC) problem. Existing RMC methods have several limitations: they require a relatively large number of observed entries; they may fail under overparametrization, when their assumed rank is higher than the correct one; and many of them fail to recover even mildly ill-conditioned matrices. In this paper we propose a novel RMC method, denoted $\texttt{RGNMR}$, which overcomes these limitations. $\texttt{RGNMR}$ is a simple factorization-based iterative algorithm, which combines a Gauss-Newton linearization with removal of entries suspected to be outliers. On the theoretical front, we prove that under suitable assumptions, $\texttt{RGNMR}$ is guaranteed exact recovery of the underlying low rank matrix. Our theoretical results improve upon the best currently known for factorization-based methods. On the empirical front, we show via several simulations the advantages of $\texttt{RGNMR}$ over existing RMC methods, and in particular its ability to handle a small number of observed entries, overparameterization of the rank and ill-conditioned matrices.

LGJan 31, 2022
Inductive Matrix Completion: No Bad Local Minima and a Fast Algorithm

Pini Zilber, Boaz Nadler

The inductive matrix completion (IMC) problem is to recover a low rank matrix from few observed entries while incorporating prior knowledge about its row and column subspaces. In this work, we make three contributions to the IMC problem: (i) we prove that under suitable conditions, the IMC optimization landscape has no bad local minima; (ii) we derive a simple scheme with theoretical guarantees to estimate the rank of the unknown matrix; and (iii) we propose GNIMC, a simple Gauss-Newton based method to solve the IMC problem, analyze its runtime and derive recovery guarantees for it. The guarantees for GNIMC are sharper in several aspects than those available for other methods, including a quadratic convergence rate, fewer required observed entries and stability to errors or deviations from low-rank. Empirically, given entries observed uniformly at random, GNIMC recovers the underlying matrix substantially faster than several competing methods.

OCJun 24, 2021
GNMR: A provable one-line algorithm for low rank matrix recovery

Pini Zilber, Boaz Nadler

Low rank matrix recovery problems, including matrix completion and matrix sensing, appear in a broad range of applications. In this work we present GNMR -- an extremely simple iterative algorithm for low rank matrix recovery, based on a Gauss-Newton linearization. On the theoretical front, we derive recovery guarantees for GNMR in both the matrix sensing and matrix completion settings. Some of these results improve upon the best currently known for other methods. A key property of GNMR is that it implicitly keeps the factor matrices approximately balanced throughout its iterations. On the empirical front, we show that for matrix completion with uniform sampling, GNMR performs better than several popular methods, especially when given very few observations close to the information limit.

MLFeb 26, 2021
Spectral Top-Down Recovery of Latent Tree Models

Yariv Aizenbud, Ariel Jaffe, Meng Wang et al.

Modeling the distribution of high dimensional data by a latent tree graphical model is a prevalent approach in multiple scientific domains. A common task is to infer the underlying tree structure, given only observations of its terminal nodes. Many algorithms for tree recovery are computationally intensive, which limits their applicability to trees of moderate size. For large trees, a common approach, termed divide-and-conquer, is to recover the tree structure in two steps. First, recover the structure separately of multiple, possibly random subsets of the terminal nodes. Second, merge the resulting subtrees to form a full tree. Here, we develop Spectral Top-Down Recovery (STDR), a deterministic divide-and-conquer approach to infer large latent tree models. Unlike previous methods, STDR partitions the terminal nodes in a non random way, based on the Fiedler vector of a suitable Laplacian matrix related to the observed nodes. We prove that under certain conditions, this partitioning is consistent with the tree structure. This, in turn, leads to a significantly simpler merging procedure of the small subtrees. We prove that STDR is statistically consistent and bound the number of samples required to accurately recover the tree with high probability. Using simulated data from several common tree models in phylogenetics, we demonstrate that STDR has a significant advantage in terms of runtime, with improved or similar accuracy.

MLFeb 5, 2021
Distributed Sparse Normal Means Estimation with Sublinear Communication

Chen Amiraz, Robert Krauthgamer, Boaz Nadler

We consider the problem of sparse normal means estimation in a distributed setting with communication constraints. We assume there are $M$ machines, each holding $d$-dimensional observations of a $K$-sparse vector $μ$ corrupted by additive Gaussian noise. The $M$ machines are connected in a star topology to a fusion center, whose goal is to estimate the vector $μ$ with a low communication budget. Previous works have shown that to achieve the centralized minimax rate for the $\ell_2$ risk, the total communication must be high - at least linear in the dimension $d$. This phenomenon occurs, however, at very weak signals. We show that at signal-to-noise ratios (SNRs) that are sufficiently high - but not enough for recovery by any individual machine - the support of $μ$ can be correctly recovered with significantly less communication. Specifically, we present two algorithms for distributed estimation of a sparse mean vector corrupted by either Gaussian or sub-Gaussian noise. We then prove that above certain SNR thresholds, with high probability, these algorithms recover the correct support with total communication that is sublinear in the dimension $d$. Furthermore, the communication decreases exponentially as a function of signal strength. If in addition $KM\ll \tfrac{d}{\log d}$, then with an additional round of sublinear communication, our algorithms achieve the centralized rate for the $\ell_2$ risk. Finally, we present simulations that illustrate the performance of our algorithms in different parameter regimes.

LGJan 3, 2021
Improved Convergence Guarantees for Learning Gaussian Mixture Models by EM and Gradient EM

Nimrod Segol, Boaz Nadler

We consider the problem of estimating the parameters a Gaussian Mixture Model with K components of known weights, all with an identity covariance matrix. We make two contributions. First, at the population level, we present a sharper analysis of the local convergence of EM and gradient EM, compared to previous works. Assuming a separation of $Ω(\sqrt{\log K})$, we prove convergence of both methods to the global optima from an initialization region larger than those of previous works. Specifically, the initial guess of each component can be as far as (almost) half its distance to the nearest Gaussian. This is essentially the largest possible contraction region. Our second contribution are improved sample size requirements for accurate estimation by EM and gradient EM. In previous works, the required number of samples had a quadratic dependence on the maximal separation between the K components, and the resulting error estimate increased linearly with this maximal separation. In this manuscript we show that both quantities depend only logarithmically on the maximal separation.

LGMay 18, 2020
The Trimmed Lasso: Sparse Recovery Guarantees and Practical Optimization by the Generalized Soft-Min Penalty

Tal Amir, Ronen Basri, Boaz Nadler

We present a new approach to solve the sparse approximation or best subset selection problem, namely find a $k$-sparse vector ${\bf x}\in\mathbb{R}^d$ that minimizes the $\ell_2$ residual $\lVert A{\bf x}-{\bf y} \rVert_2$. We consider a regularized approach, whereby this residual is penalized by the non-convex $\textit{trimmed lasso}$, defined as the $\ell_1$-norm of ${\bf x}$ excluding its $k$ largest-magnitude entries. We prove that the trimmed lasso has several appealing theoretical properties, and in particular derive sparse recovery guarantees assuming successful optimization of the penalized objective. Next, we show empirically that directly optimizing this objective can be quite challenging. Instead, we propose a surrogate for the trimmed lasso, called the $\textit{generalized soft-min}$. This penalty smoothly interpolates between the classical lasso and the trimmed lasso, while taking into account all possible $k$-sparse patterns. The generalized soft-min penalty involves summation over $\binom{d}{k}$ terms, yet we derive a polynomial-time algorithm to compute it. This, in turn, yields a practical method for the original sparse approximation problem. Via simulations, we demonstrate its competitive performance compared to current state of the art.

MLFeb 28, 2020
Spectral neighbor joining for reconstruction of latent tree models

Ariel Jaffe, Noah Amsel, Yariv Aizenbud et al.

A common assumption in multiple scientific applications is that the distribution of observed data can be modeled by a latent tree graphical model. An important example is phylogenetics, where the tree models the evolutionary lineages of a set of observed organisms. Given a set of independent realizations of the random variables at the leaves of the tree, a key challenge is to infer the underlying tree topology. In this work we develop Spectral Neighbor Joining (SNJ), a novel method to recover the structure of latent tree graphical models. Given a matrix that contains a measure of similarity between all pairs of observed variables, SNJ computes a spectral measure of cohesion between groups of observed variables. We prove that SNJ is consistent, and derive a sufficient condition for correct tree recovery from an estimated similarity matrix. Combining this condition with a concentration of measure result on the similarity matrix, we bound the number of samples required to recover the tree with high probability. We illustrate via extensive simulations that in comparison to several other reconstruction methods, SNJ requires fewer samples to accurately recover trees with a large number of leaves or long edges.

OCFeb 5, 2020
Rank $2r$ iterative least squares: efficient recovery of ill-conditioned low rank matrices from few entries

Jonathan Bauch, Boaz Nadler, Pini Zilber

We present a new, simple and computationally efficient iterative method for low rank matrix completion. Our method is inspired by the class of factorization-type iterative algorithms, but substantially differs from them in the way the problem is cast. Precisely, given a target rank $r$, instead of optimizing on the manifold of rank $r$ matrices, we allow our interim estimated matrix to have a specific over-parametrized rank $2r$ structure. Our algorithm, denoted R2RILS for rank $2r$ iterative least squares, has low memory requirements, and at each iteration it solves a computationally cheap sparse least-squares problem. We motivate our algorithm by its theoretical analysis for the simplified case of a rank-1 matrix. Empirically, R2RILS is able to recover ill conditioned low rank matrices from very few observations -- near the information limit, and it is stable to additive noise.

MLJun 6, 2018
Beyond Trees: Classification with Sparse Pairwise Dependencies

Yaniv Tenzer, Amit Moscovich, Mary Frances Dorn et al.

Several classification methods assume that the underlying distributions follow tree-structured graphical models. Indeed, trees capture statistical dependencies between pairs of variables, which may be crucial to attain low classification errors. The resulting classifier is linear in the log-transformed univariate and bivariate densities that correspond to the tree edges. In practice, however, observed data may not be well approximated by trees. Yet, motivated by the importance of pairwise dependencies for accurate classification, here we propose to approximate the optimal decision boundary by a sparse linear combination of the univariate and bivariate log-transformed densities. Our proposed approach is semi-parametric in nature: we non-parametrically estimate the univariate and bivariate densities, remove pairs of variables that are nearly independent using the Hilbert-Schmidt independence criteria, and finally construct a linear SVM on the retained log-transformed densities. We demonstrate using both synthetic and real data that our resulting classifier, denoted SLB (Sparse Log-Bivariate density), is competitive with popular classification methods.

MLFeb 27, 2018
Learning Binary Latent Variable Models: A Tensor Eigenpair Approach

Ariel Jaffe, Roi Weiss, Shai Carmi et al.

Latent variable models with hidden binary units appear in various applications. Learning such models, in particular in the presence of noise, is a challenging computational problem. In this paper we propose a novel spectral approach to this problem, based on the eigenvectors of both the second order moment matrix and third order moment tensor of the observed data. We prove that under mild non-degeneracy conditions, our method consistently estimates the model parameters at the optimal parametric rate. Our tensor-based method generalizes previous orthogonal tensor decomposition approaches, where the hidden units were assumed to be either statistically independent or mutually exclusive. We illustrate the consistency of our method on simulated data and demonstrate its usefulness in learning a common model for population mixtures in genetics.

CVJun 22, 2017
On Detection of Faint Edges in Noisy Images

Nati Ofir, Meirav Galun, Sharon Alpert et al.

A fundamental question for edge detection in noisy images is how faint can an edge be and still be detected. In this paper we offer a formalism to study this question and subsequently introduce computationally efficient multiscale edge detection algorithms designed to detect faint edges in noisy images. In our formalism we view edge detection as a search in a discrete, though potentially large, set of feasible curves. First, we derive approximate expressions for the detection threshold as a function of curve length and the complexity of the search space. We then present two edge detection algorithms, one for straight edges, and the second for curved ones. Both algorithms efficiently search for edges in a large set of candidates by hierarchically constructing difference filters that match the curves traced by the sought edges. We demonstrate the utility of our algorithms in both simulations and applications involving challenging real images. Finally, based on these principles, we develop an algorithm for fiber detection and enhancement. We exemplify its utility to reveal and enhance nerve axons in light microscopy images.

MLMar 8, 2017
Unsupervised Ensemble Regression

Omer Dror, Boaz Nadler, Erhan Bilal et al.

Consider a regression problem where there is no labeled data and the only observations are the predictions $f_i(x_j)$ of $m$ experts $f_{i}$ over many samples $x_j$. With no knowledge on the accuracy of the experts, is it still possible to accurately estimate the unknown responses $y_{j}$? Can one still detect the least or most accurate experts? In this work we propose a framework to study these questions, based on the assumption that the $m$ experts have uncorrelated deviations from the optimal predictor. Assuming the first two moments of the response are known, we develop methods to detect the best and worst regressors, and derive U-PCR, a novel principal components approach for unsupervised ensemble regression. We provide theoretical support for U-PCR and illustrate its improved accuracy over the ensemble mean and median on a variety of regression problems.

MLNov 7, 2016
Minimax-optimal semi-supervised regression on unknown manifolds

Amit Moscovich, Ariel Jaffe, Boaz Nadler

We consider semi-supervised regression when the predictor variables are drawn from an unknown manifold. A simple two step approach to this problem is to: (i) estimate the manifold geodesic distance between any pair of points using both the labeled and unlabeled instances; and (ii) apply a k nearest neighbor regressor based on these distance estimates. We prove that given sufficiently many unlabeled points, this simple method of geodesic kNN regression achieves the optimal finite-sample minimax bound on the mean squared error, as if the manifold were known. Furthermore, we show how this approach can be efficiently implemented, requiring only O(k N log N) operations to estimate the regression function at all N labeled and unlabeled points. We illustrate this approach on two datasets with a manifold structure: indoor localization using WiFi fingerprints and facial pose estimation. In both cases, geodesic kNN is more accurate and much faster than the popular Laplacian eigenvector regressor.

MLFeb 6, 2016
A Deep Learning Approach to Unsupervised Ensemble Learning

Uri Shaham, Xiuyuan Cheng, Omer Dror et al.

We show how deep learning methods can be applied in the context of crowdsourcing and unsupervised ensemble learning. First, we prove that the popular model of Dawid and Skene, which assumes that all classifiers are conditionally independent, is {\em equivalent} to a Restricted Boltzmann Machine (RBM) with a single hidden node. Hence, under this model, the posterior probabilities of the true labels can be instead estimated via a trained RBM. Next, to address the more general case, where classifiers may strongly violate the conditional independence assumption, we propose to apply RBM-based Deep Neural Net (DNN). Experimental results on various simulated and real-world datasets demonstrate that our proposed DNN approach outperforms other state-of-the-art methods, in particular when the data violates the conditional independence assumption.

LGOct 20, 2015
Unsupervised Ensemble Learning with Dependent Classifiers

Ariel Jaffe, Ethan Fetaya, Boaz Nadler et al.

In unsupervised ensemble learning, one obtains predictions from multiple sources or classifiers, yet without knowing the reliability and expertise of each source, and with no labeled data to assess it. The task is to combine these possibly conflicting predictions into an accurate meta-learner. Most works to date assumed perfect diversity between the different sources, a property known as conditional independence. In realistic scenarios, however, this assumption is often violated, and ensemble learners based on it can be severely sub-optimal. The key challenges we address in this paper are:\ (i) how to detect, in an unsupervised manner, strong violations of conditional independence; and (ii) construct a suitable meta-learner. To this end we introduce a statistical model that allows for dependencies between classifiers. Our main contributions are the development of novel unsupervised methods to detect strongly dependent classifiers, better estimate their accuracies, and construct an improved meta-learner. Using both artificial and real datasets, we showcase the importance of taking classifier dependencies into account and the competitive performance of our approach.

CVMay 25, 2015
Fast Detection of Curved Edges at Low SNR

Nati Ofir, Meirav Galun, Boaz Nadler et al.

Detecting edges is a fundamental problem in computer vision with many applications, some involving very noisy images. While most edge detection methods are fast, they perform well only on relatively clean images. Indeed, edges in such images can be reliably detected using only local filters. Detecting faint edges under high levels of noise cannot be done locally at the individual pixel level, and requires more sophisticated global processing. Unfortunately, existing methods that achieve this goal are quite slow. In this paper we develop a novel multiscale method to detect curved edges in noisy images. While our algorithm searches for edges over a huge set of candidate curves, it does so in a practical runtime, nearly linear in the total number of image pixels. As we demonstrate experimentally, our algorithm is orders of magnitude faster than previous methods designed to deal with high noise levels. Nevertheless, it obtains comparable, if not better, edge detection quality on a variety of challenging noisy images.

COMay 12, 2015
Detecting the large entries of a sparse covariance matrix in sub-quadratic time

Ofer Shwartz, Boaz Nadler

The covariance matrix of a $p$-dimensional random variable is a fundamental quantity in data analysis. Given $n$ i.i.d. observations, it is typically estimated by the sample covariance matrix, at a computational cost of $O(np^{2})$ operations. When $n,p$ are large, this computation may be prohibitively slow. Moreover, in several contemporary applications, the population matrix is approximately sparse, and only its few large entries are of interest. This raises the following question, at the focus of our work: Assuming approximate sparsity of the covariance matrix, can its large entries be detected much faster, say in sub-quadratic time, without explicitly computing all its $p^{2}$ entries? In this paper, we present and theoretically analyze two randomized algorithms that detect the large entries of an approximately sparse sample covariance matrix using only $O(np\text{ poly log } p)$ operations. Furthermore, assuming sparsity of the population matrix, we derive sufficient conditions on the underlying random variable and on the number of samples $n$, for the sample covariance matrix to satisfy our approximate sparsity requirements. Finally, we illustrate the performance of our algorithms via several simulations.

LGFeb 7, 2015
Learning Parametric-Output HMMs with Two Aliased States

Roi Weiss, Boaz Nadler

In various applications involving hidden Markov models (HMMs), some of the hidden states are aliased, having identical output distributions. The minimality, identifiability and learnability of such aliased HMMs have been long standing problems, with only partial solutions provided thus far. In this paper we focus on parametric-output HMMs, whose output distributions come from a parametric family, and that have exactly two aliased states. For this class, we present a complete characterization of their minimality and identifiability. Furthermore, for a large family of parametric output distributions, we derive computationally efficient and statistically consistent algorithms to detect the presence of aliasing and learn the aliased HMM transition and emission parameters. We illustrate our theoretical analysis by several simulations.

MLJul 29, 2014
Estimating the Accuracies of Multiple Classifiers Without Labeled Data

Ariel Jaffe, Boaz Nadler, Yuval Kluger

In various situations one is given only the predictions of multiple classifiers over a large unlabeled test data. This scenario raises the following questions: Without any labeled data and without any a-priori knowledge about the reliability of these different classifiers, is it possible to consistently and computationally efficiently estimate their accuracies? Furthermore, also in a completely unsupervised manner, can one construct a more accurate unsupervised ensemble classifier? In this paper, focusing on the binary case, we present simple, computationally efficient algorithms to solve these questions. Furthermore, under standard classifier independence assumptions, we prove our methods are consistent and study their asymptotic error. Our approach is spectral, based on the fact that the off-diagonal entries of the classifiers' covariance matrix and 3-d tensor are rank-one. We illustrate the competitive performance of our algorithms via extensive experiments on both artificial and real datasets.

MLJul 10, 2014
On the Optimality of Averaging in Distributed Statistical Learning

Jonathan Rosenblatt, Boaz Nadler

A common approach to statistical learning with big-data is to randomly split it among $m$ machines and learn the parameter of interest by averaging the $m$ individual estimates. In this paper, focusing on empirical risk minimization, or equivalently M-estimation, we study the statistical error incurred by this strategy. We consider two large-sample settings: First, a classical setting where the number of parameters $p$ is fixed, and the number of samples per machine $n\to\infty$. Second, a high-dimensional regime where both $p,n\to\infty$ with $p/n \to κ\in (0,1)$. For both regimes and under suitable assumptions, we present asymptotically exact expressions for this estimation error. In the fixed-$p$ setting, under suitable assumptions, we prove that to leading order averaging is as accurate as the centralized solution. We also derive the second order error terms, and show that these can be non-negligible, notably for non-linear models. The high-dimensional setting, in contrast, exhibits a qualitatively different behavior: data splitting incurs a first-order accuracy loss, which to leading order increases linearly with the number of machines. The dependence of our error approximations on the number of machines traces an interesting accuracy-complexity tradeoff, allowing the practitioner an informed choice on the number of machines to deploy. Finally, we confirm our theoretical analysis with several simulations.

STJun 16, 2013
Do semidefinite relaxations solve sparse PCA up to the information limit?

Robert Krauthgamer, Boaz Nadler, Dan Vilenchik

Estimating the leading principal components of data, assuming they are sparse, is a central task in modern high-dimensional statistics. Many algorithms were developed for this sparse PCA problem, from simple diagonal thresholding to sophisticated semidefinite programming (SDP) methods. A key theoretical question is under what conditions can such algorithms recover the sparse principal components? We study this question for a single-spike model with an $\ell_0$-sparse eigenvector, in the asymptotic regime as dimension $p$ and sample size $n$ both tend to infinity. Amini and Wainwright [Ann. Statist. 37 (2009) 2877-2921] proved that for sparsity levels $k\geqΩ(n/\log p)$, no algorithm, efficient or not, can reliably recover the sparse eigenvector. In contrast, for $k\leq O(\sqrt{n/\log p})$, diagonal thresholding is consistent. It was further conjectured that an SDP approach may close this gap between computational and information limits. We prove that when $k\geqΩ(\sqrt{n})$, the proposed SDP approach, at least in its standard usage, cannot recover the sparse spike. In fact, we conjecture that in the single-spike model, no computationally-efficient algorithm can recover a spike of $\ell_0$-sparsity $k\geqΩ(\sqrt{n})$. Finally, we present empirical results suggesting that up to sparsity levels $k=O(\sqrt{n})$, recovery is possible by a simple covariance thresholding algorithm.

MLMar 13, 2013
Ranking and combining multiple predictors without labeled data

Fabio Parisi, Francesco Strino, Boaz Nadler et al.

In a broad range of classification and decision making problems, one is given the advice or predictions of several classifiers, of unknown reliability, over multiple questions or queries. This scenario is different from the standard supervised setting, where each classifier accuracy can be assessed using available labeled data, and raises two questions: given only the predictions of several classifiers over a large set of unlabeled test data, is it possible to a) reliably rank them; and b) construct a meta-classifier more accurate than most classifiers in the ensemble? Here we present a novel spectral approach to address these questions. First, assuming conditional independence between classifiers, we show that the off-diagonal entries of their covariance matrix correspond to a rank-one matrix. Moreover, the classifiers can be ranked using the leading eigenvector of this covariance matrix, as its entries are proportional to their balanced accuracies. Second, via a linear approximation to the maximum likelihood estimator, we derive the Spectral Meta-Learner (SML), a novel ensemble classifier whose weights are equal to this eigenvector entries. On both simulated and real data, SML typically achieves a higher accuracy than most classifiers in the ensemble and can provide a better starting point than majority voting, for estimating the maximum likelihood solution. Furthermore, SML is robust to the presence of small malicious groups of classifiers designed to veer the ensemble prediction away from the (unknown) ground truth.

LGFeb 25, 2013
On learning parametric-output HMMs

Aryeh Kontorovich, Boaz Nadler, Roi Weiss

We present a novel approach for learning an HMM whose outputs are distributed according to a parametric family. This is done by {\em decoupling} the learning task into two steps: first estimating the output parameters, and then estimating the hidden states transition probabilities. The first step is accomplished by fitting a mixture model to the output stationary distribution. Given the parameters of this mixture model, the second step is formulated as the solution of an easily solvable convex quadratic program. We provide an error analysis for the estimated transition probabilities and show they are robust to small perturbations in the estimates of the mixture parameters. Finally, we support our analysis with some encouraging empirical results.

LGOct 16, 2012
Active Learning with Distributional Estimates

Jens Roeder, Boaz Nadler, Kevin Kunzmann et al.

Active Learning (AL) is increasingly important in a broad range of applications. Two main AL principles to obtain accurate classification with few labeled data are refinement of the current decision boundary and exploration of poorly sampled regions. In this paper we derive a novel AL scheme that balances these two principles in a natural way. In contrast to many AL strategies, which are based on an estimated class conditional probability ^p(y|x), a key component of our approach is to view this quantity as a random variable, hence explicitly considering the uncertainty in its estimated value. Our main contribution is a novel mathematical framework for uncertainty-based AL, and a corresponding AL scheme, where the uncertainty in ^p(y|x) is modeled by a second-order distribution. On the practical side, we show how to approximate such second-order distributions for kernel density classification. Finally, we find that over a large number of UCI, USPS and Caltech4 datasets, our AL scheme achieves significantly better learning curves than popular AL methods such as uncertainty sampling and error reduction sampling, when all use the same kernel density classifier.

NAJun 6, 2005
Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck operators

Boaz Nadler, Stephane Lafon, Ronald R. Coifman et al.

This paper presents a diffusion based probabilistic interpretation of spectral clustering and dimensionality reduction algorithms that use the eigenvectors of the normalized graph Laplacian. Given the pairwise adjacency matrix of all points, we define a diffusion distance between any two data points and show that the low dimensional representation of the data by the first few eigenvectors of the corresponding Markov matrix is optimal under a certain mean squared error criterion. Furthermore, assuming that data points are random samples from a density $p(\x) = e^{-U(\x)}$ we identify these eigenvectors as discrete approximations of eigenfunctions of a Fokker-Planck operator in a potential $2U(\x)$ with reflecting boundary conditions. Finally, applying known results regarding the eigenvalues and eigenfunctions of the continuous Fokker-Planck operator, we provide a mathematical justification for the success of spectral clustering and dimensional reduction algorithms based on these first few eigenvectors. This analysis elucidates, in terms of the characteristics of diffusion processes, many empirical findings regarding spectral clustering algorithms.

NAMar 22, 2005
Diffusion maps, spectral clustering and reaction coordinates of dynamical systems

Boaz Nadler, Stephane Lafon, Ronald R. Coifman et al.

A central problem in data analysis is the low dimensional representation of high dimensional data, and the concise description of its underlying geometry and density. In the analysis of large scale simulations of complex dynamical systems, where the notion of time evolution comes into play, important problems are the identification of slow variables and dynamically meaningful reaction coordinates that capture the long time evolution of the system. In this paper we provide a unifying view of these apparently different tasks, by considering a family of {\em diffusion maps}, defined as the embedding of complex (high dimensional) data onto a low dimensional Euclidian space, via the eigenvectors of suitably defined random walks defined on the given datasets. Assuming that the data is randomly sampled from an underlying general probability distribution $p(\x)=e^{-U(\x)}$, we show that as the number of samples goes to infinity, the eigenvectors of each diffusion map converge to the eigenfunctions of a corresponding differential operator defined on the support of the probability distribution. Different normalizations of the Markov chain on the graph lead to different limiting differential operators. One normalization gives the Fokker-Planck operators with the same potential U(x), best suited for the study of stochastic differential equations as well as for clustering. Another normalization gives the Laplace-Beltrami (heat) operator on the manifold in which the data resides, best suited for the analysis of the geometry of the dataset, regardless of its possibly non-uniform density.