Daniel Holtmann-Rice

h-index11

6papers

415citations

Novelty50%

AI Score27

Ranked #154,664 of 194,257 authors (top 80%)#50,286 in CV (top 85%)

6 Papers

17.8LGOct 16, 2018

Stochastic Negative Mining for Learning with Large Output Spaces

Sashank J. Reddi, Satyen Kale, Felix Yu et al.

We consider the problem of retrieving the most relevant labels for a given input when the size of the output space is very large. Retrieval methods are modeled as set-valued classifiers which output a small set of classes for each input, and a mistake is made if the label is not in the output set. Despite its practical importance, a statistically principled, yet practical solution to this problem is largely missing. To this end, we first define a family of surrogate losses and show that they are calibrated and convex under certain conditions on the loss parameters and data distribution, thereby establishing a statistical and analytical basis for using these losses. Furthermore, we identify a particularly intuitive class of loss functions in the aforementioned family and show that they are amenable to practical implementation in the large output space setting (i.e. computation is possible without evaluating scores of all labels) by developing a technique called Stochastic Negative Mining. We also provide generalization error bounds for the losses in the family. Finally, we conduct experiments which demonstrate that Stochastic Negative Mining yields benefits over commonly used negative sampling approaches.

13.9MLJun 26, 2018Code

Learning a Compressed Sensing Measurement Matrix via Gradient Unrolling

Shanshan Wu, Alexandros G. Dimakis, Sujay Sanghavi et al.

Linear encoding of sparse vectors is widely popular, but is commonly data-independent -- missing any possible extra (but a priori unknown) structure beyond sparsity. In this paper we present a new method to learn linear encoders that adapt to data, while still performing well with the widely used $\ell_1$ decoder. The convex $\ell_1$ decoder prevents gradient propagation as needed in standard gradient-based training. Our method is based on the insight that unrolling the convex decoder into $T$ projected subgradient steps can address this issue. Our method can be seen as a data-driven way to learn a compressed sensing measurement matrix. We compare the empirical performance of 10 algorithms over 6 sparse datasets (3 synthetic and 3 real). Our experiments show that there is indeed additional structure beyond sparsity in the real datasets; our method is able to discover it and exploit it to create excellent reconstructions with fewer measurements (by a factor of 1.1-3x) compared to the previous state-of-the-art methods. We illustrate an application of our method in learning label embeddings for extreme multi-label classification, and empirically show that our method is able to match or outperform the precision scores of SLEEC, which is one of the state-of-the-art embedding-based approaches.

13.4MLNov 15, 2017

Lattice Rescoring Strategies for Long Short Term Memory Language Models in Speech Recognition

Shankar Kumar, Michael Nirschl, Daniel Holtmann-Rice et al.

Recurrent neural network (RNN) language models (LMs) and Long Short Term Memory (LSTM) LMs, a variant of RNN LMs, have been shown to outperform traditional N-gram LMs on speech recognition tasks. However, these models are computationally more expensive than N-gram LMs for decoding, and thus, challenging to integrate into speech recognizers. Recent research has proposed the use of lattice-rescoring algorithms using RNNLMs and LSTMLMs as an efficient strategy to integrate these models into a speech recognition system. In this paper, we evaluate existing lattice rescoring algorithms along with new variants on a YouTube speech recognition task. Lattice rescoring using LSTMLMs reduces the word error rate (WER) for this task by 8\% relative to the WER obtained using an N-gram LM.

3.1CVMay 16, 2017

Tensors, Differential Geometry and Statistical Shading Analysis

Daniel Niels Holtmann-Rice, Benjamin S. Kunsberg, Steven W. Zucker

We develop a linear algebraic framework for the shape-from-shading problem, because tensors arise when scalar (e.g. image) and vector (e.g. surface normal) fields are differentiated multiple times. Using this framework, we first investigate when image derivatives exhibit invariance to changing illumination by calculating the statistics of image derivatives under general distributions on the light source. Second, we apply that framework to develop Taylor-like expansions, and build a boot-strapping algorithm to find the polynomial surface solutions (under any light source) consistent with a given patch to arbitrary order. A generic constraint on the light source restricts these solutions to a 2-D subspace, plus an unknown rotation matrix. It is this unknown matrix that encapsulates the ambiguity in the problem. Finally, we use the framework to computationally validate the hypothesis that image orientations (derivatives) provide increased invariance to illumination by showing (for a Lambertian model) that a shape-from-shading algorithm matching gradients instead of intensities provides more accurate reconstructions when illumination is incorrectly estimated under a flatness prior.

3.8CVMay 16, 2017

What's In A Patch, I: Tensors, Differential Geometry and Statistical Shading Analysis

Daniel Niels Holtmann-Rice, Benjamin S. Kunsberg, Steven W. Zucker

28.2LGOct 28, 2016

Orthogonal Random Features

Felix X. Yu, Ananda Theertha Suresh, Krzysztof Choromanski et al.

We present an intriguing discovery related to Random Fourier Features: in Gaussian kernel approximation, replacing the random Gaussian matrix by a properly scaled random orthogonal matrix significantly decreases kernel approximation error. We call this technique Orthogonal Random Features (ORF), and provide theoretical and empirical justification for this behavior. Motivated by this discovery, we further propose Structured Orthogonal Random Features (SORF), which uses a class of structured discrete orthogonal matrices to speed up the computation. The method reduces the time cost from $\mathcal{O}(d^2)$ to $\mathcal{O}(d \log d)$, where $d$ is the data dimensionality, with almost no compromise in kernel approximation quality compared to ORF. Experiments on several datasets verify the effectiveness of ORF and SORF over the existing methods. We also provide discussions on using the same type of discrete orthogonal structure for a broader range of applications.