David P. Helmbold

h-index33

8papers

224citations

Novelty53%

AI Score27

Ranked #157,037 of 194,257 authors (top 81%)#34,429 in LG (top 86%)

8 Papers

1.0LGOct 10, 2019

Online Learning Using Only Peer Prediction

Yang Liu, David P. Helmbold

This paper considers a variant of the classical online learning problem with expert predictions. Our model's differences and challenges are due to lacking any direct feedback on the loss each expert incurs at each time step $t$. We propose an approach that uses peer prediction and identify conditions where it succeeds. Our techniques revolve around a carefully designed peer score function $s()$ that scores experts' predictions based on the peer consensus. We show a sufficient condition, that we call \emph{peer calibration}, under which standard online learning algorithms using loss feedback computed by the carefully crafted $s()$ have bounded regret with respect to the unrevealed ground truth values. We then demonstrate how suitable $s()$ functions can be derived for different assumptions and models.

4.8LGFeb 1, 2019Code

A New Family of Neural Networks Provably Resistant to Adversarial Attacks

Rakshit Agrawal, Luca de Alfaro, David Helmbold

Adversarial attacks add perturbations to the input features with the intent of changing the classification produced by a machine learning system. Small perturbations can yield adversarial examples which are misclassified despite being virtually indistinguishable from the unperturbed input. Classifiers trained with standard neural network techniques are highly susceptible to adversarial examples, allowing an adversary to create misclassifications of their choice. We introduce a new type of network unit, called MWD (max of weighed distance) units that have a built-in resistant to adversarial attacks. These units are highly non-linear, and we develop the techniques needed to effectively train them. We show that simple interval techniques for propagating perturbation effects through the network enables the efficient computation of robustness (i.e., accuracy guarantees) for MWD networks under any perturbations, including adversarial attacks. MWD networks are significantly more robust to input perturbations than ReLU networks. On permutation invariant MNIST, when test examples can be perturbed by 20% of the input range, MWD networks provably retain accuracy above 83%, while the accuracy of ReLU networks drops below 5%. The provable accuracy of MWD networks is superior even to the observed accuracy of ReLU networks trained with the help of adversarial examples. In the absence of adversarial attacks, MWD networks match the performance of sigmoid networks, and have accuracy only slightly below that of ReLU networks.

18.3LGFeb 16, 2018

Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks

Peter L. Bartlett, David P. Helmbold, Philip M. Long

We analyze algorithms for approximating a function $f(x) = Φx$ mapping $\Re^d$ to $\Re^d$ using deep linear neural networks, i.e. that learn a function $h$ parameterized by matrices $Θ_1,...,Θ_L$ and defined by $h(x) = Θ_L Θ_{L-1} ... Θ_1 x$. We focus on algorithms that learn through gradient descent on the population quadratic loss in the case that the distribution over the inputs is isotropic. We provide polynomial bounds on the number of iterations for gradient descent to approximate the least squares matrix $Φ$, in the case where the initial hypothesis $Θ_1 = ... = Θ_L = I$ has excess loss bounded by a small enough constant. On the other hand, we show that gradient descent fails to converge for $Φ$ whose distance from the identity is a larger constant, and we show that some forms of regularization toward the identity in each layer do not help. If $Φ$ is symmetric positive definite, we show that an algorithm that initializes $Θ_i = I$ learns an $ε$-approximation of $f$ using a number of updates polynomial in $L$, the condition number of $Φ$, and $\log(d/ε)$. In contrast, we show that if the least squares matrix $Φ$ is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge. We analyze an algorithm for the case that $Φ$ satisfies $u^{\top} Φu > 0$ for all $u$, but may not be symmetric. This algorithm uses two regularizers: one that maintains the invariant $u^{\top} Θ_L Θ_{L-1} ... Θ_1 u > 0$ for all $u$, and another that "balances" $Θ_1, ..., Θ_L$ so that they have the same singular values.

3.5LGSep 17, 2016

Online Learning of Combinatorial Objects via Extended Formulation

Holakou Rahmanian, David P. Helmbold, S. V. N. Vishwanathan

The standard techniques for online learning of combinatorial objects perform multiplicative updates followed by projections into the convex hull of all the objects. However, this methodology can be expensive if the convex hull contains many facets. For example, the convex hull of $n$-symbol Huffman trees is known to have exponentially many facets (Maurras et al., 2010). We get around this difficulty by exploiting extended formulations (Kaibel, 2011), which encode the polytope of combinatorial objects in a higher dimensional "extended" space with only polynomially many facets. We develop a general framework for converting extended formulations into efficient online algorithms with good relative loss bounds. We present applications of our framework to online learning of Huffman trees and permutations. The regret bounds of the resulting algorithms are within a factor of $O(\sqrt{\log(n)})$ of the state-of-the-art specialized algorithms for permutations, and depending on the loss regimes, improve on or match the state-of-the-art for Huffman trees. Our method is general and can be applied to other combinatorial objects.

11.0LGFeb 14, 2016

Surprising properties of dropout in deep networks

David P. Helmbold, Philip M. Long

We analyze dropout in deep networks with rectified linear units and the quadratic loss. Our results expose surprising differences between the behavior of dropout and more traditional regularizers like weight decay. For example, on some simple data sets dropout training produces negative weights even though the output is the sum of the inputs. This provides a counterpoint to the suggestion that dropout discourages co-adaptation of weights. We also show that the dropout penalty can grow exponentially in the depth of the network while the weight-decay penalty remains essentially linear, and that dropout is insensitive to various re-scalings of the input features, outputs, and network weights. This last insensitivity implies that there are no isolated local minima of the dropout training criterion. Our work uncovers new properties of dropout, extends our understanding of why dropout succeeds, and lays the foundation for further progress.

22.6LGDec 15, 2014

On the Inductive Bias of Dropout

David P. Helmbold, Philip M. Long

Dropout is a simple but effective technique for learning in neural networks and other settings. A sound theoretical understanding of dropout is needed to determine when dropout should be applied and how to use it most effectively. In this paper we continue the exploration of dropout as a regularizer pioneered by Wager, et.al. We focus on linear classification where a convex proxy to the misclassification loss (i.e. the logistic loss used in logistic regression) is minimized. We show: (a) when the dropout-regularized criterion has a unique minimizer, (b) when the dropout-regularization penalty goes to infinity with the weights, and when it remains bounded, (c) that the dropout regularization can be non-monotonic as individual weights increase from 0, and (d) that the dropout regularization penalty may not be convex. This last point is particularly surprising because the combination of dropout regularization with any convex loss proxy is always a convex function. In order to contrast dropout regularization with $L_2$ regularization, we formalize the notion of when different sources are more compatible with different regularizers. We then exhibit distributions that are provably more compatible with dropout regularization than $L_2$ regularization, and vice versa. These sources provide additional insight into how the inductive biases of dropout and $L_2$ regularization differ. We provide some similar results for $L_1$ regularization.

3.1CVSep 4, 2013

Boosting in Location Space

Damian Eads, David Helmbold, Ed Rosten

The goal of object detection is to find objects in an image. An object detector accepts an image and produces a list of locations as $(x,y)$ pairs. Here we introduce a new concept: {\bf location-based boosting}. Location-based boosting differs from previous boosting algorithms because it optimizes a new spatial loss function to combine object detectors, each of which may have marginal performance, into a single, more accurate object detector. A structured representation of object locations as a list of $(x,y)$ pairs is a more natural domain for object detection than the spatially unstructured representation produced by classifiers. Furthermore, this formulation allows us to take advantage of the intuition that large areas of the background are uninteresting and it is not worth expending computational effort on them. This results in a more scalable algorithm because it does not need to take measures to prevent the background data from swamping the foreground data such as subsampling or applying an ad-hoc weighting to the pixels. We first present the theory of location-based boosting, and then motivate it with empirical results on a challenging data set.

3.8LGMar 12, 2012

On the Necessity of Irrelevant Variables

David P. Helmbold, Philip M. Long

This work explores the effects of relevant and irrelevant boolean variables on the accuracy of classifiers. The analysis uses the assumption that the variables are conditionally independent given the class, and focuses on a natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing. The main result is that algorithms relying predominately on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables have errors bounded below by a positive constant. We also show that accurate learning is possible even when there are so few examples that one cannot determine with high confidence whether or not any individual variable is relevant.