Mary Wootters

8papers

617citations

Novelty57%

AI Score44

Ranked #72,026 of 201,326 authors (top 36%)#16,322 in LG (top 38%)

8 Papers

LGJun 16, 2022

Max-Margin Works while Large Margin Fails: Generalization without Uniform Convergence

Margalit Glasgow, Colin Wei, Mary Wootters et al. · stanford

A major challenge in modern machine learning is theoretically understanding the generalization properties of overparameterized models. Many existing tools rely on uniform convergence (UC), a property that, when it holds, guarantees that the test loss will be close to the training loss, uniformly over a class of candidate models. Nagarajan and Kolter (2019) show that in certain simple linear and neural-network settings, any uniform convergence bound will be vacuous, leaving open the question of how to prove generalization in settings where UC fails. Our main contribution is proving novel generalization bounds in two such settings, one linear, and one non-linear. We study the linear classification setting of Nagarajan and Kolter, and a quadratic ground truth function learned via a two-layer neural network in the non-linear regime. We prove a new type of margin bound showing that above a certain signal-to-noise threshold, any near-max-margin classifier will achieve almost no test loss in these two settings. Our results show that near-max-margin is important: while any model that achieves at least a $(1 - ε)$-fraction of the max-margin generalizes well, a classifier achieving half of the max-margin may fail terribly. Building on the impossibility results of Nagarajan and Kolter, under slightly stronger assumptions, we show that one-sided UC bounds and classical margin bounds will fail on near-max-margin classifiers. Our analysis provides insight on why memorization can coexist with generalization: we show that in this challenging regime where generalization occurs but UC fails, near-max-margin classifiers simultaneously contain some generalizable components and some overfitting components that memorize the data. The presence of the overfitting components is enough to preclude UC, but the near-extremal margin guarantees that sufficient generalizable components are present.

99.0DMApr 10

On Worst-Case Optimal Polynomial Intersection

Yihang Sun, Mary Wootters

The Optimal Polynomial Intersection (OPI) problem is the following: Given sets $S_1, \ldots, S_m \subseteq \mathbb{F}$ and evaluation points $a_1, \ldots, a_m \in \mathbb{F}$, find a polynomial $Q \in \mathbb{F}[x]$ of degree less than $n$ so that $Q(a_i) \in S_i$ for as many $i \in \{1, 2, \ldots, m\}$ as possible. Decoded Quantum Interferometry (DQI) is a quantum algorithm that efficiently returns good solutions to the problem, even on worst-case instances (Jordan et. al., 2025). The quality of the solutions returned follows a semicircle law, which outperforms known efficient classical algorithms. But does DQI obtain the best possible solutions? That is, are there solutions better than the semicircle law for worst-case OPI instances? Surprisingly, before this work, the best existential results coincide with (and follow from) the best algorithmic results. In this work, we show that there are better solutions for worst-case OPI instances over prime fields. In particular, DQI and the semicircle law are not optimal. For example, when the lists $S_i$ have size $Ïp$ for $Ï\sim 1/2$, our results imply the existence of a solution that asymptotically beats the semicircle law whenever $n/m \geq 0.6225$, and we show that an asymptotically perfect solution exists whenever $n/m \geq 0.7496$. Our results generalize to Max-LINSAT problems derived from any Maximum Distance Separable (MDS) code, and to any $Ï\in (0,1)$. The key insight to our improvement is a connection to local leakage resilience of secret sharing schemes. Along the way, we recover several re-proofs of the existence of solutions achieving the semicircle law.

ITNov 19, 2021

On the Download Rate of Homomorphic Secret Sharing

Ingerid Fosli, Yuval Ishai, Victor I. Kolobov et al.

A homomorphic secret sharing (HSS) scheme is a secret sharing scheme that supports evaluating functions on shared secrets by means of a local mapping from input shares to output shares. We initiate the study of the download rate of HSS, namely, the achievable ratio between the length of the output shares and the output length when amortized over $\ell$ function evaluations. We obtain the following results. * In the case of linear information-theoretic HSS schemes for degree-$d$ multivariate polynomials, we characterize the optimal download rate in terms of the optimal minimal distance of a linear code with related parameters. We further show that for sufficiently large $\ell$ (polynomial in all problem parameters), the optimal rate can be realized using Shamir's scheme, even with secrets over $\mathbb{F}_2$. * We present a general rate-amplification technique for HSS that improves the download rate at the cost of requiring more shares. As a corollary, we get high-rate variants of computationally secure HSS schemes and efficient private information retrieval protocols from the literature. * We show that, in some cases, one can beat the best download rate of linear HSS by allowing nonlinear output reconstruction and $2^{-Ω(\ell)}$ error probability.

CRAug 17, 2021

A Note on the Permuted Puzzles Toy Conjecture

Keller Blackwell, Mary Wootters

In this note, we show that a "Toy Conjecture" made by (Boyle, Ishai, Pass, Wootters, 2017) is false, and propose a new one. Our attack does not falsify the full ("non-toy") conjecture in that work, and it is our hope that this note will help further the analysis of that conjecture. Independently, (Boyle, Holmgren, Ma, Weiss, 2021) have obtained similar results.

LGSep 22, 2020

Asynchronous Distributed Optimization with Stochastic Delays

Margalit Glasgow, Mary Wootters

We study asynchronous finite sum minimization in a distributed-data setting with a central parameter server. While asynchrony is well understood in parallel settings where the data is accessible by all machines -- e.g., modifications of variance-reduced gradient algorithms like SAGA work well -- little is known for the distributed-data setting. We develop an algorithm ADSAGA based on SAGA for the distributed-data setting, in which the data is partitioned between many machines. We show that with $m$ machines, under a natural stochastic delay model with an mean delay of $m$, ADSAGA converges in $\tilde{O}\left(\left(n + \sqrt{m}κ\right)\log(1/ε)\right)$ iterations, where $n$ is the number of component functions, and $κ$ is a condition number. This complexity sits squarely between the complexity $\tilde{O}\left(\left(n + κ\right)\log(1/ε)\right)$ of SAGA \textit{without delays} and the complexity $\tilde{O}\left(\left(n + mκ\right)\log(1/ε)\right)$ of parallel asynchronous algorithms where the delays are \textit{arbitrary} (but bounded by $O(m)$), and the data is accessible by all. Existing asynchronous algorithms with distributed-data setting and arbitrary delays have only been shown to converge in $\tilde{O}(n^2κ\log(1/ε))$ iterations. We empirically compare on least-squares problems the iteration complexity and wallclock performance of ADSAGA to existing parallel and distributed algorithms, including synchronous minibatch algorithms. Our results demonstrate the wallclock advantage of variance-reduced asynchronous approaches over SGD or synchronous approaches.

MLJun 17, 2020

Approximate Gradient Coding with Optimal Decoding

Margalit Glasgow, Mary Wootters

In distributed optimization problems, a technique called gradient coding, which involves replicating data points, has been used to mitigate the effect of straggling machines. Recent work has studied approximate gradient coding, which concerns coding schemes where the replication factor of the data is too low to recover the full gradient exactly. Our work is motivated by the challenge of creating approximate gradient coding schemes that simultaneously work well in both the adversarial and stochastic models. To that end, we introduce novel approximate gradient codes based on expander graphs, in which each machine receives exactly two blocks of data points. We analyze the decoding error both in the random and adversarial straggler setting, when optimal decoding coefficients are used. We show that in the random setting, our schemes achieve an error to the gradient that decays exponentially in the replication factor. In the adversarial setting, the error is nearly a factor of two smaller than any existing code with similar performance in the random setting. We show convergence bounds both in the random and adversarial setting for gradient descent under standard assumptions using our codes. In the random setting, our convergence rate improves upon block-box bounds. In the adversarial setting, we show that gradient descent can converge down to a noise floor that scales linearly with the adversarial error to the gradient. We demonstrate empirically that our schemes achieve near-optimal error in the random setting and converge faster than algorithms which do not use the optimal decoding coefficients.

LGJun 23, 2015

Strategic Classification

Moritz Hardt, Nimrod Megiddo, Christos Papadimitriou et al.

Machine learning relies on the assumption that unseen test instances of a classification problem follow the same distribution as observed training data. However, this principle can break down when machine learning is used to make important decisions about the welfare (employment, education, health) of strategic individuals. Knowing information about the classifier, such individuals may manipulate their attributes in order to obtain a better classification outcome. As a result of this behavior---often referred to as gaming---the performance of the classifier may deteriorate sharply. Indeed, gaming is a well-known obstacle for using machine learning methods in practice; in financial policy-making, the problem is widely known as Goodhart's law. In this paper, we formalize the problem, and pursue algorithms for learning classifiers that are robust to gaming. We model classification as a sequential game between a player named "Jury" and a player named "Contestant." Jury designs a classifier, and Contestant receives an input to the classifier, which he may change at some cost. Jury's goal is to achieve high classification accuracy with respect to Contestant's original input and some underlying target classification function. Contestant's goal is to achieve a favorable classification outcome while taking into account the cost of achieving it. For a natural class of cost functions, we obtain computationally efficient learning algorithms which are near-optimal. Surprisingly, our algorithms are efficient even on concept classes that are computationally hard to learn. For general cost functions, designing an approximately optimal strategy-proof classifier, for inverse-polynomial approximation, is NP-hard.

LGJul 15, 2014

Fast matrix completion without the condition number

Moritz Hardt, Mary Wootters

We give the first algorithm for Matrix Completion whose running time and sample complexity is polynomial in the rank of the unknown target matrix, linear in the dimension of the matrix, and logarithmic in the condition number of the matrix. To the best of our knowledge, all previous algorithms either incurred a quadratic dependence on the condition number of the unknown matrix or a quadratic dependence on the dimension of the matrix in the running time. Our algorithm is based on a novel extension of Alternating Minimization which we show has theoretical guarantees under standard assumptions even in the presence of noise.