63.4DSMay 24
New and Improved Bounds for Markov PagingChirag Pabbaraju, Ali Vakilian
In the Markov paging model, one assumes that page requests are drawn from a Markov chain over the pages in memory, and the goal is to maintain a fast cache that suffers few page faults in expectation. While computing the optimal online algorithm $(\mathrm{OPT})$ for this problem naively takes time exponential in the size of the cache, the best-known polynomial-time approximation algorithm is the dominating distribution algorithm due to Lund, Phillips and Reingold (FOCS 1994), who showed that the algorithm is $4$-competitive against $\mathrm{OPT}$. We substantially improve their analysis and show that the dominating distribution algorithm is in fact $2$-competitive against $\mathrm{OPT}$. We also show a lower bound of $1.5907$-competitiveness for this algorithm -- to the best of our knowledge, no such lower bound was previously known.
40.4LGJun 1
From Non-Convex to Strongly Convex: Curvature-Adaptive FTPL for Online OptimizationMoses Charikar, Chirag Pabbaraju, Ambuj Tewari
Curvature adaptivity is a classical theme in online optimization: for convex Lipschitz losses, adaptive methods interpolate between the optimal $O(\sqrt{T})$ regret for general convex losses and $O(\log T)$ regret under strong convexity. Recent work has shown that Follow-the-Perturbed-Leader (FTPL) achieves optimal $O(\sqrt{T})$ regret even for online non-convex Lipschitz losses, assuming access to an approximate offline-optimization oracle, but these guarantees do not exploit curvature. We show that FTPL can be made curvature-adaptive in the non-convex setting, without knowing in advance how curvature will accumulate over time. Our algorithm replaces the fixed perturbation scale of standard FTPL with a time-varying scale chosen using only past information. We give a simple follow-the-leader tuning rule for this scale and show that it competes, up to constants, with the best choice in hindsight. The resulting method achieves $O(\sqrt{T})$ regret for arbitrary non-convex Lipschitz losses and improves as cumulative curvature grows; with sufficiently accurate oracle calls, it achieves $O(\log T)$ regret when cumulative curvature grows linearly, which includes the classical strongly convex regime. We complement these upper bounds with matching lower bounds for prescribed cumulative-curvature sequences, already for one-dimensional convex losses, showing that the tradeoff between worst-case non-convex regret and curvature-driven fast rates is intrinsic.
MLNov 7, 2022
A Characterization of List LearnabilityMoses Charikar, Chirag Pabbaraju
A classical result in learning theory shows the equivalence of PAC learnability of binary hypothesis classes and the finiteness of VC dimension. Extending this to the multiclass setting was an open problem, which was settled in a recent breakthrough result characterizing multiclass PAC learnability via the DS dimension introduced earlier by Daniely and Shalev-Shwartz. In this work we consider list PAC learning where the goal is to output a list of $k$ predictions. List learning algorithms have been developed in several settings before and indeed, list learning played an important role in the recent characterization of multiclass learnability. In this work we ask: when is it possible to $k$-list learn a hypothesis class? We completely characterize $k$-list learnability in terms of a generalization of DS dimension that we call the $k$-DS dimension. Generalizing the recent characterization of multiclass learnability, we show that a hypothesis class is $k$-list learnable if and only if the $k$-DS dimension is finite.
LGJun 3, 2023
Provable benefits of score matchingChirag Pabbaraju, Dhruv Rohatgi, Anish Sevekari et al.
Score matching is an alternative to maximum likelihood (ML) for estimating a probability distribution parametrized up to a constant of proportionality. By fitting the ''score'' of the distribution, it sidesteps the need to compute this constant of proportionality (which is often intractable). While score matching and variants thereof are popular in practice, precise theoretical understanding of the benefits and tradeoffs with maximum likelihood -- both computational and statistical -- are not well understood. In this work, we give the first example of a natural exponential family of distributions such that the score matching loss is computationally efficient to optimize, and has a comparable statistical efficiency to ML, while the ML loss is intractable to optimize using a gradient-based method. The family consists of exponentials of polynomials of fixed degree, and our result can be viewed as a continuous analogue of recent developments in the discrete setting. Precisely, we show: (1) Designing a zeroth-order or first-order oracle for optimizing the maximum likelihood loss is NP-hard. (2) Maximum likelihood has a statistical efficiency polynomial in the ambient dimension and the radius of the parameters of the family. (3) Minimizing the score matching loss is both computationally and statistically efficient, with complexity polynomial in the ambient dimension.
LGOct 1, 2022
Pitfalls of Gaussians as a noise distribution in NCEHolden Lee, Chirag Pabbaraju, Anish Sevekari et al.
Noise Contrastive Estimation (NCE) is a popular approach for learning probability density functions parameterized up to a constant of proportionality. The main idea is to design a classification problem for distinguishing training data from samples from an easy-to-sample noise distribution $q$, in a manner that avoids having to calculate a partition function. It is well-known that the choice of $q$ can severely impact the computational and statistical efficiency of NCE. In practice, a common choice for $q$ is a Gaussian which matches the mean and covariance of the data. In this paper, we show that such a choice can result in an exponentially bad (in the ambient dimension) conditioning of the Hessian of the loss, even for very simple data distributions. As a consequence, both the statistical and algorithmic complexity for such a choice of $q$ will be problematic in practice, suggesting that more complex noise distributions are essential to the success of NCE.
100.0DSMar 29
An Optimal Algorithm for Stochastic Vertex CoverJan van den Brand, Inge Li Gørtz, Chirag Pabbaraju et al.
The goal in the stochastic vertex cover problem is to obtain an approximately minimum vertex cover for a graph $G^\star$ that is realized by sampling each edge independently with some probability $p\in (0, 1]$ in a base graph $G = (V, E)$. The algorithm is given the base graph $G$ and the probability $p$ as inputs, but its only access to the realized graph $G^\star$ is through queries on individual edges in $G$ that reveal the existence (or not) of the queried edge in $G^\star$. In this paper, we resolve the central open question for this problem: to find a $(1+\varepsilon)$-approximate vertex cover using only $O_\varepsilon(n/p)$ edge queries. Prior to our work, there were two incomparable state-of-the-art results for this problem: a $(3/2+\varepsilon)$-approximation using $O_\varepsilon(n/p)$ queries (Derakhshan, Durvasula, and Haghtalab, 2023) and a $(1+\varepsilon)$-approximation using $O_\varepsilon((n/p)\cdot \mathrm{RS}(n))$ queries (Derakhshan, Saneian, and Xun, 2025), where $\mathrm{RS}(n)$ is known to be at least $2^{Ω\left(\frac{\log n}{\log \log n}\right)}$ and could be as large as $\frac{n}{2^{Î(\log^* n)}}$. Our improved upper bound of $O_{\varepsilon}(n/p)$ matches the known lower bound of $Ω(n/p)$ for any constant-factor approximation algorithm for this problem (Behnezhad, Blum, and Derakhshan, 2022). A key tool in our result is a new concentration bound for the size of minimum vertex cover on random graphs, which might be of independent interest.
LGAug 12, 2023
Multiclass Learnability Does Not Imply Sample CompressionChirag Pabbaraju
A hypothesis class admits a sample compression scheme, if for every sample labeled by a hypothesis from the class, it is possible to retain only a small subsample, using which the labels on the entire sample can be inferred. The size of the compression scheme is an upper bound on the size of the subsample produced. Every learnable binary hypothesis class (which must necessarily have finite VC dimension) admits a sample compression scheme of size only a finite function of its VC dimension, independent of the sample size. For multiclass hypothesis classes, the analog of VC dimension is the DS dimension. We show that the analogous statement pertaining to sample compression is not true for multiclass hypothesis classes: every learnable multiclass hypothesis class, which must necessarily have finite DS dimension, does not admit a sample compression scheme of size only a finite function of its DS dimension.
DSNov 19, 2023
Testing with Non-identically Distributed SamplesShivam Garg, Chirag Pabbaraju, Kirankumar Shiragur et al.
We examine the extent to which sublinear-sample property testing and estimation apply to settings where samples are independently but not identically distributed. Specifically, we consider the following distributional property testing framework: Suppose there is a set of distributions over a discrete support of size $k$, $p_1, p_2,\ldots,p_T$, and we obtain $c$ independent draws from each distribution. Suppose the goal is to learn or test a property of the average distribution, $p_{avg}$. This setup models a number of important practical settings where the individual distributions correspond to heterogeneous entities -- either individuals, chronologically distinct time periods, spatially separated data sources, etc. From a learning standpoint, even with $c=1$ samples from each distribution, $Θ(k/\varepsilon^2)$ samples are necessary and sufficient to learn $p_{avg}$ to within error $\varepsilon$ in $\ell_1$ distance. To test uniformity or identity -- distinguishing the case that $p_{avg}$ is equal to some reference distribution, versus has $\ell_1$ distance at least $\varepsilon$ from the reference distribution, we show that a linear number of samples in $k$ is necessary given $c=1$ samples from each distribution. In contrast, for $c \ge 2$, we recover the usual sublinear sample testing guarantees of the i.i.d.\ setting: we show that $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$ total samples are sufficient, matching the optimal sample complexity in the i.i.d.\ case in the regime where $\varepsilon \ge k^{-1/4}$. Additionally, we show that in the $c=2$ case, there is a constant $ρ> 0$ such that even in the linear regime with $ρk$ samples, no tester that considers the multiset of samples (ignoring which samples were drawn from the same $p_i$) can perform uniformity testing. We also extend our techniques to the problem of testing "closeness" of two distributions.
LGOct 2, 2023
Harnessing the Power of Choices in Decision Tree LearningGuy Blanc, Jane Lange, Chirag Pabbaraju et al.
We propose a simple generalization of standard and empirically successful decision tree learning algorithms such as ID3, C4.5, and CART. These algorithms, which have been central to machine learning for decades, are greedy in nature: they grow a decision tree by iteratively splitting on the best attribute. Our algorithm, Top-$k$, considers the $k$ best attributes as possible splits instead of just the single best attribute. We demonstrate, theoretically and empirically, the power of this simple generalization. We first prove a {\sl greediness hierarchy theorem} showing that for every $k \in \mathbb{N}$, Top-$(k+1)$ can be dramatically more powerful than Top-$k$: there are data distributions for which the former achieves accuracy $1-\varepsilon$, whereas the latter only achieves accuracy $\frac1{2}+\varepsilon$. We then show, through extensive experiments, that Top-$k$ outperforms the two main approaches to decision tree learning: classic greedy algorithms and more recent "optimal decision tree" algorithms. On one hand, Top-$k$ consistently enjoys significant accuracy gains over greedy algorithms across a wide range of benchmarks. On the other hand, Top-$k$ is markedly more scalable than optimal decision tree algorithms and is able to handle dataset and feature set sizes that remain far beyond the reach of these algorithms.
LGJan 30
Agnostic Language Identification and GenerationMikael Møller Høgsgaard, Chirag Pabbaraju
Recent works on language identification and generation have established tight statistical rates at which these tasks can be achieved. These works typically operate under a strong realizability assumption: that the input data is drawn from an unknown distribution necessarily supported on some language in a given collection. In this work, we relax this assumption of realizability entirely, and impose no restrictions on the distribution of the input data. We propose objectives to study both language identification and generation in this more general "agnostic" setup. Across both problems, we obtain novel interesting characterizations and nearly tight rates.
CLNov 6, 2025
A Characterization of List Language Identification in the LimitMoses Charikar, Chirag Pabbaraju, Ambuj Tewari
We study the problem of language identification in the limit, where given a sequence of examples from a target language, the goal of the learner is to output a sequence of guesses for the target language such that all the guesses beyond some finite time are correct. Classical results of Gold showed that language identification in the limit is impossible for essentially any interesting collection of languages. Later, Angluin gave a precise characterization of language collections for which this task is possible. Motivated by recent positive results for the related problem of language generation, we revisit the classic language identification problem in the setting where the learner is given the additional power of producing a list of $k$ guesses at each time step. The goal is to ensure that beyond some finite time, one of the guesses is correct at each time step. We give an exact characterization of collections of languages that can be $k$-list identified in the limit, based on a recursive version of Angluin's characterization (for language identification with a list of size $1$). This further leads to a conceptually appealing characterization: A language collection can be $k$-list identified in the limit if and only if the collection can be decomposed into $k$ collections of languages, each of which can be identified in the limit (with a list of size $1$). We also use our characterization to establish rates for list identification in the statistical setting where the input is drawn as an i.i.d. stream from a distribution supported on some language in the collection. Our results show that if a collection is $k$-list identifiable in the limit, then the collection can be $k$-list identified at an exponential rate, and this is best possible. On the other hand, if a collection is not $k$-list identifiable in the limit, then it cannot be $k$-list identified at any rate that goes to zero.
LGSep 28, 2024
A Characterization of List RegressionChirag Pabbaraju, Sahasrajit Sarmasarkar
There has been a recent interest in understanding and characterizing the sample complexity of list learning tasks, where the learning algorithm is allowed to make a short list of $k$ predictions, and we simply require one of the predictions to be correct. This includes recent works characterizing the PAC sample complexity of standard list classification and online list classification. Adding to this theme, in this work, we provide a complete characterization of list PAC regression. We propose two combinatorial dimensions, namely the $k$-OIG dimension and the $k$-fat-shattering dimension, and show that they characterize realizable and agnostic $k$-list regression respectively. These quantities generalize known dimensions for standard regression. Our work thus extends existing list learning characterizations from classification to regression.
LGFeb 23
The Sample Complexity of Replicable Realizable PAC LearningKasper Green Larsen, Markus Engelund Mathiasen, Chirag Pabbaraju et al.
In this paper, we consider the problem of replicable realizable PAC learning. We construct a particularly hard learning problem and show a sample complexity lower bound with a close to $(\log|H|)^{3/2}$ dependence on the size of the hypothesis class $H$. Our proof uses several novel techniques and works by defining a particular Cayley graph associated with $H$ and analyzing a suitable random walk on this graph by examining the spectral properties of its adjacency matrix. Furthermore, we show an almost matching upper bound for the lower bound instance, meaning if a stronger lower bound exists, one would have to consider a different instance of the problem.
LGJan 5
Learning with Monotone Adversarial CorruptionsKasper Green Larsen, Chirag Pabbaraju, Abhishek Shetty
We study the extent to which standard machine learning algorithms rely on exchangeability and independence of data by introducing a monotone adversarial corruption model. In this model, an adversary, upon looking at a "clean" i.i.d. dataset, inserts additional "corrupted" points of their choice into the dataset. These added points are constrained to be monotone corruptions, in that they get labeled according to the ground-truth target function. Perhaps surprisingly, we demonstrate that in this setting, all known optimal learning algorithms for binary classification can be made to achieve suboptimal expected error on a new independent test point drawn from the same distribution as the clean dataset. On the other hand, we show that uniform convergence-based algorithms do not degrade in their guarantees. Our results showcase how optimal learning algorithms break down in the face of seemingly helpful monotone corruptions, exposing their overreliance on exchangeability.
56.9LGApr 27
The Optimal Sample Complexity of Multiclass and List LearningChirag Pabbaraju
While the optimal sample complexity of binary classification in terms of the VC dimension is well-established, determining the optimal sample complexity of multiclass classification has remained open. The appropriate complexity parameter for multiclass classification is the DS dimension, and despite significant efforts, a gap of $\sqrt{\text{DS}}$ has persisted between the upper and lower bounds on sample complexity. Recent work by Hanneke et al. (2026) shows a novel algebraic characterization of multiclass hypothesis classes in terms of their DS dimension. Building up on this, we show that the maximum hypergraph density of any multiclass hypothesis class is upper-bounded by its DS dimension. This proves a longstanding conjecture of Daniely and Shalev-Shwartz (2014). As a consequence, we determine the optimal dependence of the sample complexity on the DS dimension for multiclass as well as list learning.
DSNov 22, 2024
Exploring Facets of Language Generation in the LimitMoses Charikar, Chirag Pabbaraju
The recent work of Kleinberg & Mullainathan [KM24] provides a concrete model for language generation in the limit: given a sequence of examples from an unknown target language, the goal is to generate new examples from the target language such that no incorrect examples are generated beyond some point. In sharp contrast to strong negative results for the closely related problem of language identification, they establish positive results for language generation in the limit for all countable collections of languages. Follow-up work by Raman & Tewari [RT24] studies bounds on the number of distinct inputs required by an algorithm before correct language generation is achieved -- namely, whether this is a constant for all languages in the collection (uniform generation) or a language-dependent constant (non-uniform generation). We show that every countable language collection has a generator which has the stronger property of non-uniform generation in the limit. However, while the generation algorithm of [KM24] can be implemented using membership queries, we show that any algorithm cannot non-uniformly generate even for collections of just two languages, using only membership queries. We also formalize the tension between validity and breadth in the generation algorithm of [KM24] by introducing a definition of exhaustive generation, and show a strong negative result for exhaustive generation. Our result shows that a tradeoff between validity and breadth is inherent for generation in the limit. We also provide a precise characterization of the language collections for which exhaustive generation is possible. Finally, inspired by algorithms that can choose to obtain feedback, we consider a model of uniform generation with feedback, completely characterizing language collections for which such uniform generation with feedback is possible in terms of a complexity measure of the collection.
LGJan 31, 2025
Relating Misfit to Gain in Weak-to-Strong Generalization Beyond the Squared LossAbhijeet Mulgund, Chirag Pabbaraju
The paradigm of weak-to-strong generalization constitutes the training of a strong AI model on data labeled by a weak AI model, with the goal that the strong model nevertheless outperforms its weak supervisor on the target task of interest. For the setting of real-valued regression with the squared loss, recent work quantitatively characterizes the gain in performance of the strong model over the weak model in terms of the misfit between the strong and weak model. We generalize such a characterization to learning tasks whose loss functions correspond to arbitrary Bregman divergences when the strong class is convex. This extends the misfit-based characterization of performance gain in weak-to-strong generalization to classification tasks, as the cross-entropy loss can be expressed in terms of a Bregman divergence. In most practical scenarios, however, the strong model class may not be convex. We therefore weaken this assumption and study weak-to-strong generalization for convex combinations of $k$ strong models in the strong class, in the concrete setting of classification. This allows us to obtain a similar misfit-based characterization of performance gain, upto an additional error term that vanishes as $k$ gets large. Our theoretical findings are supported by thorough experiments on synthetic as well as real-world datasets.
DSOct 3, 2025
Pareto-optimal Non-uniform Language GenerationMoses Charikar, Chirag Pabbaraju
Kleinberg and Mullainathan (2024) recently proposed an interesting model for language generation in the limit: Given a countable collection of languages, and an adversary enumerating the strings of some language $L$ from the collection, the objective is to generate new strings from the target language, such that all strings generated beyond some finite time are valid. Li, Raman and Tewari (2024) and Charikar and Pabbaraju (2024) showed strong non-uniform generation guarantees in this model, giving algorithms that generate new valid strings from $L$ after seeing a number of distinct input strings $t(L)$ that depends only on $L$ (and the collection), but not the enumeration order. However, for both these works, the language-wise generation times $t(L)$ of the algorithm can be strictly sub-optimal. In this work, we study Pareto-optimality of non-uniform language generation in the limit. We propose an algorithm, whose generation times $t^\star(L)$ are (almost) Pareto-optimal: any other algorithm whose generation time for some language $L$ is strictly smaller than $t^\star(L)$, must satisfy that its generation time for some other language $L'$ is strictly worse than $t^\star(L')$. Pareto-optimality is essentially the best that one can achieve for non-uniform generation. Our algorithmic framework conveniently adapts to further give Pareto-optimal non-uniform generation algorithms in the practically motivated settings of noisy as well as representative generation.
MLMay 6, 2025
Lower Bounds for Greedy Teaching Set ConstructionsSpencer Compton, Chirag Pabbaraju, Nikita Zhivotovskiy
A fundamental open problem in learning theory is to characterize the best-case teaching dimension $\operatorname{TS}_{\min}$ of a concept class $\mathcal{C}$ with finite VC dimension $d$. Resolving this problem will, in particular, settle the conjectured upper bound on Recursive Teaching Dimension posed by [Simon and Zilles; COLT 2015]. Prior work used a natural greedy algorithm to construct teaching sets recursively, thereby proving upper bounds on $\operatorname{TS}_{\min}$, with the best known bound being $O(d^2)$ [Hu, Wu, Li, and Wang; COLT 2017]. In each iteration, this greedy algorithm chooses to add to the teaching set the $k$ labeled points that restrict the concept class the most. In this work, we prove lower bounds on the performance of this greedy approach for small $k$. Specifically, we show that for $k = 1$, the algorithm does not improve upon the halving-based bound of $O(\log(|\mathcal{C}|))$. Furthermore, for $k = 2$, we complement the upper bound of $O\left(\log(\log(|\mathcal{C}|))\right)$ from [Moran, Shpilka, Wigderson, and Yuhudayoff; FOCS 2015] with a matching lower bound. Most consequentially, our lower bound extends up to $k \le \lceil c d \rceil$ for small constant $c>0$: suggesting that studying higher-order interactions may be necessary to resolve the conjecture that $\operatorname{TS}_{\min} = O(d)$.
LGApr 11, 2025
Proofs as Explanations: Short Certificates for Reliable PredictionsAvrim Blum, Steve Hanneke, Chirag Pabbaraju et al.
We consider a model for explainable AI in which an explanation for a prediction $h(x)=y$ consists of a subset $S'$ of the training data (if it exists) such that all classifiers $h' \in H$ that make at most $b$ mistakes on $S'$ predict $h'(x)=y$. Such a set $S'$ serves as a proof that $x$ indeed has label $y$ under the assumption that (1) the target function $h^\star$ belongs to $H$, and (2) the set $S$ contains at most $b$ corrupted points. For example, if $b=0$ and $H$ is the family of linear classifiers in $\mathbb{R}^d$, and if $x$ lies inside the convex hull of the positive data points in $S$ (and hence every consistent linear classifier labels $x$ as positive), then Carathéodory's theorem states that $x$ lies inside the convex hull of $d+1$ of those points. So, a set $S'$ of size $d+1$ could be released as an explanation for a positive prediction, and would serve as a short proof of correctness of the prediction under the assumption of realizability. In this work, we consider this problem more generally, for general hypothesis classes $H$ and general values $b\geq 0$. We define the notion of the robust hollow star number of $H$ (which generalizes the standard hollow star number), and show that it precisely characterizes the worst-case size of the smallest certificate achievable, and analyze its size for natural classes. We also consider worst-case distributional bounds on certificate size, as well as distribution-dependent bounds that we show tightly control the sample size needed to get a certificate for any given test example. In particular, we define a notion of the certificate coefficient $\varepsilon_x$ of an example $x$ with respect to a data distribution $D$ and target function $h^\star$, and prove matching upper and lower bounds on sample size as a function of $\varepsilon_x$, $b$, and the VC dimension $d$ of $H$.
LGJun 22, 2024
Credit Attribution and Stable CompressionRoi Livni, Shay Moran, Kobbi Nissim et al.
Credit attribution is crucial across various fields. In academic research, proper citation acknowledges prior work and establishes original contributions. Similarly, in generative models, such as those trained on existing artworks or music, it is important to ensure that any generated content influenced by these works appropriately credits the original creators. We study credit attribution by machine learning algorithms. We propose new definitions--relaxations of Differential Privacy--that weaken the stability guarantees for a designated subset of $k$ datapoints. These $k$ datapoints can be used non-stably with permission from their owners, potentially in exchange for compensation. Meanwhile, the remaining datapoints are guaranteed to have no significant influence on the algorithm's output. Our framework extends well-studied notions of stability, including Differential Privacy ($k = 0$), differentially private learning with public data (where the $k$ public datapoints are fixed in advance), and stable sample compression (where the $k$ datapoints are selected adaptively by the algorithm). We examine the expressive power of these stability notions within the PAC learning framework, provide a comprehensive characterization of learnability for algorithms adhering to these principles, and propose directions and questions for future research.
LGJul 7, 2021
Universal Approximation for Log-concave Distributions using Well-conditioned Normalizing FlowsHolden Lee, Chirag Pabbaraju, Anish Sevekari et al.
Normalizing flows are a widely used class of latent-variable generative models with a tractable likelihood. Affine-coupling (Dinh et al, 2014-16) models are a particularly common type of normalizing flows, for which the Jacobian of the latent-to-observable-variable transformation is triangular, allowing the likelihood to be computed in linear time. Despite the widespread usage of affine couplings, the special structure of the architecture makes understanding their representational power challenging. The question of universal approximation was only recently resolved by three parallel papers (Huang et al.,2020;Zhang et al.,2020;Koehler et al.,2020) -- who showed reasonably regular distributions can be approximated arbitrarily well using affine couplings -- albeit with networks with a nearly-singular Jacobian. As ill-conditioned Jacobians are an obstacle for likelihood-based training, the fundamental question remains: which distributions can be approximated using well-conditioned affine coupling flows? In this paper, we show that any log-concave distribution can be approximated using well-conditioned affine-coupling flows. In terms of proof techniques, we uncover and leverage deep connections between affine coupling architectures, underdamped Langevin dynamics (a stochastic differential equation often used to sample from Gibbs measures) and Hénon maps (a structured dynamical system that appears in the study of symplectic diffeomorphisms). Our results also inform the practice of training affine couplings: we approximate a padded version of the input distribution with iid Gaussians -- a strategy which Koehler et al.(2020) empirically observed to result in better-conditioned flows, but had hitherto no theoretical grounding. Our proof can thus be seen as providing theoretical evidence for the benefits of Gaussian padding when training normalizing flows.
LGDec 4, 2020
Efficient semidefinite-programming-based inference for binary and multi-class MRFsChirag Pabbaraju, Po-Wei Wang, J. Zico Kolter
Probabilistic inference in pairwise Markov Random Fields (MRFs), i.e. computing the partition function or computing a MAP estimate of the variables, is a foundational problem in probabilistic graphical models. Semidefinite programming relaxations have long been a theoretically powerful tool for analyzing properties of probabilistic inference, but have not been practical owing to the high computational cost of typical solvers for solving the resulting SDPs. In this paper, we propose an efficient method for computing the partition function or MAP estimate in a pairwise MRF by instead exploiting a recently proposed coordinate-descent-based fast semidefinite solver. We also extend semidefinite relaxations from the typical binary MRF to the full multi-class setting, and develop a compact semidefinite relaxation that can again be solved efficiently using the solver. We show that the method substantially outperforms (both in terms of solution quality and speed) the existing state of the art in approximate inference, on benchmark problems drawn from previous work. We also show that our approach can scale to large MRF domains such as fully-connected pairwise CRF models used in computer vision.
LGJul 12, 2019
Learning Functions over Sets via Permutation Adversarial NetworksChirag Pabbaraju, Prateek Jain
In this paper, we consider the problem of learning functions over sets, i.e., functions that are invariant to permutations of input set items. Recent approaches of pooling individual element embeddings can necessitate extremely large embedding sizes for challenging functions. We address this challenge by allowing standard neural networks like LSTMs to succinctly capture the function over the set. However, to ensure invariance with respect to permutations of set elements, we propose a novel architecture called SPAN that simultaneously learns the function as well as adversarial or worst-case permutations for each input set. The learning problem reduces to a min-max optimization problem that is solved via a simple alternating block coordinate descent technique. We conduct extensive experiments on a variety of set-learning tasks and demonstrate that SPAN learns nearly permutation-invariant functions while still ensuring accuracy on test data. On a variety of tasks sampled from the domains of statistics, graph functions and linear algebra, we show that our method can significantly outperform state-of-the-art methods such as DeepSets and Janossy Pooling. Finally, we present a case study of how learning set-functions can help extract powerful features for recommendation systems, and show that such a method can be as much as 2% more accurate than carefully hand-tuned features on a real-world recommendation system.