Feras A. Saad

h-index: 7

17 papers

365 citations

Novelty: 61%

AI Score: 55

Ranked #7,933 of 194,257 authors (top 4%); #2,095 in LG (top 5%)

17 Papers

7.9 · DS · Apr 24 · Code

Efficient Rejection Sampling in the Entropy-Optimal Range

Thomas L. Draper, Feras A. Saad

We study the problem of generating a random variate $X$ from a finite discrete probability distribution $P$ using an entropy source of independent fair coin flips. A classic result from Knuth and Yao shows that the optimal expected number of input coin flips per output sample lies between $H(P)$ and $H(P)\,{+}\,2$, where $H$ is the Shannon entropy function. However, implementing the Knuth and Yao "entropy-optimal" sampler entails a tradeoff between using either exponential space with low runtime per sample, or linear space with high runtime per sample. We introduce a new sampling algorithm that avoids this tradeoff: it requires linearithmic space, incurs negligible runtime overhead per sample, and uses an expected number of coin flips that lies in the entropy-optimal range $[H(P), H(P)\,{+}\,2)$. No previous sampler for discrete distributions simultaneously achieves these space, time, and entropy characteristics. Numerical experiments demonstrate improvements in runtime and entropy of the proposed method compared to the celebrated alias method.
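For orientation, here is a minimal Python sketch of a plain rejection sampler driven by fair coin flips, together with the Shannon entropy that defines the range above. It is a textbook baseline, not the algorithm proposed in the paper and not the Knuth and Yao sampler. The names `shannon_entropy` and `rejection_sample` and the integer-weight representation of $P$ are assumptions made for illustration.

```python
import math
import random
from typing import Callable, List


def shannon_entropy(weights: List[int]) -> float:
    """Shannon entropy (in bits) of P(i) = weights[i] / sum(weights)."""
    total = sum(weights)
    return -sum((w / total) * math.log2(w / total) for w in weights if w > 0)


def rejection_sample(weights: List[int], flip: Callable[[], int]) -> int:
    """Return index i with probability weights[i] / sum(weights).

    Baseline scheme (hypothetical helper, not the paper's method):
    read k fair bits to form a uniform integer u in [0, 2^k), reject
    u if it is >= the total weight, otherwise locate u in the
    cumulative weights.
    """
    total = sum(weights)
    k = max(1, (total - 1).bit_length())  # smallest k >= 1 with 2^k >= total
    while True:
        u = 0
        for _ in range(k):
            u = 2 * u + flip()
        if u < total:  # accept
            for i, w in enumerate(weights):
                if u < w:
                    return i
                u -= w
        # otherwise reject: all k bits are discarded and the loop repeats


if __name__ == "__main__":
    weights = [1, 2]  # P = (1/3, 2/3)
    coin = lambda: random.getrandbits(1)
    print("H(P) =", shannon_entropy(weights))
    print("sample =", rejection_sample(weights, coin))
```

This baseline consumes $k \cdot 2^k / \text{total}$ flips in expectation, which can exceed $H(P) + 2$ (for example, a uniform distribution on 5 outcomes costs 4.8 flips while $H(P) + 2 \approx 4.32$). That gap is the kind of inefficiency an entropy-optimal sampler avoids.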

20.6 · CC · Jul 16

Space-Entropy Lower Bounds for Random Sampling

Thomas L. Draper, Feras A. Saad

We prove fundamental space lower bounds for exact random sampling using an entropy source of i.i.d. uniform bits. A classic result from information theory shows that generating $n$ discrete random variables $X_1, \dots, X_n$ requires at least $H(X_1, \dots, X_n)$ input random bits on average, where $H$ is the Shannon entropy function. How much space must a random sampling algorithm use in order to approach this information-theoretically optimal entropy bound? We prove that any random sampling algorithm that is exact for arbitrary discrete target distributions and consumes at most $H(X_1,\ldots,X_n)+\varepsilon n+o(n)$ input bits in expectation for every output process must use $\Omega(\log(1/\varepsilon))$ bits of space. In fact, i.i.d. sampling from the single distribution $\mathrm{Bernoulli}(1/3)$ already forces at least $(1/{5.116201}-o(1))\log(1/\varepsilon)$ bits of space. If the sampler handles a family of infinitely many Bernoulli distributions, we show a sharper bound of at least $\log(1/\varepsilon)$ bits of space. We also prove lower bounds for general i.i.d. sampling: for almost every distribution on $k$ outcomes, the space is at least $(1/(k+1)-o(1))\log(1/\varepsilon)$ bits. The proof technique is based on a graph-theoretic analysis of the amount of information that any algorithm can store in its state. Finite state spaces force short cycles around the state-transition graph, and the loss around such cycles reduces to Diophantine lower bounds on fractional parts of integer combinations of log-probabilities. To the best of our knowledge, these results comprise the first known space lower bounds for entropy-efficient random sampling.

6.9 · PL · Jul 8

GradInf: Gradient Estimation as Probabilistic Inference

Gaurav Arya, Mathieu Huot, Moritz Schauer et al.

Gradient estimation -- the task of computing the gradient of the expected value of a probabilistic program -- has diverse applications in scientific computing, but is notoriously difficult because of issues such as high-dimensional integration, discrete random choices, and complex stochastic dependencies. This article introduces gradient inference, a new approach to developing sound and efficient gradient estimators for probabilistic programs. Gradient inference rests on a formal reduction from a gradient estimation problem to a closely related probabilistic inference problem, whose solution can be differentiated to obtain a gradient estimator. This inference problem is obtained by applying two powerful statistical operations -- coupling and factorization -- to the input probabilistic program. Our reduction lets us leverage the rich toolkit of probabilistic inference algorithms to design novel gradient estimators that extend and improve upon existing methods. We introduce GradInf, a probabilistic programming system that facilitates the sound and automated implementation of gradient inference. GradInf is centered around programmable source-to-source transformations for coupling and factorizing higher-order probabilistic programs, whose soundness is proven in terms of a denotational semantics. Key to our development is the use of information-flow typing to allow random choices in a probabilistic program to be factored out and partially evaluated, which improves our ability to deploy sophisticated probabilistic inference algorithms. The resulting system offers practitioners a principled framework for designing gradient estimators. We apply GradInf to several challenging case studies, showing that it can express prominent gradient estimators from the literature and enables the construction of new state-of-the-art estimators that outperform the best existing baselines.

10.7 · LG · Jul 13, 2023 · Code

Sequential Monte Carlo Learning for Time Series Structure Discovery

Feras A. Saad, Brian J. Patton, Matthew D. Hoffman et al.

This paper presents a new approach to automatically discovering accurate models of complex time series data. Working within a Bayesian nonparametric prior over a symbolic space of Gaussian process time series models, we present a novel structure learning algorithm that integrates sequential Monte Carlo (SMC) and involutive MCMC for highly effective posterior inference. Our method can be used both in "online" settings, where new data is incorporated sequentially in time, and in "offline" settings, by using nested subsets of historical data to anneal the posterior. Empirical measurements on real-world time series show that our method can deliver 10x--100x runtime speedups over previous MCMC and greedy-search structure learning algorithms targeting the same model family. We use our method to perform the first large-scale evaluation of Gaussian process time series structure learning on a prominent benchmark of 1,428 econometric datasets. The results show that our method discovers sensible models that deliver more accurate point forecasts and interval forecasts over multiple horizons as compared to widely used statistical and neural baselines that struggle on this challenging data.

7.4 · DS · May 7 · Code

Efficient Online Random Sampling via Randomness Recycling

Thomas L. Draper, Feras A. Saad

This article studies the fundamental problem of using i.i.d. coin tosses from an entropy source to efficiently generate random variables $X_i \sim P_i$ $(i \ge 1)$, where $(P_1, P_2, \dots)$ is a random sequence of rational discrete probability distributions subject to an arbitrary stochastic process. Our method achieves an amortized expected entropy cost within $\varepsilon > 0$ bits of the information-theoretically optimal Shannon lower bound using $O(\log(1/\varepsilon))$ space. This result holds both pointwise in terms of the Shannon information content conditioned on $X_i$ and $P_i$, and in expectation to obtain a rate of $\mathbb{E}[H(P_1) + \dots + H(P_n)]/n + \varepsilon$ bits per sample as $n \to \infty$ (where $H$ is the Shannon entropy). The combination of space, time, and entropy properties of our method improves upon the Knuth and Yao (1976) entropy-optimal algorithm and Han and Hoshi (1997) interval algorithm for online sampling, which require unbounded space. It also uses exponentially less space than the more specialized methods of Kozen and Soloviev (2022) and Shao and Wang (2025) that generate i.i.d. samples from a fixed distribution. Our online sampling algorithm rests on a powerful algorithmic technique called randomness recycling, which reuses a fraction of the random information consumed by a probabilistic algorithm to reduce its amortized entropy cost. On the practical side, we develop randomness recycling techniques to accelerate a variety of prominent sampling algorithms. We show that randomness recycling enables state-of-the-art runtime performance on the Fisher-Yates shuffle when using a cryptographically secure pseudorandom number generator, and that it reduces the entropy cost of discrete Gaussian sampling. Accompanying the manuscript is a performant software library in the C programming language.
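The idea of reusing randomness that would otherwise be thrown away can be illustrated with a well-known, much simpler scheme: Lumbroso's Fast Dice Roller, which keeps the leftover uniform state after a rejection instead of restarting from scratch. The Python sketch below applies it to a Fisher-Yates shuffle. It is an illustration only, not the randomness recycling algorithm or the C library described in the abstract, and the function names `fast_dice_roller` and `fisher_yates` are assumptions.

```python
import random
from typing import Callable, List, Sequence


def fast_dice_roller(n: int, flip: Callable[[], int]) -> int:
    """Return a uniform integer in [0, n) using fair coin flips.

    Invariant: c is uniformly distributed on [0, v).
    """
    v, c = 1, 0
    while True:
        v, c = 2 * v, 2 * c + flip()  # one more bit doubles the range
        if v >= n:
            if c < n:
                return c
            # Rejected: c - n is uniform on [0, v - n), so that residual
            # state is kept rather than discarded.
            v, c = v - n, c - n


def fisher_yates(items: Sequence, flip: Callable[[], int]) -> List:
    """Shuffle a copy of items, drawing every index from coin flips."""
    out = list(items)
    for i in range(len(out) - 1, 0, -1):
        j = fast_dice_roller(i + 1, flip)  # uniform index in [0, i]
        out[i], out[j] = out[j], out[i]
    return out


if __name__ == "__main__":
    coin = lambda: random.getrandbits(1)
    print(fisher_yates(range(10), coin))
```

Passing the coin as a callable keeps the entropy source swappable, for example with a cryptographically secure generator such as `secrets.randbits(1)`.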

11.3 · DS · Jul 15

Online Random Sampling with Real Probabilities

Thomas L. Draper, David G. Harris, Feras A. Saad

We develop an efficient online algorithm to sample a sequence of discrete random variables using an entropy source of i.i.d. fair coin flips, in a standard model of real computation where real-valued probabilities are represented by rational approximations. For any sequence $F_1, F_2, \dots$ of probability distributions, our sampler generates $n$ outputs $X_1 \sim F_1, \dots, X_n \sim F_n$ using at most $\mathbb{E}\left[H(F_1) +\dots + H(F_n)\right] + O(\log n)$ coin flips in expectation while carrying $O(\log n)$ bits of persistent space, where $H$ is the Shannon entropy. Under standard assumptions, we prove that the space used by our sampler to achieve this information-theoretically optimal entropy rate is asymptotically optimal. The key idea is to replace the global arithmetic-decoding sampling scheme of Han and Hoshi (1997) with a local discrete uniform state, yielding an exponential reduction in space for a given entropy loss. Our approach applies to distributions with irrational probabilities and countably infinite supports, generalizing recent randomness-recycling methods beyond finite rational distributions with bounded denominator.

16.4 · LG · Mar 12, 2024 · Code

Scalable Spatiotemporal Prediction with Bayesian Neural Fields

Feras Saad, Jacob Burnim, Colin Carroll et al.

Spatiotemporal datasets, which consist of spatially-referenced time series, are ubiquitous in diverse applications, such as air pollution monitoring, disease tracking, and cloud-demand forecasting. As the scale of modern datasets increases, there is a growing need for statistical methods that are flexible enough to capture complex spatiotemporal dynamics and scalable enough to handle many observations. This article introduces the Bayesian Neural Field (BayesNF), a domain-general statistical model that infers rich spatiotemporal probability distributions for data-analysis tasks including forecasting, interpolation, and variography. BayesNF integrates a deep neural network architecture for high-capacity function estimation with hierarchical Bayesian inference for robust predictive uncertainty quantification. Evaluations against prominent baselines show that BayesNF delivers improvements on prediction problems from climate and public health data containing tens to hundreds of thousands of measurements. Accompanying the paper is an open-source software package (https://github.com/google/bayesnf) that runs on GPU and TPU accelerators through the JAX machine learning platform.

13.0 · LG · Jun 19, 2025 · Code

Floating-Point Neural Networks Are Provably Robust Universal Approximators

Geonho Hwang, Wonyeol Lee, Yeachan Park et al.

The classical universal approximation (UA) theorem for neural networks establishes mild conditions under which a feedforward neural network can approximate a continuous function $f$ with arbitrary accuracy. A recent result shows that neural networks also enjoy a more general interval universal approximation (IUA) theorem, in the sense that the abstract interpretation semantics of the network using the interval domain can approximate the direct image map of $f$ (i.e., the result of applying $f$ to a set of inputs) with arbitrary accuracy. These theorems, however, rest on the unrealistic assumption that the neural network computes over infinitely precise real numbers, whereas their software implementations in practice compute over finite-precision floating-point numbers. An open question is whether the IUA theorem still holds in the floating-point setting. This paper introduces the first IUA theorem for floating-point neural networks that proves their remarkable ability to perfectly capture the direct image map of any rounded target function $f$, showing no limits exist on their expressiveness. Our IUA theorem in the floating-point setting exhibits material differences from the real-valued setting, which reflects the fundamental distinctions between these two computational models. This theorem also implies surprising corollaries, which include (i) the existence of provably robust floating-point neural networks; and (ii) the computational completeness of the class of straight-line programs that use only floating-point additions and multiplications for the class of all floating-point programs that halt.
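To make "abstract interpretation semantics using the interval domain" concrete, the following Python sketch propagates an input box through one affine layer followed by ReLU. It follows the real-valued interval semantics and ignores floating-point rounding, which is precisely the gap the paper addresses, so it should not be read as a sound floating-point analysis. The names `affine_interval` and `relu_interval` are assumptions for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def affine_interval(
    W: List[List[float]], b: List[float], box: List[Interval]
) -> List[Interval]:
    """Interval image of x -> W x + b over the input box.

    For each weight, a non-negative coefficient maps the lower input
    bound to the lower output bound; a negative coefficient swaps them.
    """
    out = []
    for row, bias in zip(W, b):
        lo = hi = bias
        for w, (l, u) in zip(row, box):
            if w >= 0:
                lo += w * l
                hi += w * u
            else:
                lo += w * u
                hi += w * l
        out.append((lo, hi))
    return out


def relu_interval(box: List[Interval]) -> List[Interval]:
    """ReLU is monotone, so it can be applied to each endpoint."""
    return [(max(0.0, l), max(0.0, u)) for l, u in box]


if __name__ == "__main__":
    W = [[1.0, -1.0]]
    b = [0.0]
    box = [(0.0, 1.0), (0.0, 1.0)]
    print(relu_interval(affine_interval(W, b, box)))
```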

5.3 · ML · Feb 24, 2022

Estimators of Entropy and Information via Inference in Probabilistic Models

Feras A. Saad, Marco Cusumano-Towner, Vikash K. Mansinghka

Estimating information-theoretic quantities such as entropy and mutual information is central to many problems in statistics and machine learning, but challenging in high dimensions. This paper presents estimators of entropy via inference (EEVI), which deliver upper and lower bounds on many information quantities for arbitrary variables in a probabilistic generative model. These estimators use importance sampling with proposal distribution families that include amortized variational inference and sequential Monte Carlo, which can be tailored to the target model and used to squeeze true information values with high accuracy. We present several theoretical properties of EEVI and demonstrate scalability and efficacy on two problems from the medical domain: (i) in an expert system for diagnosing liver disorders, we rank medical tests according to how informative they are about latent diseases, given a pattern of observed symptoms and patient attributes; and (ii) in a differential equation model of carbohydrate metabolism, we find optimal times to take blood glucose measurements that maximize information about a diabetic patient's insulin sensitivity, given their meal and medication schedule.

4.4 · LG · Aug 16, 2021 · Code

Hierarchical Infinite Relational Model

Feras A. Saad, Vikash K. Mansinghka

This paper describes the hierarchical infinite relational model (HIRM), a new probabilistic generative model for noisy, sparse, and heterogeneous relational data. Given a set of relations defined over a collection of domains, the model first infers multiple non-overlapping clusters of relations using a top-level Chinese restaurant process. Within each cluster of relations, a Dirichlet process mixture is then used to partition the domain entities and model the probability distribution of relation values. The HIRM generalizes the standard infinite relational model and can be used for a variety of data analysis tasks including dependence detection, clustering, and density estimation. We present new algorithms for fully Bayesian posterior inference via Gibbs sampling. We illustrate the efficacy of the method on a density estimation benchmark of twenty object-attribute datasets with up to 18 million cells and use it to discover relational structure in real-world datasets from politics and genomics.
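The Chinese restaurant process mentioned above can be sketched in a few lines of Python. This draws a random partition from the CRP prior only; it is not the HIRM model or its Gibbs sampler. The function name `sample_crp` and the concentration parameter name `alpha` are assumptions for illustration.

```python
import random
from typing import List


def sample_crp(n: int, alpha: float, rng: random.Random) -> List[int]:
    """Sample cluster labels for n items from a CRP with concentration alpha.

    Item i (0-indexed) joins existing cluster k with probability
    counts[k] / (i + alpha) and opens a new cluster with probability
    alpha / (i + alpha).
    """
    assignments: List[int] = []
    counts: List[int] = []
    for i in range(n):
        r = rng.uniform(0, i + alpha)
        for k, c in enumerate(counts):
            if r < c:
                break  # r landed in the mass of existing cluster k
            r -= c
        else:
            # r landed in the remaining mass alpha: open a new cluster
            k = len(counts)
            counts.append(0)
        counts[k] += 1
        assignments.append(k)
    return assignments


if __name__ == "__main__":
    print(sample_crp(20, alpha=1.0, rng=random.Random(0)))
```

Larger values of `alpha` tend to produce more clusters, while the "rich get richer" weighting by `counts[k]` favors a few large ones.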

10.8 · PL · Oct 7, 2020 · Code

SPPL: Probabilistic Programming with Fast Exact Symbolic Inference

Feras A. Saad, Martin C. Rinard, Vikash K. Mansinghka

We present the Sum-Product Probabilistic Language (SPPL), a new probabilistic programming language that automatically delivers exact solutions to a broad range of probabilistic inference queries. SPPL translates probabilistic programs into sum-product expressions, a new symbolic representation and associated semantic domain that extends standard sum-product networks to support mixed-type distributions, numeric transformations, logical formulas, and pointwise and set-valued constraints. We formalize SPPL via a novel translation strategy from probabilistic programs to sum-product expressions and give sound exact algorithms for conditioning on and computing probabilities of events. SPPL imposes a collection of restrictions on probabilistic programs to ensure they can be translated into sum-product expressions, which allow the system to leverage new techniques for improving the scalability of translation and inference by automatically exploiting probabilistic structure. We implement a prototype of SPPL with a modular architecture and evaluate it on benchmarks the system targets, showing that it obtains up to 3500x speedups over state-of-the-art symbolic systems on tasks such as verifying the fairness of decision tree classifiers, smoothing hidden Markov models, conditioning transformed random variables, and computing rare event probabilities.

2.3 · ST · Feb 26, 2019

A Family of Exact Goodness-of-Fit Tests for High-Dimensional Discrete Distributions

Feras A. Saad, Cameron E. Freer, Nathanael L. Ackerman et al.

The objective of goodness-of-fit testing is to assess whether a dataset of observations is likely to have been drawn from a candidate probability distribution. This paper presents a rank-based family of goodness-of-fit tests that is specialized to discrete distributions on high-dimensional domains. The test is readily implemented using a simulation-based, linear-time procedure. The testing procedure can be customized by the practitioner using knowledge of the underlying data domain. Unlike most existing test statistics, the proposed test statistic is distribution-free and its exact (non-asymptotic) sampling distribution is known in closed form. We establish consistency of the test against all alternatives by showing that the test statistic is distributed as a discrete uniform if and only if the samples were drawn from the candidate distribution. We illustrate its efficacy for assessing the sample quality of approximate sampling algorithms over combinatorially large spaces with intractable probabilities, including random partitions in Dirichlet process mixture models and random lattices in Ising models.

8.6 · ME · Oct 18, 2017 · Code

Temporally-Reweighted Chinese Restaurant Process Mixtures for Clustering, Imputing, and Forecasting Multivariate Time Series

Feras A. Saad, Vikash K. Mansinghka

This article proposes a Bayesian nonparametric method for forecasting, imputation, and clustering in sparsely observed, multivariate time series data. The method is appropriate for jointly modeling hundreds of time series with widely varying, non-stationary dynamics. Given a collection of $N$ time series, the Bayesian model first partitions them into independent clusters using a Chinese restaurant process prior. Within a cluster, all time series are modeled jointly using a novel "temporally-reweighted" extension of the Chinese restaurant process mixture. Markov chain Monte Carlo techniques are used to obtain samples from the posterior distribution, which are then used to form predictive inferences. We apply the technique to challenging forecasting and imputation tasks using seasonal flu data from the US Centers for Disease Control and Prevention, demonstrating superior forecasting accuracy and competitive imputation accuracy as compared to multiple widely used baselines. We further show that the model discovers interpretable clusters in datasets with hundreds of time series, using macroeconomic data from the Gapminder Foundation.

4.4 · AI · Apr 4, 2017

Probabilistic Search for Structured Data via Probabilistic Programming and Nonparametric Bayes

Feras Saad, Leonardo Casarsa, Vikash Mansinghka

Databases are widespread, yet extracting relevant data can be difficult. Without substantial domain knowledge, multivariate search queries often return sparse or uninformative results. This paper introduces an approach for searching structured data based on probabilistic programming and nonparametric Bayes. Users specify queries in a probabilistic language that combines standard SQL database search operators with an information theoretic ranking function called predictive relevance. Predictive relevance can be calculated by a fast sparse matrix algorithm based on posterior samples from CrossCat, a nonparametric Bayesian model for high-dimensional, heterogeneously-typed data tables. The result is a flexible search technique that applies to a broad class of information retrieval problems, which we integrate into BayesDB, a probabilistic programming platform for probabilistic data analysis. This paper demonstrates applications to databases of US colleges, global macroeconomic indicators of public health, and classic cars. We found that human evaluators often prefer the results from probabilistic search to results from a standard baseline.

5.5 · ML · Nov 21, 2016

Time Series Structure Discovery via Probabilistic Program Synthesis

Ulrich Schaechtle, Feras Saad, Alexey Radul et al.

There is a widespread need for techniques that can discover structure from time series data. Recently introduced techniques such as Automatic Bayesian Covariance Discovery (ABCD) provide a way to find structure within a single time series by searching through a space of covariance kernels that is generated using a simple grammar. While ABCD can identify a broad class of temporal patterns, it is difficult to extend and can be brittle in practice. This paper shows how to extend ABCD by formulating it in terms of probabilistic program synthesis. The key technical ideas are to (i) represent models using abstract syntax trees for a domain-specific probabilistic language, and (ii) represent the time series model prior, likelihood, and search strategy using probabilistic programs in a sufficiently expressive language. The final probabilistic program is written in under 70 lines of probabilistic code in Venture. The paper demonstrates an application to time series clustering that involves a non-parametric extension to ABCD, experiments for interpolation and extrapolation on real-world econometric data, and improvements in accuracy over both non-parametric and standard regression baselines.

7.1 · ML · Nov 5, 2016 · Code

Detecting Dependencies in Sparse, Multivariate Databases Using Probabilistic Programming and Non-parametric Bayes

Feras Saad, Vikash Mansinghka

Datasets with hundreds of variables and many missing values are commonplace. In this setting, it is both statistically and computationally challenging to detect true predictive relationships between variables and also to suppress false positives. This paper proposes an approach that combines probabilistic programming, information theory, and non-parametric Bayes. It shows how to use Bayesian non-parametric modeling to (i) build an ensemble of joint probability models for all the variables; (ii) efficiently detect marginal independencies; and (iii) estimate the conditional mutual information between arbitrary subsets of variables, subject to a broad class of constraints. Users can access these capabilities using BayesDB, a probabilistic programming platform for probabilistic data analysis, by writing queries in a simple, SQL-like language. This paper demonstrates empirically that the method can (i) detect context-specific (in)dependencies on challenging synthetic problems and (ii) yield improved sensitivity and specificity over baselines from statistics and machine learning, on a real-world database of over 300 sparsely observed indicators of macroeconomic development and public health.

11.3 · AI · Aug 18, 2016

Probabilistic Data Analysis with Probabilistic Programming

Feras Saad, Vikash Mansinghka

Probabilistic techniques are central to data analysis, but different approaches can be difficult to apply, combine, and compare. This paper introduces composable generative population models (CGPMs), a computational abstraction that extends directed graphical models and can be used to describe and compose a broad class of probabilistic data analysis techniques. Examples include hierarchical Bayesian models, multivariate kernel methods, discriminative machine learning, clustering algorithms, dimensionality reduction, and arbitrary probabilistic programs. We also demonstrate the integration of CGPMs into BayesDB, a probabilistic programming platform that can express data analysis tasks using a modeling language and a structured query language. The practical value is illustrated in two ways. First, CGPMs are used in an analysis that identifies satellite data records which probably violate Kepler's Third Law, by composing causal probabilistic programs with non-parametric Bayes in under 50 lines of probabilistic code. Second, for several representative data analysis tasks, we report on lines of code and accuracy measurements of various CGPMs, plus comparisons with standard baseline solutions from Python and MATLAB libraries.