Alex Dytso

IT
h-index: 17
Papers: 17
Citations: 160
Novelty: 48%
AI Score: 54

17 Papers

88.1 · IT · May 29
Functional uniqueness and stability of Gaussian priors in optimal L1 estimation

Leighton Barnes, Alex Dytso

We study when optimal Bayesian estimators under Gaussian noise are approximately linear, and what this implies about the underlying prior distribution. Consider the classical model \(Y = X + Z\), where \(Z\) is Gaussian and independent of \(X\). It is well known that under squared-error loss, the conditional mean \(\mathbb{E}[X|Y]\) is a linear function of \(Y\) if and only if the prior is Gaussian. Much less is understood under absolute-error loss, where the optimal estimator is the conditional median and standard orthogonality-based tools no longer apply. Recent work has established that, in the Gaussian noise model, the Gaussian prior is also the unique distribution that induces an exactly linear conditional median. In this paper, we move beyond exact characterizations and develop a quantitative stability theory: if the optimal estimator is approximately linear, must the prior be close to Gaussian? For the \(L_2\) setting, we derive explicit rates showing that near-linearity of the conditional mean forces the prior to be close to Gaussian in the Lévy metric. For the \(L_1\) setting, we develop a functional-analytic framework based on Hermite expansions and adjoint operators, establishing that approximate linearity of the conditional median implies proximity to the Gaussian family.
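
As a concrete instance of the exact-linearity statements above, consider the scalar Gaussian case. The variance symbols \(\sigma_X^2\) and \(\sigma_Z^2\) are notation assumed here for illustration and do not come from the abstract. If \(X \sim \mathcal{N}(0,\sigma_X^2)\) and \(Z \sim \mathcal{N}(0,\sigma_Z^2)\) are independent, the posterior of \(X\) given \(Y=y\) is Gaussian, so its mean and median coincide:

\[
\mathbb{E}[X \mid Y=y] \;=\; \operatorname{median}(X \mid Y=y) \;=\; \frac{\sigma_X^2}{\sigma_X^2+\sigma_Z^2}\, y .
\]

Both the \(L_2\)-optimal and the \(L_1\)-optimal estimators are therefore linear in \(y\) with the same slope in this case.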

33.8 · IT · Apr 16
Support Size of $\varepsilon$-Capacity-Achieving Inputs for the Amplitude-Constrained AWGN Channel

Luca Barletta, Alex Dytso

We study the amplitude-constrained additive white Gaussian noise (AWGN) channel from the perspective of near-optimal input distributions. While it is known that the capacity-achieving input is discrete with finitely many mass points, the precise scaling of its support size as a function of the amplitude constraint remains an open problem. In this work, we instead consider the minimal support size required to achieve capacity up to an $\varepsilon$-gap. We introduce the quantity $K_\varepsilon(A)$, defined as the smallest support size among discrete inputs supported on $[-A,A]$ that achieves mutual information within $\varepsilon$ of capacity. We show that this relaxed formulation is significantly more tractable and admits sharp characterizations across different regimes of $\varepsilon$. In particular, when $\varepsilon$ decays polynomially with $A$, i.e., $\varepsilon = A^{-β}$ for $β\geq 1$, we establish that $K_\varepsilon(A) = Θ(A\sqrt{\log A})$. For exponentially small gaps, we obtain bounds of order between $A\sqrt{\log A}$ and $A^{3/2}$. Our approach combines approximation-theoretic bounds for Gaussian mixtures with information-theoretic control of entropy via $χ^2$-divergence, together with a wrapping argument that relates the problem to approximating the uniform distribution on the circle. Beyond the technical results, our framework provides a conceptual explanation for the variety of scaling laws observed in prior numerical studies, showing that these correspond to different regimes of $\varepsilon$-optimality rather than intrinsic properties of the exact optimizer.

55.8 · IT · Mar 25
An Improved Lower Bound on Cardinality of Support of the Amplitude-Constrained AWGN Channel

Haiyang Wang, Luca Barletta, Alex Dytso

We study the amplitude-constrained additive white Gaussian noise channel. It is well known that the capacity-achieving input distribution for this channel is discrete and supported on finitely many points. The best known bounds show that the support size of the capacity-achieving distribution is lower-bounded by a term of order $A$ and upper-bounded by a term of order $A^2$, where $A$ denotes the amplitude constraint. It was conjectured in [1] that the linear scaling is optimal. In this work, we establish a new lower bound of order $A\sqrt{\log A}$, improving the known bound and ruling out the conjectured linear scaling. To obtain this result, we quantify the fact that the capacity-achieving output distribution is close to the uniform distribution in the interior of the amplitude constraint. Next, we introduce a wrapping operation that maps the problem to a compact domain and develop a theory of best approximation of the uniform distribution by finite Gaussian mixtures. These approximation bounds are then combined with stability properties of capacity-achieving distributions to yield the final support-size lower bound.

17.8 · IT · Apr 13
$α$-Mutual Information for the Gaussian Noise Channel

Mohammad Milanian, Alex Dytso, Martina Cardone

In this paper, we study Sibson's $α$-mutual information in the context of the additive Gaussian noise channel. While the classical case $α= 1$ is well understood and admits deep connections to estimation-theoretic quantities, such as the minimum mean-square error (MMSE) and Fisher information, many of the corresponding structural properties for general $α$ remain less explored. Our goal is to develop a systematic understanding of $α$-mutual information in the Gaussian noise setting and to identify which properties extend beyond the Shannon case. To this end, we establish several regularity properties, including finiteness conditions, continuity with respect to the signal-to-noise ratio (SNR) and the input distribution, and strict concavity/convexity properties that ensure uniqueness in associated optimization problems. A central contribution is the development of an $α$-I-MMSE relationship, generalizing the classical identity by relating the derivative of $α$-mutual information with respect to SNR to the MMSE evaluated under appropriately tilted distributions. This connection further leads to a generalized de Bruijn identity and new estimation-theoretic representations of Rényi entropy and differential Rényi entropy. We also characterize the low- and high-SNR behavior. In the low-SNR regime, the first-order behavior depends only on the input variance. In the high-SNR regime, for discrete inputs, $α$-mutual information converges to the Rényi entropy of order $1/α$, while for general inputs we connect it to $α$-information dimension. Overall, our results show that many fundamental relationships between information and estimation extend beyond the Shannon setting, in a form involving $α$-tilted distributions.

IT · Mar 3
Functional Properties of the Focal-Entropy

Jaimin Shah, Martina Cardone, Alex Dytso

The focal-loss has become a widely used alternative to cross-entropy in class-imbalanced classification problems, particularly in computer vision. Despite its empirical success, a systematic information-theoretic study of the focal-loss remains incomplete. In this work, we adopt a distributional viewpoint and study the focal-entropy, a focal-loss analogue of the cross-entropy. Our analysis establishes conditions for finiteness, convexity, and continuity of the focal-entropy, and provides various asymptotic characterizations. We prove the existence and uniqueness of the focal-entropy minimizer, describe its structure, and show that it can depart significantly from the data distribution. In particular, we rigorously show that the focal-loss amplifies mid-range probabilities, suppresses high-probability outcomes, and, under extreme class imbalance, induces an over-suppression regime in which very small probabilities are further diminished. These results, which are also experimentally validated, offer a theoretical foundation for understanding the focal-loss and clarify the trade-offs that it introduces when applied to imbalanced learning tasks.

47.0 · IT · May 12
An Improved Lower Bound on Support Size of Capacity-Achieving Inputs for the Binomial Channel: Extended version

Mohammadamin Baniasadi, Luca Barletta, Alex Dytso

We study the binomial channel and the structure of its capacity-achieving input and output distributions. It is known that the capacity-achieving input distribution is discrete and supported on finitely many points. The best previously known bounds show that the support size of the capacity-achieving distribution is lower-bounded by a term of order $\sqrt n$ and upper-bounded by a term of order $n/2$, where $n$ is the number of trials. In this work, we derive a new lower bound on the support size of order $\sqrt{n\log\log n}$, up to explicit constants. The proof consists of three main steps. First, we derive new upper and lower bounds on the capacity with a gap that vanishes as $n\to\infty$, which yields $C(n)=\frac12\log\frac{nπ}{2e}+o(1)$. Second, we show that the Beta-binomial output distribution induced by the reference input $X_r\sim\mathrm{Beta}(1/2,1/2)$ is asymptotically optimal: it approaches the capacity-achieving output distribution in relative entropy and, after a comparison step, in $χ^2$ divergence. Third, we prove a quantitative $χ^2$ approximation lower bound showing that this Beta-binomial output cannot be approximated too well by the output induced by a $K$-point input. Combining these ingredients forces the capacity-achieving input distribution to have at least order $\sqrt{n\log\log n}$ mass points.

16.8 · IT · May 8
Sub-Gaussian Concentration and Entropic Normality of the Maximum Likelihood Estimator

Leighton P. Barnes, Alex Dytso

It is well known that, under standard regularity conditions, the maximum likelihood estimator (MLE) satisfies a central limit theorem and converges in distribution to a Gaussian random variable as the sample size grows. This paper strengthens this classical result by developing several stronger forms of asymptotic normality for the normalized MLE. With additional assumptions on the score, we first establish sub-Gaussian tail bounds and convergence of all moments for the normalized estimation error. We then prove an entropic central limit theorem for a smoothed version of the estimator, showing convergence in relative entropy to the limiting Gaussian law. When the Fisher information of the normalized estimate is bounded, or its density has bounded first derivative, we further show that the smoothing can be removed, yielding entropic normality of the MLE itself. The proofs develop auxiliary tools that may be of independent interest, including exponential consistency bounds, high-moment estimates, and entropy-control arguments for the estimator.

LG · Jan 27, 2024
Data-Driven Estimation of the False Positive Rate of the Bayes Binary Classifier via Soft Labels

Minoh Jeong, Martina Cardone, Alex Dytso

Classification is a fundamental task in many applications on which data-driven methods have shown outstanding performance. However, it is challenging to determine whether such methods have achieved the optimal performance. This is mainly because the best achievable performance is typically unknown; hence, estimating it effectively is of prime importance. In this paper, we consider binary classification problems and we propose an estimator for the false positive rate (FPR) of the Bayes classifier, that is, the optimal classifier with respect to accuracy, from a given dataset. Our method utilizes soft labels, or real-valued labels, which are gaining significant traction thanks to their properties. We thoroughly examine various theoretical properties of our estimator, including its consistency, unbiasedness, rate of convergence, and variance. To enhance the versatility of our estimator beyond soft labels, we also consider noisy labels, which encompass binary labels. For noisy labels, we develop effective FPR estimators by leveraging a denoising technique and the Nadaraya-Watson estimator. Due to the symmetry of the problem, our results can be readily applied to estimate the false negative rate of the Bayes classifier.
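
The idea of estimating the Bayes classifier's FPR from soft labels can be illustrated with a simple plug-in sketch. It assumes each soft label is the posterior probability of the positive class for its sample, and it is not necessarily the estimator proposed in the paper. The function name `bayes_fpr_plugin` and the 0.5 threshold parameter are hypothetical names chosen for this example.

```python
def bayes_fpr_plugin(soft_labels, threshold=0.5):
    """Plug-in estimate of the Bayes classifier's false positive rate.

    Assumption (not from the abstract): soft_labels[i] = P(Y = 1 | x_i).
    The Bayes classifier predicts 1 when the soft label exceeds the
    threshold, so
        FPR = E[(1 - eta) * 1{eta > threshold}] / E[1 - eta].
    """
    for eta in soft_labels:
        if not 0.0 <= eta <= 1.0:
            raise ValueError("soft labels must lie in [0, 1]")
    # Total negative-class mass implied by the soft labels.
    negatives = sum(1.0 - eta for eta in soft_labels)
    if negatives == 0.0:
        raise ValueError("soft labels imply no negative-class mass")
    # Negative-class mass on samples the Bayes classifier labels positive.
    false_pos = sum(1.0 - eta for eta in soft_labels if eta > threshold)
    return false_pos / negatives


if __name__ == "__main__":
    print(bayes_fpr_plugin([0.9, 0.2, 0.6, 0.1]))
```

By the symmetry noted in the abstract, swapping the roles of the two classes (replacing each soft label by one minus itself) gives the analogous false negative rate sketch.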

ML · Feb 23, 2022
A Dimensionality Reduction Method for Finding Least Favorable Priors with a Focus on Bregman Divergence

Alex Dytso, Mario Goldenbaum, H. Vincent Poor et al.

A common way of characterizing minimax estimators in point estimation is by moving the problem into the Bayesian estimation domain and finding a least favorable prior distribution. The Bayesian estimator induced by a least favorable prior, under mild conditions, is then known to be minimax. However, finding least favorable distributions can be challenging due to inherent optimization over the space of probability distributions, which is infinite-dimensional. This paper develops a dimensionality reduction method that allows us to move the optimization to a finite-dimensional setting with an explicit bound on the dimension. The benefit of this dimensionality reduction is that it permits the use of popular algorithms such as projected gradient ascent to find least favorable priors. Throughout the paper, in order to make progress on the problem, we restrict ourselves to Bayesian risks induced by a relatively large class of loss functions, namely Bregman divergences.

IT · Feb 4, 2022
Improved Information Theoretic Generalization Bounds for Distributed and Federated Learning

L. P. Barnes, Alex Dytso, H. V. Poor

We consider information-theoretic bounds on expected generalization error for statistical learning problems in a networked setting. In this setting, there are $K$ nodes, each with its own independent dataset, and the models from each node have to be aggregated into a final centralized model. We consider both simple averaging of the models as well as more complicated multi-round algorithms. We give upper bounds on the expected generalization error for a variety of problems, such as those with Bregman divergence or Lipschitz continuous losses, that demonstrate an improved dependence of $1/K$ on the number of nodes. These "per node" bounds are in terms of the mutual information between the training dataset and the trained weights at each node, and are therefore useful in describing the generalization properties inherent to having communication or privacy constraints at each node.

IT · May 3, 2021
Consistent Density Estimation Under Discrete Mixture Models

Luc Devroye, Alex Dytso

This work considers the problem of estimating a mixing probability density $f$ in the setting of discrete mixture models. The paper consists of three parts. The first part focuses on the construction of an $L_1$ consistent estimator of $f$. In particular, under the assumptions that the probability measure $μ$ of the observation is atomic, and the map from $f$ to $μ$ is bijective, it is shown that there exists an estimator $f_n$ such that, for every density $f$, $\lim_{n\to \infty} \mathbb{E} \left[ \int |f_n -f | \right]=0$. The second part discusses the implementation details. Specifically, it is shown that the consistency for every $f$ can be attained with a computationally feasible estimator. The third part, as a case study, considers a Poisson mixture model. In particular, it is shown that in the Poisson noise setting, the bijection condition holds and, hence, estimation can be performed consistently for every $f$.

IT · Apr 5, 2021
A General Derivative Identity for the Conditional Mean Estimator in Gaussian Noise and Some Applications

Alex Dytso, H. Vincent Poor, Shlomo Shamai

Consider a channel ${\bf Y}={\bf X}+ {\bf N}$ where ${\bf X}$ is an $n$-dimensional random vector, and ${\bf N}$ is a Gaussian vector with a covariance matrix ${\bf \mathsf{K}}_{\bf N}$. The object under consideration in this paper is the conditional mean of ${\bf X}$ given ${\bf Y}={\bf y}$, that is ${\bf y} \to E[{\bf X}|{\bf Y}={\bf y}]$. Several identities in the literature connect $E[{\bf X}|{\bf Y}={\bf y}]$ to other quantities such as the conditional variance, score functions, and higher-order conditional moments. The objective of this paper is to provide a unifying view of these identities. In the first part of the paper, a general derivative identity for the conditional mean is derived. Specifically, for the Markov chain ${\bf U} \leftrightarrow {\bf X} \leftrightarrow {\bf Y}$, it is shown that the Jacobian of $E[{\bf U}|{\bf Y}={\bf y}]$ is given by ${\bf \mathsf{K}}_{\bf N}^{-1} {\bf Cov} ( {\bf X}, {\bf U} | {\bf Y}={\bf y})$. In the second part of the paper, via various choices of ${\bf U}$, the new identity is used to generalize many of the known identities and derive some new ones. First, a simple proof of the Hatsell and Nolte identity for the conditional variance is shown. Second, a simple proof of the recursive identity due to Jaffer is provided. Third, a new connection between the conditional cumulants and the conditional expectation is shown. In particular, it is shown that the $k$-th derivative of $E[X|Y=y]$ is the $(k+1)$-th conditional cumulant. The third part of the paper considers some applications. In a first application, the power series and the compositional inverse of $E[X|Y=y]$ are derived. In a second application, the distribution of the estimation error $(X-E[X|Y])$ is derived. In a third application, we construct consistent estimators (empirical Bayes estimators) of the conditional cumulants from an i.i.d. sequence $Y_1,\ldots,Y_n$.
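
The scalar special case of the derivative identity (taking $U = X$ and noise variance $\sigma^2$) says that the derivative of $E[X|Y=y]$ equals $\mathrm{Var}(X|Y=y)/\sigma^2$. The following is a minimal numerical sketch of that special case for a discrete prior. The function names, the three-point prior, and the finite-difference step are illustrative assumptions and are not taken from the paper.

```python
import math


def posterior_mean_var(y, points, probs, sigma2):
    """Mean and variance of a discrete X given Y = y, for Y = X + N, N ~ N(0, sigma2)."""
    # Log of the unnormalized posterior weights p(x) * exp(-(y - x)^2 / (2 * sigma2)).
    log_w = [math.log(p) - (y - x) ** 2 / (2.0 * sigma2)
             for x, p in zip(points, probs)]
    shift = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    mean = sum(wi * x for wi, x in zip(w, points)) / total
    second = sum(wi * x * x for wi, x in zip(w, points)) / total
    return mean, second - mean * mean


def derivative_check(y, points, probs, sigma2, h=1e-5):
    """Return (central-difference derivative of E[X|Y=y], Var(X|Y=y) / sigma2)."""
    up, _ = posterior_mean_var(y + h, points, probs, sigma2)
    down, _ = posterior_mean_var(y - h, points, probs, sigma2)
    _, var = posterior_mean_var(y, points, probs, sigma2)
    return (up - down) / (2.0 * h), var / sigma2


if __name__ == "__main__":
    # Hypothetical three-point prior, chosen only for illustration.
    numeric, predicted = derivative_check(0.4, [-1.0, 0.5, 2.0], [0.3, 0.5, 0.2], 0.7)
    print(numeric, predicted)  # the two values should agree closely
```

For an equiprobable prior on $\{-1,+1\}$ with $\sigma^2 = 1$, the same routine reproduces the familiar $E[X|Y=y] = \tanh(y)$.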

IT · May 7, 2020
Nonparametric Estimation of the Fisher Information and Its Applications

Wei Cao, Alex Dytso, Michael Fauß et al.

This paper considers the problem of estimation of the Fisher information for location from a random sample of size $n$. First, an estimator proposed by Bhattacharya is revisited and improved convergence rates are derived. Second, a new estimator, termed a clipped estimator, is proposed. Superior upper bounds on the rates of convergence can be shown for the new estimator compared to the Bhattacharya estimator, albeit with different regularity conditions. Third, both of the estimators are evaluated for the practically relevant case of a random variable contaminated by Gaussian noise. Moreover, using Brown's identity, which relates the Fisher information and the minimum mean squared error (MMSE) in Gaussian noise, two corresponding consistent estimators for the MMSE are proposed. Simulation examples for the Bhattacharya estimator and the clipped estimator as well as the MMSE estimators are presented. The examples demonstrate that the clipped estimator can significantly reduce the required sample size to guarantee a specific confidence interval compared to the Bhattacharya estimator.
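
For reference, Brown's identity can be written as follows under an assumed parametrization that may differ from the paper's: $Y = X + N$ with $N \sim \mathcal{N}(0,\sigma^2)$ independent of $X$, $J(Y)$ the Fisher information for location of $Y$, and $\mathrm{mmse}(X|Y) = E[(X - E[X|Y])^2]$.

$$
J(Y) \;=\; \frac{\sigma^2 - \mathrm{mmse}(X|Y)}{\sigma^4}
\qquad\Longleftrightarrow\qquad
\mathrm{mmse}(X|Y) \;=\; \sigma^2 - \sigma^4\, J(Y).
$$

Under this form, any estimate $\hat{J}$ of $J(Y)$ yields a plug-in MMSE estimate $\sigma^2 - \sigma^4 \hat{J}$. As a sanity check, a Gaussian $X$ with variance $s$ gives $J(Y) = 1/(s+\sigma^2)$ and hence $\mathrm{mmse}(X|Y) = s\sigma^2/(s+\sigma^2)$.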

LG · May 5, 2020
Information-Theoretic Bounds on the Generalization Error and Privacy Leakage in Federated Learning

Semih Yagli, Alex Dytso, H. Vincent Poor

Machine learning algorithms operating on mobile networks can be grouped into three categories. First is the classical situation in which the end-user devices send their data to a central server where this data is used to train a model. Second is the distributed setting in which each device trains its own model and sends its model parameters to a central server where these model parameters are aggregated to create one final model. Third is the federated learning setting in which, at any given time $t$, a certain number of active end users train with their own local data along with feedback provided by the central server and then send their newly estimated model parameters to the central server. The server, then, aggregates these new parameters, updates its own model, and feeds the updated parameters back to all the end users, continuing this process until it converges. The main objective of this work is to provide an information-theoretic framework for all of the aforementioned learning paradigms. Moreover, using the provided framework, we develop upper and lower bounds on the generalization error together with bounds on the privacy leakage in the classical, distributed and federated learning settings. Keywords: Federated Learning, Distributed Learning, Machine Learning, Model Aggregation.

IT · Mar 19, 2020
The Vector Poisson Channel: On the Linearity of the Conditional Mean Estimator

Alex Dytso, Michael Fauss, H. Vincent Poor

This work studies properties of the conditional mean estimator in vector Poisson noise. The main emphasis is to study conditions on prior distributions that induce linearity of the conditional mean estimator. The paper consists of two main results. The first result shows that the only distribution that induces the linearity of the conditional mean estimator is a product gamma distribution. Moreover, it is shown that the conditional mean estimator cannot be linear when the dark current parameter of the Poisson noise is non-zero. The second result produces a quantitative refinement of the first result. Specifically, it is shown that if the conditional mean estimator is close to linear in a mean squared error sense, then the prior distribution must be close to a product gamma distribution in terms of their characteristic functions. Finally, the results are compared to their Gaussian counterparts.
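
The scalar case without dark current gives a quick illustration of the first result. The shape and rate symbols $\alpha$ and $\beta$, and the unit channel scaling, are assumptions made here for the example. If $Y \mid X = x \sim \mathrm{Poisson}(x)$ and $X$ has a gamma density proportional to $x^{\alpha-1} e^{-\beta x}$ on $x > 0$, then the posterior density is proportional to $x^{y+\alpha-1} e^{-(\beta+1)x}$, which is again a gamma density, and

$$
E[X \mid Y = y] \;=\; \frac{y + \alpha}{\beta + 1}, \qquad y = 0, 1, 2, \ldots
$$

so the conditional mean is an affine function of the observation $y$.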

LG · Feb 26, 2018
A Differential Privacy Mechanism Design Under Matrix-Valued Query

Thee Chanyaswad, Alex Dytso, H. Vincent Poor et al.

Traditionally, differential privacy mechanism design has been tailored for a scalar-valued query function. Although many mechanisms such as the Laplace and Gaussian mechanisms can be extended to a matrix-valued query function by adding i.i.d. noise to each element of the matrix, this method is often sub-optimal as it forfeits an opportunity to exploit the structural characteristics typically associated with matrix analysis. In this work, we consider the design of a differential privacy mechanism specifically for a matrix-valued query function. The proposed solution is to utilize matrix-variate noise, as opposed to the traditional scalar-valued noise. In particular, we propose a novel differential privacy mechanism called the Matrix-Variate Gaussian (MVG) mechanism, which adds matrix-valued noise drawn from a matrix-variate Gaussian distribution. We prove that the MVG mechanism preserves $(ε,δ)$-differential privacy, and show that it allows the structural characteristics of the matrix-valued query function to naturally be exploited. Furthermore, due to the multi-dimensional nature of the MVG mechanism and the matrix-valued query, we introduce the concept of directional noise, which can be utilized to mitigate the impact the noise has on the utility of the query. Finally, we demonstrate the performance of the MVG mechanism and the advantages of directional noise using three matrix-valued queries on three privacy-sensitive datasets. We find that the MVG mechanism notably outperforms four previous state-of-the-art approaches, and provides comparable utility to the non-private baseline. Our work thus presents a promising prospect for both future research and implementation of differential privacy for matrix-valued query functions.

CR · Jan 2, 2018
MVG Mechanism: Differential Privacy under Matrix-Valued Query

Thee Chanyaswad, Alex Dytso, H. Vincent Poor et al.

Differential privacy mechanism design has traditionally been tailored for a scalar-valued query function. Although many mechanisms such as the Laplace and Gaussian mechanisms can be extended to a matrix-valued query function by adding i.i.d. noise to each element of the matrix, this method is often suboptimal as it forfeits an opportunity to exploit the structural characteristics typically associated with matrix analysis. To address this challenge, we propose a novel differential privacy mechanism called the Matrix-Variate Gaussian (MVG) mechanism, which adds a matrix-valued noise drawn from a matrix-variate Gaussian distribution, and we rigorously prove that the MVG mechanism preserves $(ε,δ)$-differential privacy. Furthermore, we introduce the concept of directional noise made possible by the design of the MVG mechanism. Directional noise allows the impact of the noise on the utility of the matrix-valued query function to be moderated. Finally, we experimentally demonstrate the performance of our mechanism using three matrix-valued queries on three privacy-sensitive datasets. We find that the MVG mechanism notably outperforms four previous state-of-the-art approaches, and provides comparable utility to the non-private baseline.