Andrew Thangaraj

IT
h-index38
5papers
24citations
Novelty54%
AI Score45

5 Papers

ITApr 21
Finite-Length Empirical Comparison of Polar, PAC, and Invertible-Extractor Secrecy Codes over the Wiretap BSC

Jaswanthi Mandalapu, Andrew Thangaraj

We compare three secrecy-coding schemes for the degraded wiretap binary symmetric channel (BSC) in the finite-blocklength regime: (i) polar wiretap coset codes, (ii) PAC codes used as wiretap coset codes, and (iii) the invertible-extractor (IE) framework of Bellare-Tessaro. Our comparison is empirical and uses a common semantic-secrecy metric (distinguishing advantage). For polar coset codes, we compute Eve's polarized bit-channel capacities (via the Tal-Vardy construction) to obtain explicit finite-length upper bounds on mutual-information leakage, yielding strong secrecy bounds. For PAC coset codes, we prove that Eve's synthesized bit-channels are equivalent to those of polar codes (up to a permutation), so the same leakage bounds apply; we then convert these strong-secrecy bounds into semantic-secrecy guarantees for symmetric wiretap channels. For the IE scheme, we use the closed-form semantic-secrecy bounds given in the reference work. Finally, we report finite-length results that jointly characterize (a) semantic-secrecy guarantees against Eve and (b) frame-error-rate performance at Bob, illustrating that PAC codes can significantly improve reliability without changing the secrecy bounds inherited from polar coding. Moreover, under the finite-length bounds considered in this work, polar/PAC secrecy codes provide tighter security guarantees than the invertible-extractor framework.

ITMay 8
Semantic Smoothing for Language Models via Distribution Estimation and Embeddings

Haricharan Balasundaram, Swathi Shree Narashiman, Pranay Mathur et al.

We propose semantic smoothing, a smoothing method for language models that uses embeddings to share statistical observations across semantically similar contexts. The starting point is a decomposition of log-perplexity that motivates smoothing as a collection of distribution-estimation problems under Kullback-Leibler (KL) loss. We then show that, under a Lipschitz-logit model for embedding-based language generation, proximity of context embeddings implies proximity of the corresponding next-word distributions in KL divergence. Combining these observations, we formulate semantic smoothing as distribution estimation in KL loss with KL-proximity side information. For $n$ samples on a $d$-symbol alphabet with a side-information distribution at KL distance $Δ$, we give an interpolation estimator with worst-case KL risk $O(\min\{Δ,d/n\})$, and prove a matching-order lower bound for uniform side information. We extend the estimator to multiple and empirically estimated synonymous distributions. Experiments on synthetic Markov data and WikiText-103 bigram models using Word2Vec, GloVe, and GPT-2 embeddings show that semantic smoothing consistently reduces test perplexity when applied to add-constant and Kneser-Ney estimates.

LGDec 23, 2024
Rate of Model Collapse in Recursive Training

Ananda Theertha Suresh, Andrew Thangaraj, Aditya Nanda Kishore Khandavally

Given the ease of creating synthetic data from machine learning models, new models can be potentially trained on synthetic data generated by previous models. This recursive training process raises concerns about the long-term impact on model quality. As models are recursively trained on generated data from previous rounds, their ability to capture the nuances of the original human-generated data may degrade. This is often referred to as \emph{model collapse}. In this work, we ask how fast model collapse occurs for some well-studied distribution families under maximum likelihood (ML or near ML) estimation during recursive training. Surprisingly, even for fundamental distributions such as discrete and Gaussian distributions, the exact rate of model collapse is unknown. In this work, we theoretically characterize the rate of collapse in these fundamental settings and complement it with experimental evaluations. Our results show that for discrete distributions, the time to forget a word is approximately linearly dependent on the number of times it occurred in the original corpus, and for Gaussian models, the standard deviation reduces to zero roughly at $n$ iterations, where $n$ is the number of samples at each iteration. Both of these findings imply that model forgetting, at least in these simple distributions under near ML estimation with many samples, takes a long time.

MLApr 8, 2024
Just Wing It: Near-Optimal Estimation of Missing Mass in a Markovian Sequence

Ashwin Pananjady, Vidya Muthukumar, Andrew Thangaraj

We study the problem of estimating the stationary mass -- also called the unigram mass -- that is missing from a single trajectory of a discrete-time, ergodic Markov chain. This problem has several applications -- for example, estimating the stationary missing mass is critical for accurately smoothing probability estimates in sequence models. While the classical Good--Turing estimator from the 1950s has appealing properties for i.i.d. data, it is known to be biased in the Markovian setting, and other heuristic estimators do not come equipped with guarantees. Operating in the general setting in which the size of the state space may be much larger than the length $n$ of the trajectory, we develop a linear-runtime estimator called Windowed Good--Turing (WingIt) and show that its risk decays as $\widetilde{O}(\mathsf{T_{mix}}/n)$, where $\mathsf{T_{mix}}$ denotes the mixing time of the chain in total variation distance. Notably, this rate is independent of the size of the state space and minimax-optimal up to a logarithmic factor in $n / \mathsf{T_{mix}}$. We also present an upper bound on the variance of the missing mass random variable, which may be of independent interest. We extend our estimator to approximate the stationary mass placed on elements occurring with small frequency in the trajectory. Finally, we demonstrate the efficacy of our estimators both in simulations on canonical chains and on sequences constructed from natural language text.

ITFeb 3, 2021
How good is Good-Turing for Markov samples?

Prafulla Chandra, Andrew Thangaraj, Nived Rajaraman

The Good-Turing (GT) estimator for the missing mass (i.e., total probability of missing symbols) in $n$ samples is the number of symbols that appeared exactly once divided by $n$. For i.i.d. samples, the bias and squared-error risk of the GT estimator can be shown to fall as $1/n$ by bounding the expected error uniformly over all symbols. In this work, we study convergence of the GT estimator for missing stationary mass (i.e., total stationary probability of missing symbols) of Markov samples on an alphabet $\mathcal{X}$ with stationary distribution $[π_x:x \in \mathcal{X}]$ and transition probability matrix (t.p.m.) $P$. This is an important and interesting problem because GT is widely used in applications with temporal dependencies such as language models assigning probabilities to word sequences, which are modelled as Markov. We show that convergence of GT depends on convergence of $(P^{\sim x})^n$, where $P^{\sim x}$ is $P$ with the $x$-th column zeroed out. This, in turn, depends on the Perron eigenvalue $λ^{\sim x}$ of $P^{\sim x}$ and its relationship with $π_x$ uniformly over $x$. For randomly generated t.p.ms and t.p.ms derived from New York Times and Charles Dickens corpora, we numerically exhibit such uniform-over-$x$ relationships between $λ^{\sim x}$ and $π_x$. This supports the observed success of GT in language models and practical text data scenarios. For Markov chains with rank-2, diagonalizable t.p.ms having spectral gap $β$, we show minimax rate upper and lower bounds of $1/(nβ^5)$ and $1/(nβ)$, respectively, for the estimation of stationary missing mass. This theoretical result extends the $1/n$ minimax rate for i.i.d. or rank-1 t.p.ms to rank-2 Markov, and is a first such minimax rate result for missing mass of Markov samples.