Anant Sahai

LG
h-index39
18papers
427citations
Novelty42%
AI Score50

18 Papers

ITJan 23, 2017
Real-time Cooperative Communication for Automation over Wireless

Vasuki Narasimha Swamy, Sahaana Suri, Paul Rigge et al. · berkeley

High-performance industrial automation systems rely on tens of simultaneously active sensors and actuators and have stringent communication latency and reliability requirements. Current wireless technologies like WiFi, Bluetooth, and LTE are unable to meet these requirements, forcing the use of wired communication in industrial control systems. This paper introduces a wireless communication protocol that capitalizes on multiuser diversity and cooperative communication to achieve the ultra-reliability with a low-latency constraint. Our protocol is analyzed using the communication-theoretic delay-limited-capacity framework and compared to baseline schemes that primarily exploit frequency diversity. For a scenario inspired by an industrial printing application with thirty nodes in the control loop, 20B messages transmitted between pairs of nodes and a cycle time of $2$ ms, an idealized protocol can achieve a cycle failure probability (probability that any packet in a cycle is not successfully delivered) lower than $10^{-9}$ with nominal SNR below 5 dB in a 20MHz wide channel.

ITOct 23, 2010
Implicit and explicit communication in decentralized control

Pulkit Grover, Anant Sahai · berkeley

There has been substantial progress recently in understanding toy problems of purely implicit signaling. These are problems where the source and the channel are implicit -- the message is generated endogenously by the system, and the plant itself is used as a channel. In this paper, we explore how implicit and explicit communication can be used synergistically to reduce control costs. The setting is an extension of Witsenhausen's counterexample where a rate-limited external channel connects the two controllers. Using a semi-deterministic version of the problem, we arrive at a binning-based strategy that can outperform the best known strategies by an arbitrarily large factor. We also show that our binning-based strategy attains within a constant factor of the optimal cost for an asymptotically infinite-length version of the problem uniformly over all problem parameters and all rates on the external channel. For the scalar case, although our results yield approximate optimality for each fixed rate, we are unable to prove approximately-optimality uniformly over all rates.

LGJun 3, 2022
Generalization for multiclass classification with overparameterized linear models

Vignesh Subramanian, Rahul Arya, Anant Sahai · berkeley

Via an overparameterized linear model with Gaussian features, we provide conditions for good generalization for multiclass classification of minimum-norm interpolating solutions in an asymptotic setting where both the number of underlying features and the number of classes scale with the number of training points. The survival/contamination analysis framework for understanding the behavior of overparameterized learning problems is adapted to this setting, revealing that multiclass classification qualitatively behaves like binary classification in that, as long as there are not too many classes (made precise in the paper), it is possible to generalize well even in some settings where the corresponding regression tasks would not generalize. Besides various technical challenges, it turns out that the key difference from the binary classification setting is that there are relatively fewer positive training examples of each class in the multiclass setting as the number of classes increases, making the multiclass problem "harder" than the binary one.

ITSep 13, 2010
Is Witsenhausen's counterexample a relevant toy?

Pulkit Grover, Anant Sahai · berkeley

This paper answers a question raised by Doyle on the relevance of the Witsenhausen counterexample as a toy decentralized control problem. The question has two sides, the first of which focuses on the lack of an external channel in the counterexample. Using existing results, we argue that the core difficulty in the counterexample is retained even in the presence of such a channel. The second side questions the LQG formulation of the counterexample. We consider alternative formulations and show that the understanding developed for the LQG case guides the investigation for these other cases as well. Specifically, we consider 1) a variation on the original counterexample with general, but bounded, noise distributions, and 2) an adversarial extension with bounded disturbance and quadratic costs. For each of these formulations, we show that quantization-based nonlinear strategies outperform linear strategies by an arbitrarily large factor. Further, these nonlinear strategies also perform within a constant factor of the optimal, uniformly over all possible parameter choices (for fixed noise distributions in the Bayesian case). Fortuitously, the assumption of bounded noise results in a significant simplification of proofs as compared to those for the LQG formulation. Therefore, the results in this paper are also of pedagogical interest.

LGJun 23, 2023
Precise Asymptotic Generalization for Multiclass Classification with Overparameterized Linear Models

David X. Wu, Anant Sahai · berkeley

We study the asymptotic generalization of an overparameterized linear model for multiclass classification under the Gaussian covariates bi-level model introduced in Subramanian et al.~'22, where the number of data points, features, and classes all grow together. We fully resolve the conjecture posed in Subramanian et al.~'22, matching the predicted regimes for generalization. Furthermore, our new lower bounds are akin to an information-theoretic strong converse: they establish that the misclassification rate goes to 0 or 1 asymptotically. One surprising consequence of our tight results is that the min-norm interpolating classifier can be asymptotically suboptimal relative to noninterpolating classifiers in the regime where the min-norm interpolating regressor is known to be optimal. The key to our tight analysis is a new variant of the Hanson-Wright inequality which is broadly useful for multiclass problems with sparse labels. As an application, we show that the same type of analysis can be used to analyze the related multilabel classification problem under the same bi-level ensemble.

ITJan 16, 2017
Control Capacity

Gireeja Ranade, Anant Sahai · berkeley

Feedback control actively dissipates uncertainty from a dynamical system by means of actuation. We develop a notion of "control capacity" that gives a fundamental limit (in bits) on the rate at which a controller can dissipate the uncertainty from a system, i.e. stabilize to a known fixed point. We give a computable single-letter characterization of control capacity for memoryless stationary scalar multiplicative actuation channels. Control capacity allows us to answer questions of stabilizability for scalar linear systems: a system with actuation uncertainty is stabilizable if and only if the control capacity is larger than the log of the unstable open-loop eigenvalue. For second-moment senses of stability, we recover the classic uncertainty threshold principle result. However, our definition of control capacity can quantify the stabilizability limits for any moment of stability. Our formulation parallels the notion of Shannon's communication capacity, and thus yields both a strong converse and a way to compute the value of side-information in control. The results in our paper are motivated by bit-level models for control that build on the deterministic models that are widely used to understand information flows in wireless network information theory.

AIDec 2, 2025
Synthetic Error Injection Fails to Elicit Self-Correction In Language Models

David X. Wu, Shreyas Kapur, Anant Sahai et al.

Reinforcement learning has become the dominant paradigm for eliciting reasoning and self-correction capabilities in large language models, but its computational expense motivates exploration of alternatives. Inspired by techniques from autonomous driving and robotics, we investigate whether supervised learning with synthetic error injection can induce self-correction abilities in language models. Our approach inserts artificial errors into reasoning chains, masks them, and supervises the model to recognize and correct these mistakes. Despite the intuitive appeal of this method, we find that it fails to significantly improve performance even on simple synthetic tasks across multiple models. Moreover, even when the model catches its own error, it often parrots the original mistake. We find that the distribution shift of synthetic errors to on-policy errors significantly degrades the error-correction capabilities of the fine-tuned model, even with good synthetic coverage of on-policy errors. Our results help explain why on-policy reinforcement learning methods have proven uniquely effective for eliciting self-correction.

STAT-MECHMay 18
The Thermodynamic Costs of Simple Linear Regression

Samuel H. D'Ambrosia, Sultan M. Daniels, Michael R. DeWeese et al.

The construction of models from data is a significant contributor to the energetic costs of computation. Because of this, understanding how foundational thermodynamic bounds apply to modeling algorithms will be increasingly important. Here, we study the thermodynamic costs of a basic and fundamental modeling algorithm: simple linear regression. Following Landauer, we approximate the thermodynamic lower bound on irreversibly performing both exact linear regression and linear regression via stochastic gradient descent as implemented on floating-point numbers. From this, we derive energycost aware scaling laws for the optimal dataset size for training a linear regression model given a generalization error dependent demand for inference. Additionally, we discuss a method to lower bound the entropy production from the mismatch cost for algorithms with continuous input variables.

ITMay 12
The Entropy of Floating-Point Numbers

Sultan Daniels, Samuel H. D'Ambrosia, Michael R. DeWeese et al.

Here we present an analytic approximation for the entropy of floating-point numbers, along with bounds on the error of this approximation. It is well-known that the differential entropy is tightly linked to the discrete entropy of a uniformly quantized random variable. Our approximation uncovers a different quantity that provides this link for floating-point quantization. Additionally, we prove that the entropy of a floating-point quantized random variable is approximately unchanged under scaling. Closed-form expressions for the floating-point entropy of common distributions are provided and compared to exact results.

LGJul 27, 2024
Polynomial Regression as a Task for Understanding In-context Learning Through Finetuning and Alignment

Max Wilcoxson, Morten Svendgård, Ria Doshi et al.

Simple function classes have emerged as toy problems to better understand in-context-learning in transformer-based architectures used for large language models. But previously proposed simple function classes like linear regression or multi-layer-perceptrons lack the structure required to explore things like prompting and alignment within models capable of in-context-learning. We propose univariate polynomial regression as a function class that is just rich enough to study prompting and alignment, while allowing us to visualize and understand what is going on clearly.

LGJul 2, 2025
Decomposing Prediction Mechanisms for In-Context Recall

Sultan Daniels, Dylan Davis, Dhruv Gautam et al. · berkeley

We introduce a new family of toy problems that combine features of linear-regression-style continuous in-context learning (ICL) with discrete associative recall. We pretrain transformer models on sample traces from this toy, specifically symbolically-labeled interleaved state observations from randomly drawn linear deterministic dynamical systems. We study if the transformer models can recall the state of a sequence previously seen in its context when prompted to do so with the corresponding in-context label. Taking a closer look at this task, it becomes clear that the model must perform two functions: (1) identify which system's state should be recalled and apply that system to its last seen state, and (2) continuing to apply the correct system to predict the subsequent states. Training dynamics reveal that the first capability emerges well into a model's training. Surprisingly, the second capability, of continuing the prediction of a resumed sequence, develops much earlier. Via out-of-distribution experiments, and a mechanistic analysis on model weights via edge pruning, we find that next-token prediction for this toy problem involves at least two separate mechanisms. One mechanism uses the discrete symbolic labels to do the associative recall required to predict the start of a resumption of a previously seen sequence. The second mechanism, which is largely agnostic to the discrete symbolic labels, performs a "Bayesian-style" prediction based on the previous token and the context. These two mechanisms have different learning dynamics. To confirm that this multi-mechanism (manifesting as separate phase transitions) phenomenon is not just an artifact of our toy setting, we used OLMo training checkpoints on an ICL translation task to see a similar phenomenon: a decisive gap in the emergence of first-task-token performance vs second-task-token performance.

LGNov 6, 2024
Can Custom Models Learn In-Context? An Exploration of Hybrid Architecture Performance on In-Context Learning Tasks

Ryan Campbell, Nelson Lojo, Kesava Viswanadha et al.

In-Context Learning (ICL) is a phenomenon where task learning occurs through a prompt sequence without the necessity of parameter updates. ICL in Multi-Headed Attention (MHA) with absolute positional embedding has been the focus of more study than other sequence model varieties. We examine implications of architectural differences between GPT-2 and LLaMa as well as LlaMa and Mamba. We extend work done by Garg et al. (2022) and Park et al. (2024) to GPT-2/LLaMa hybrid and LLaMa/Mamba hybrid models - examining the interplay between sequence transformation blocks and regressive performance in-context. We note that certain architectural changes cause degraded training efficiency/ICL accuracy by converging to suboptimal predictors or converging slower. We also find certain hybrids showing optimistic performance improvements, informing potential future ICL-focused architecture modifications. Additionally, we propose the "ICL regression score", a scalar metric describing a model's whole performance on a specific task. Compute limitations impose restrictions on our architecture-space, training duration, number of training runs, function class complexity, and benchmark complexity. To foster reproducible and extensible research, we provide a typed, modular, and extensible Python package on which we run all experiments.

LGSep 27, 2021
Classification and Adversarial examples in an Overparameterized Linear Model: A Signal Processing Perspective

Adhyyan Narang, Vidya Muthukumar, Anant Sahai

State-of-the-art deep learning classifiers are heavily overparameterized with respect to the amount of training examples and observed to generalize well on "clean" data, but be highly susceptible to infinitesmal adversarial perturbations. In this paper, we identify an overparameterized linear ensemble, that uses the "lifted" Fourier feature map, that demonstrates both of these behaviors. The input is one-dimensional, and the adversary is only allowed to perturb these inputs and not the non-linear features directly. We find that the learned model is susceptible to adversaries in an intermediate regime where classification generalizes but regression does not. Notably, the susceptibility arises despite the absence of model mis-specification or label noise, which are commonly cited reasons for adversarial-susceptibility. These results are extended theoretically to a random-Fourier-sum setup that exhibits double-descent behavior. In both feature-setups, the adversarial vulnerability arises because of a phenomenon we term spatial localization: the predictions of the learned model are markedly more sensitive in the vicinity of training points than elsewhere. This sensitivity is a consequence of feature lifting and is reminiscent of Gibb's and Runge's phenomena from signal processing and functional analysis. Despite the adversarial susceptibility, we find that classification with these features can be easier than the more commonly studied "independent feature" models.

GTDec 3, 2020
On the Impossibility of Convergence of Mixed Strategies with No Regret Learning

Vidya Muthukumar, Soham Phade, Anant Sahai

We study the limiting behavior of the mixed strategies that result from optimal no-regret learning strategies in a repeated game setting where the stage game is any 2 by 2 competitive game. We consider optimal no-regret algorithms that are mean-based and monotonic in their argument. We show that for any such algorithm, the limiting mixed strategies of the players cannot converge almost surely to any Nash equilibrium. This negative result is also shown to hold under a broad relaxation of these assumptions, including popular variants of Online-Mirror-Descent with optimism and/or adaptive step-sizes. Finally, we conjecture that the monotonicity assumption can be removed, and provide partial evidence for this conjecture. Our results identify the inherent stochasticity in players' realizations as a critical factor underlying this divergence in outcomes between using the opponent's mixtures and realizations to make updates.

LGMay 16, 2020
Classification vs regression in overparameterized regimes: Does the loss function matter?

Vidya Muthukumar, Adhyyan Narang, Vignesh Subramanian et al.

We compare classification and regression tasks in an overparameterized linear model with Gaussian features. On the one hand, we show that with sufficient overparameterization all training points are support vectors: solutions obtained by least-squares minimum-norm interpolation, typically used for regression, are identical to those produced by the hard-margin support vector machine (SVM) that minimizes the hinge loss, typically used for training classifiers. On the other hand, we show that there exist regimes where these interpolating solutions generalize well when evaluated by the 0-1 test loss function, but do not generalize if evaluated by the square loss function, i.e. they approach the null risk. Our results demonstrate the very different roles and properties of loss functions used at the training phase (optimization) and the testing phase (generalization).

LGMar 21, 2019
Harmless interpolation of noisy data in regression

Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian et al.

A continuing mystery in understanding the empirical success of deep neural networks is their ability to achieve zero training error and generalize well, even when the training data is noisy and there are more parameters than data points. We investigate this overparameterized regime in linear regression, where all solutions that minimize training error interpolate the data, including noise. We characterize the fundamental generalization (mean-squared) error of any interpolating solution in the presence of noise, and show that this error decays to zero with the number of features. Thus, overparameterization can be explicitly beneficial in ensuring harmless interpolation of noise. We discuss two root causes for poor generalization that are complementary in nature -- signal "bleeding" into a large number of alias features, and overfitting of noise by parsimonious feature selectors. For the sparse linear model with noise, we provide a hybrid interpolating scheme that mitigates both these issues and achieves order-optimal MSE over all possible interpolating solutions.

LGMay 22, 2018
Best of many worlds: Robust model selection for online supervised learning

Vidya Muthukumar, Mitas Ray, Anant Sahai et al.

We introduce algorithms for online, full-information prediction that are competitive with contextual tree experts of unknown complexity, in both probabilistic and adversarial settings. We show that by incorporating a probabilistic framework of structural risk minimization into existing adaptive algorithms, we can robustly learn not only the presence of stochastic structure when it exists (leading to constant as opposed to $\mathcal{O}(\sqrt{T})$ regret), but also the correct model order. We thus obtain regret bounds that are competitive with the regret of an optimal algorithm that possesses strong side information about both the complexity of the optimal contextual tree expert and whether the process generating the data is stochastic or adversarial. These are the first constructive guarantees on simultaneous adaptivity to the model and the presence of stochasticity.

SPJan 14, 2018
Cooperative Multi-Agent Reinforcement Learning for Low-Level Wireless Communication

Colin de Vrieze, Shane Barratt, Daniel Tsai et al.

Traditional radio systems are strictly co-designed on the lower levels of the OSI stack for compatibility and efficiency. Although this has enabled the success of radio communications, it has also introduced lengthy standardization processes and imposed static allocation of the radio spectrum. Various initiatives have been undertaken by the research community to tackle the problem of artificial spectrum scarcity by both making frequency allocation more dynamic and building flexible radios to replace the static ones. There is reason to believe that just as computer vision and control have been overhauled by the introduction of machine learning, wireless communication can also be improved by utilizing similar techniques to increase the flexibility of wireless networks. In this work, we pose the problem of discovering low-level wireless communication schemes ex-nihilo between two agents in a fully decentralized fashion as a reinforcement learning problem. Our proposed approach uses policy gradients to learn an optimal bi-directional communication scheme and shows surprisingly sophisticated and intelligent learning behavior. We present the results of extensive experiments and an analysis of the fidelity of our approach.