49.4LGMay 28
Convex Basins in Single-Index Model Loss Landscapes: Applications to Robust Recovery under Strong Adversarial CorruptionSantanu Das, Sagnik Chatterjee, Jatin Batra
We study the problem of robustly learning Gaussian Single Index Models (SIMs) in the presence of heavy-tailed noise and a constant fraction of adversarially corrupted covariates and responses. Prior work on robust recovery has considered settings such as linear regression (Pensia et al., JASA 2024), strictly monotonic link functions (Awasthi et al., NeurIPS 2022), and phase retrieval (Buna and Rebeschini, AISTATS 2025). However, these techniques do not extend to generic asymmetric non-monotonic link functions such as \textsc{GeLU} and \textsc{Swish}, which arise naturally as scalar primitives in modern gated neural architectures. We close this gap by giving the first robust recovery algorithm with near-linear sample and time complexity for generic non-monotonic link functions, thereby establishing the first robust recovery guarantees for a broad family of nonlinear SIMs for which \textit{no guarantees were previously known}. Our central contribution is a new structural understanding of the Gaussian squared-loss landscape under adversarial contamination. Crucially, we prove that for a broad class of nonlinear non-monotonic SIMs, a dimension-independent, constant-radius convex basin exists around the ground truth and is efficiently reachable via robust spectral initialization even under adversarial contamination. Prior works fail to establish both guarantees simultaneously, thereby either breaking down under adversarial contamination or failing to handle generic non-monotonic link functions. Together, these structural insights yield a principled warm start for robust gradient descent that provably converges to a final estimation error of $O(σ\sqrtε)$ in $\tilde{O}(nd)$ time with $\tilde{O}(d)$ samples, where $ε$ is the contamination fraction.
QUANT-PHOct 1, 2022
Efficient Quantum Agnostic Improper Learning of Decision TreesSagnik Chatterjee, Tharrmashastha SAPV, Debajyoti Bera
The agnostic setting is the hardest generalization of the PAC model since it is akin to learning with adversarial noise. In this paper, we give a poly$(n,t,{\frac{1}{\varepsilon}})$ quantum algorithm for learning size $t$ decision trees with uniform marginal over instances, in the agnostic setting, without membership queries. Our algorithm is the first algorithm (classical or quantum) for learning decision trees in polynomial time without membership queries. We show how to construct a quantum agnostic weak learner by designing a quantum version of the classical Goldreich-Levin algorithm that works with strongly biased function oracles. We show how to quantize the agnostic boosting algorithm by Kalai and Kanade (NIPS 2009) to obtain the first efficient quantum agnostic boosting algorithm. Our quantum boosting algorithm has a polynomial improvement in the dependence of the bias of the weak learner over all adaptive quantum boosting algorithms while retaining the standard speedup in the VC dimension over classical boosting algorithms. We then use our quantum boosting algorithm to boost the weak quantum learner we obtained in the previous step to obtain a quantum agnostic learner for decision trees. Using the above framework, we also give quantum decision tree learning algorithms for both the realizable setting and random classification noise model, again without membership queries.
QUANT-PHJul 6, 2023
Quantum Solutions to the Privacy vs. Utility TradeoffSagnik Chatterjee, Vyacheslav Kungurtsev
In this work, we propose a novel architecture (and several variants thereof) based on quantum cryptographic primitives with provable privacy and security guarantees regarding membership inference attacks on generative models. Our architecture can be used on top of any existing classical or quantum generative models. We argue that the use of quantum gates associated with unitary operators provides inherent advantages compared to standard Differential Privacy based techniques for establishing guaranteed security from all polynomial-time adversaries.
18.0CLApr 27
Dual-Track CoT: Budget-Aware Stepwise Guidance for Small LMsSagnik Chatterjee, Atharva Patil, Sricharan Ramesh
Large Language Models (LLMs) solve many reasoning tasks via chain-of-thought (CoT) prompting, but smaller models (about 7 to 8B parameters) still struggle with multi-step reasoning under tight compute and token budgets. Existing test time reasoning methods such as self consistency (sampling multiple rationales and voting), Tree-of-Thoughts (search over intermediate thoughts), and critique revise loops improve performance, but often at high token cost and without fine-grained step-level control. This project1 aims to address that gap: can Small Language Models (SLMs) reason reliably using the same or fewer tokens? This question is both scientific and practical. Scientifically, it probes whether process supervision and simple test-time controls (such as token budgets and rejection of redundant steps) can substitute for model scale or large sampling counts. Practically, many deployments (on-device, low-latency, or cost-constrained settings) cannot afford huge models or dozens of sampled rationales per query. A method that improves SLM reasoning at fixed cost would therefore be directly useful.
20.5SEMay 10
Trajectory Supervision for Continual Tool-Use Learning in LLMsVishnu Vardhan Reddy, Sagnik Chatterjee, Soumik Bhatta
Most language-model training data shows final artifacts, not the process that produced them. We study a tractable version of this question in tool use: when a model learns a stream of new API domains, does keeping tool-use trajectories help compared with stripping the intermediate API trace? We fine-tune Llama 3.1 8B Instruct with QLoRA on API-Bank using four sequential domain blocks. Condition A strips previous API request/response lines from the prompt and trains the model to predict the next API call. Condition B keeps the trajectory context. In a single-seed pilot, full held-out generation evaluation shows that Condition B reaches 56.9\% final exact full-call accuracy compared with 39.2\% for Condition A. B also improves final API-name accuracy by 7.7 points. However, B uses 25.1\% more training tokens, the run uses one seed, and the task is next-call prediction rather than full dialogue success.
42.9GNMay 10
SCOPE: Siamese Contrastive Operon Pair Embeddings for Functional Sequence Representation and ClassificationAkarsh Gupta, Kenneth Rodrigues, Sagnik Chatterjee
Identifying operons is a fundamental step in understanding prokaryotic gene regulation, as classifying genes into operons supports the reconstruction of regulatory networks, functional annotation of unannotated genes, and drug candidate development. Experimental approaches such as RT-PCR and RNA-seq provide precise evidence of operon structure, but are laborious and largely limited to well-studied model organisms, making scalable computational methods essential for genome-wide operon identification. Prior computational approaches have employed traditional classifiers such as logistic regression and decision trees, motivating our use of these as physicochemical baselines. The DGEB benchmark evaluates operonic pair classification by embedding each sequence independently with a pre-trained protein language model and computing pairwise cosine similarity. In contrast, our Siamese MLP learns a classifier over the fused embedding space, which is theoretically better motivated for binary classification, as cosine similarity can yield meaningless scores depending on the regularization of the embedding model. While protein language model embeddings substantially outperform physicochemical features in ROC-AUC, a learned Siamese MLP head does not significantly improve over unsupervised cosine similarity in Average Precision, suggesting that the geometry of the embedding space already captures the functional relationships needed for this task. Nonetheless, our Siamese MLP achieves a ROC-AUC of 0.71, competitive with state-of-the-art models on the DGEB leaderboard. These findings indicate that protein language model embeddings are a viable, scalable foundation for operonic pair classification across diverse microbial genomes, with implications for automated genome annotation, regulatory network reconstruction, and characterization of organisms lacking experimental operon annotations.
10.1ITMar 30
Perfect Secret Key Generation for a class of Hypergraphical SourcesManuj Mukherjee, Sagnik Chatterjee, Alhad Sethi
Nitinawarat and Narayan proposed a perfect secret key generation scheme for the so-called \emph{pairwise independent network (PIN) model} by exploiting the combinatorial properties of the underlying graph, namely the spanning tree packing rate. This work considers a generalization of the PIN model where the underlying graph is replaced with a hypergraph, and makes progress towards designing similar perfect secret key generation schemes by exploiting the combinatorial properties of the hypergraph. Our contributions are two-fold. We first provide a capacity achieving scheme for a complete $t$-uniform hypergraph on $m$ vertices by leveraging a packing of the complete $t$-uniform hypergraphs by what we refer to as star hypergraphs, and designing a scheme that gives $\binom{m-2}{t-2}$ bits of perfect secret key per star graph. Our second contribution is a 2-bit perfect secret key generation scheme for 3-uniform star hypergraphs whose projections are cycles. This scheme is then extended to a perfect secret key generation scheme for generic 3-uniform hypergraphs by exploiting star graph packing of 3-uniform hypergraphs and Hamiltonian packings of graphs. The scheme is then shown to be capacity achieving for certain classes of hypergraphs.
LGMay 22, 2024
Generalization Bounds for Dependent Data using Online-to-Batch ConversionSagnik Chatterjee, Manuj Mukherjee, Alhad Sethi
In this work, we upper bound the generalization error of batch learning algorithms trained on samples drawn from a mixing stochastic process (i.e., a dependent data source) both in expectation and with high probability. Unlike previous results by Mohri et al. (2010) and Fu et al. (2023), our work does not require any stability assumptions on the batch learner, which allows us to derive upper bounds for any batch learning algorithm trained on dependent data. This is made possible due to our use of the Online-to-Batch ( OTB ) conversion framework, which allows us to shift the burden of stability from the batch learner to an artificially constructed online learner. We show that our bounds are equal to the bounds in the i.i.d. setting up to a term that depends on the decay rate of the underlying mixing stochastic process. Central to our analysis is a new notion of algorithmic stability for online learning algorithms based on Wasserstein distances of order one. Furthermore, we prove that the EWA algorithm, a textbook family of online learning algorithms, satisfies our new notion of stability. Following this, we instantiate our bounds using the EWA algorithm.
QUANT-PHFeb 1
The Quantum Learning Menagerie (A survey on Quantum learning for Classical concepts)Sagnik Chatterjee
This paper surveys various results in the field of Quantum Learning theory, specifically focusing on learning quantum-encoded classical concepts in the Probably Approximately Correct (PAC) framework. The cornerstone of this work is the emphasis on query, sample, and time complexity separations between classical and quantum learning that emerge under learning with query access to different labeling oracles. This paper aims to consolidate all known results in the area under the above umbrella and underscore the limits of our understanding by leaving the reader with 23 open problems.
QUANT-PHOct 25, 2021
Quantum Boosting using Domain-Partitioning HypothesesDebajyoti Bera, Rohan Bhatia, Parmeet Singh Chani et al.
Boosting is an ensemble learning method that converts a weak learner into a strong learner in the PAC learning framework. Freund and Schapire designed the Godel prize-winning algorithm named AdaBoost that can boost learners, which output binary hypotheses. Recently, Arunachalam and Maity presented the first quantum boosting algorithm with similar theoretical guarantees. Their algorithm, which we refer to as QAdaBoost henceforth, is a quantum adaptation of AdaBoost and only works for the binary hypothesis case. QAdaBoost is quadratically faster than AdaBoost in terms of the VC-dimension of the hypothesis class of the weak learner but polynomially worse in the bias of the weak learner. Izdebski et al. posed an open question on whether we can boost quantum weak learners that output non-binary hypothesis. In this work, we address this open question by developing the QRealBoost algorithm which was motivated by the classical RealBoost algorithm. The main technical challenge was to provide provable guarantees for convergence, generalization bounds, and quantum speedup, given that quantum subroutines are noisy and probabilistic. We prove that QRealBoost retains the quadratic speedup of QAdaBoost over AdaBoost and further achieves a polynomial speedup over QAdaBoost in terms of both the bias of the learner and the time taken by the learner to learn the target concept class. Finally, we perform empirical evaluations on QRealBoost and report encouraging observations on quantum simulators by benchmarking the convergence performance of QRealBoost against QAdaBoost, AdaBoost, and RealBoost on a subset of the MNIST dataset and Breast Cancer Wisconsin dataset.