Aayush Mishra

LG
h-index74
18papers
108citations
Novelty40%
AI Score52

18 Papers

CLOct 12, 2023
Do pretrained Transformers Learn In-Context by Gradient Descent?

Lingfeng Shen, Aayush Mishra, Daniel Khashabi

The emergence of In-Context Learning (ICL) in LLMs remains a remarkable phenomenon that is partially understood. To explain ICL, recent studies have created theoretical connections to Gradient Descent (GD). We ask, do such connections hold up in actual pre-trained language models? We highlight the limiting assumptions in prior works that make their setup considerably different from the practical setup in which language models are trained. For example, their experimental verification uses \emph{ICL objective} (training models explicitly for ICL), which differs from the emergent ICL in the wild. Furthermore, the theoretical hand-constructed weights used in these studies have properties that don't match those of real LLMs. We also look for evidence in real models. We observe that ICL and GD have different sensitivity to the order in which they observe demonstrations. Finally, we probe and compare the ICL vs. GD hypothesis in a natural setting. We conduct comprehensive empirical analyses on language models pre-trained on natural data (LLaMa-7B). Our comparisons of three performance metrics highlight the inconsistent behavior of ICL and GD as a function of various factors such as datasets, models, and the number of demonstrations. We observe that ICL and GD modify the output distribution of language models differently. These results indicate that \emph{the equivalence between ICL and GD remains an open hypothesis} and calls for further studies.

CVNov 3, 2025Code
Privacy Preserving Ordinal-Meta Learning with VLMs for Fine-Grained Fruit Quality Prediction

Riddhi Jain, Manasi Patwardhan, Aayush Mishra et al.

To effectively manage the wastage of perishable fruits, it is crucial to accurately predict their freshness or shelf life using non-invasive methods that rely on visual data. In this regard, deep learning techniques can offer a viable solution. However, obtaining fine-grained fruit freshness labels from experts is costly, leading to a scarcity of data. Closed proprietary Vision Language Models (VLMs), such as Gemini, have demonstrated strong performance in fruit freshness detection task in both zero-shot and few-shot settings. Nonetheless, food retail organizations are unable to utilize these proprietary models due to concerns related to data privacy, while existing open-source VLMs yield sub-optimal performance for the task. Fine-tuning these open-source models with limited data fails to achieve the performance levels of proprietary models. In this work, we introduce a Model-Agnostic Ordinal Meta-Learning (MAOML) algorithm, designed to train smaller VLMs. This approach utilizes meta-learning to address data sparsity and leverages label ordinality, thereby achieving state-of-the-art performance in the fruit freshness classification task under both zero-shot and few-shot settings. Our method achieves an industry-standard accuracy of 92.71%, averaged across all fruits. Keywords: Fruit Quality Prediction, Vision Language Models, Meta Learning, Ordinal Regression

COFeb 6
BayesFlow 2.0: Multi-Backend Amortized Bayesian Inference in Python

Lars Kühmichel, Jerry M. Huang, Valentin Pratz et al.

Modern Bayesian inference involves a mixture of computational methods for estimating, validating, and drawing conclusions from probabilistic models as part of principled workflows. An overarching motif of many Bayesian methods is that they are relatively slow, which often becomes prohibitive when fitting complex models to large data sets. Amortized Bayesian inference (ABI) offers a path to solving the computational challenges of Bayes. ABI trains neural networks on model simulations, rewarding users with rapid inference of any model-implied quantity, such as point estimates, likelihoods, or full posterior distributions. In this work, we present the Python library BayesFlow, Version 2.0, for general-purpose ABI. Along with direct posterior, likelihood, and ratio estimation, the software includes support for multiple popular deep learning backends, a rich collection of generative networks for sampling and density estimation, complete customization and high-level interfaces, as well as new capabilities for hyperparameter optimization, design optimization, and hierarchical modeling. Using a case study on dynamical system parameter estimation, combined with comparisons to similar software, we show that our streamlined, user-friendly workflow has strong potential to support broad adoption.

LGSep 6, 2023
DECODE: Data-driven Energy Consumption Prediction leveraging Historical Data and Environmental Factors in Buildings

Aditya Mishra, Haroon R. Lone, Aayush Mishra

Energy prediction in buildings plays a crucial role in effective energy management. Precise predictions are essential for achieving optimal energy consumption and distribution within the grid. This paper introduces a Long Short-Term Memory (LSTM) model designed to forecast building energy consumption using historical energy data, occupancy patterns, and weather conditions. The LSTM model provides accurate short, medium, and long-term energy predictions for residential and commercial buildings compared to existing prediction models. We compare our LSTM model with established prediction methods, including linear regression, decision trees, and random forest. Encouragingly, the proposed LSTM model emerges as the superior performer across all metrics. It demonstrates exceptional prediction accuracy, boasting the highest R2 score of 0.97 and the most favorable mean absolute error (MAE) of 0.007. An additional advantage of our developed model is its capacity to achieve efficient energy consumption forecasts even when trained on a limited dataset. We address concerns about overfitting (variance) and underfitting (bias) through rigorous training and evaluation on real-world data. In summary, our research contributes to energy prediction by offering a robust LSTM model that outperforms alternative methods and operates with remarkable efficiency, generalizability, and reliability.

MLFeb 26
Unsupervised Continual Learning for Amortized Bayesian Inference

Aayush Mishra, Šimon Kucharský, Paul-Christian Bürkner

Amortized Bayesian Inference (ABI) enables efficient posterior estimation using generative neural networks trained on simulated data, but often suffers from performance degradation under model misspecification. While self-consistency (SC) training on unlabeled empirical data can enhance network robustness, current approaches are limited to static, single-task settings and fail to handle sequentially arriving data or distribution shifts. We propose a continual learning framework for ABI that decouples simulation-based pre-training from unsupervised sequential SC fine-tuning on real-world data. To address the challenge of catastrophic forgetting, we introduce two adaptation strategies: (1) SC with episodic replay, utilizing a memory buffer of past observations, and (2) SC with elastic weight consolidation, which regularizes updates to preserve task-critical parameters. Across three diverse case studies, our methods significantly mitigate forgetting and yield posterior estimates that outperform standard simulation-based training, achieving estimates closer to MCMC reference, providing a viable path for trustworthy ABI across a range of different tasks.

AIApr 10
Steered LLM Activations are Non-Surjective

Aayush Mishra, Daniel Khashabi, Anqi Liu

Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in output behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations and safety research (e.g., studying jailbreakability). However, it is unclear whether steered activation states are realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: for a fixed model, does every steered activation admit a pre-image under the model's natural forward pass? Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the same internal behavior induced by steering. We also illustrate this finding empirically across three widely used LLMs. Our results establish a formal separation between white-box steerability and black-box prompting. We therefore caution against interpreting the ease and success of activation steering as evidence of prompt-based interpretability or vulnerability, and argue for evaluation protocols that explicitly decouple white-box and black-box interventions.

LGJul 26, 2022
Repeated Environment Inference for Invariant Learning

Aayush Mishra, Anqi Liu

We study the problem of invariant learning when the environment labels are unknown. We focus on the invariant representation notion when the Bayes optimal conditional label distribution is the same across different environments. Previous work conducts Environment Inference (EI) by maximizing the penalty term from Invariant Risk Minimization (IRM) framework. The EI step uses a reference model which focuses on spurious correlations to efficiently reach a good environment partition. However, it is not clear how to find such a reference model. In this work, we propose to repeat the EI process and retrain an ERM model on the \textit{majority} environment inferred by the previous EI step. Under mild assumptions, we find that this iterative process helps learn a representation capturing the spurious correlation better than the single step. This results in better Environment Inference and better Invariant Learning. We show that this method outperforms baselines on both synthetic and real-world datasets.

CLSep 28, 2023
Stress Testing Chain-of-Thought Prompting for Large Language Models

Aayush Mishra, Karan Thakkar

This report examines the effectiveness of Chain-of-Thought (CoT) prompting in improving the multi-step reasoning abilities of large language models (LLMs). Inspired by previous studies \cite{Min2022RethinkingWork}, we analyze the impact of three types of CoT prompt perturbations, namely CoT order, CoT values, and CoT operators on the performance of GPT-3 on various tasks. Our findings show that incorrect CoT prompting leads to poor performance on accuracy metrics. Correct values in the CoT is crucial for predicting correct answers. Moreover, incorrect demonstrations, where the CoT operators or the CoT order are wrong, do not affect the performance as drastically when compared to the value based perturbations. This research deepens our understanding of CoT prompting and opens some new questions regarding the capability of LLMs to learn reasoning in context.

LGJun 17, 2025Code
ODD: Overlap-aware Estimation of Model Performance under Distribution Shift

Aayush Mishra, Anqi Liu

Reliable and accurate estimation of the error of an ML model in unseen test domains is an important problem for safe intelligent systems. Prior work uses disagreement discrepancy (DIS^2) to derive practical error bounds under distribution shifts. It optimizes for a maximally disagreeing classifier on the target domain to bound the error of a given source classifier. Although this approach offers a reliable and competitively accurate estimate of the target error, we identify a problem in this approach which causes the disagreement discrepancy objective to compete in the overlapping region between source and target domains. With an intuitive assumption that the target disagreement should be no more than the source disagreement in the overlapping region due to high enough support, we devise Overlap-aware Disagreement Discrepancy (ODD). Maximizing ODD only requires disagreement in the non-overlapping target domain, removing the competition. Our ODD-based bound uses domain-classifiers to estimate domain-overlap and better predicts target performance than DIS^2. We conduct experiments on a wide array of benchmarks to show that our method improves the overall performance-estimation error while remaining valid and reliable. Our code and results are available on GitHub.

MLJan 23, 2025
Robust Amortized Bayesian Inference with Self-Consistency Losses on Unlabeled Data

Aayush Mishra, Daniel Habermann, Marvin Schmitt et al.

Amortized Bayesian inference (ABI) with neural networks can solve probabilistic inverse problems orders of magnitude faster than classical methods. However, ABI is not yet sufficiently robust for widespread and safe application. When performing inference on observations outside the scope of the simulated training data, posterior approximations are likely to become highly biased, which cannot be corrected by additional simulations due to the bad pre-asymptotic behavior of current neural posterior estimators. In this paper, we propose a semi-supervised approach that enables training not only on labeled simulated data generated from the model, but also on \textit{unlabeled} data originating from any source, including real data. To achieve this, we leverage Bayesian self-consistency properties that can be transformed into strictly proper losses that do not require knowledge of ground-truth parameters. We test our approach on several real-world case studies, including applications to high-dimensional time-series and image data. Our results show that semi-supervised learning with unlabeled data drastically improves the robustness of ABI in the out-of-simulation regime. Notably, inference remains accurate even when evaluated on observations far away from the labeled and unlabeled data seen during training.

CLApr 28, 2025
ICL CIPHERS: Quantifying "Learning" in In-Context Learning via Substitution Ciphers

Zhouxiang Fang, Aayush Mishra, Muhan Gao et al.

Recent works have suggested that In-Context Learning (ICL) operates in dual modes, i.e. task retrieval (remember learned patterns from pre-training) and task learning (inference-time ''learning'' from demonstrations). However, disentangling these the two modes remains a challenging goal. We introduce ICL CIPHERS, a class of task reformulations based on substitution ciphers borrowed from classic cryptography. In this approach, a subset of tokens in the in-context inputs are substituted with other (irrelevant) tokens, rendering English sentences less comprehensible to human eye. However, by design, there is a latent, fixed pattern to this substitution, making it reversible. This bijective (reversible) cipher ensures that the task remains a well-defined task in some abstract sense, despite the transformations. It is a curious question if LLMs can solve tasks reformulated by ICL CIPHERS with a BIJECTIVE mapping, which requires ''deciphering'' the latent cipher. We show that LLMs are better at solving tasks reformulated by ICL CIPHERS with BIJECTIVE mappings than the NON-BIJECTIVE (irreversible) baseline, providing a novel approach to quantify ''learning'' in ICL. While this gap is small, it is consistent across the board on four datasets and six models. Finally, we examine LLMs' internal representations and identify evidence in their ability to decode the ciphered inputs.

LGSep 26, 2025
IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning

Aayush Mishra, Daniel Khashabi, Anqi Liu

Supervised Fine-Tuning (SFT) is used to specialize model behavior by training weights to produce intended target responses for queries. In contrast, In-Context Learning (ICL) adapts models during inference with instructions or demonstrations in the prompt. ICL can offer better generalizability and more calibrated responses compared to SFT in data scarce settings, at the cost of more inference compute. In this work, we ask the question: Can ICL's internal computations be used to improve the qualities of SFT? We first show that ICL and SFT produce distinct activation patterns, indicating that the two methods achieve adaptation through different functional mechanisms. Motivated by this observation and to use ICL's rich functionality, we introduce ICL Activation Alignment (IA2), a self-distillation technique which aims to replicate ICL's activation patterns in SFT models and incentivizes ICL-like internal reasoning. Performing IA2 as a priming step before SFT significantly improves the accuracy and calibration of model outputs, as shown by our extensive empirical results on 12 popular benchmarks and 2 model families. This finding is not only practically useful, but also offers a conceptual window into the inner mechanics of model adaptation.

MLDec 16, 2025
Improving the Accuracy of Amortized Model Comparison with Self-Consistency

Šimon Kucharský, Aayush Mishra, Daniel Habermann et al.

Amortized Bayesian inference (ABI) offers fast, scalable approximations to posterior densities by training neural surrogates on data simulated from the statistical model. However, ABI methods are highly sensitive to model misspecification: when observed data fall outside the training distribution (generative scope of the statistical models), neural surrogates can behave unpredictably. This makes it a challenge in a model comparison setting, where multiple statistical models are considered, of which at least some are misspecified. Recent work on self-consistency (SC) provides a promising remedy to this issue, accessible even for empirical data (without ground-truth labels). In this work, we investigate how SC can improve amortized model comparison conceptualized in four different ways. Across two synthetic and two real-world case studies, we find that approaches for model comparison that estimate marginal likelihoods through approximate parameter posteriors consistently outperform methods that directly approximate model evidence or posterior model probabilities. SC training improves robustness when the likelihood is available, even under severe model misspecification. The benefits of SC for methods without access of analytic likelihoods are more limited and inconsistent. Our results suggest practical guidance for reliable amortized Bayesian model comparison: prefer parameter posterior-based methods and augment them with SC training on empirical datasets to mitigate extrapolation bias under model misspecification.

LGNov 16, 2025
Genomic Next-Token Predictors are In-Context Learners

Nathan Breslow, Aayush Mishra, Mahler Revsine et al.

In-context learning (ICL) -- the capacity of a model to infer and apply abstract patterns from examples provided within its input -- has been extensively studied in large language models trained for next-token prediction on human text. In fact, prior work often attributes this emergent behavior to distinctive statistical properties in human language. This raises a fundamental question: can ICL arise organically in other sequence domains purely through large-scale predictive training? To explore this, we turn to genomic sequences, an alternative symbolic domain rich in statistical structure. Specifically, we study the Evo2 genomic model, trained predominantly on next-nucleotide (A/T/C/G) prediction, at a scale comparable to mid-sized LLMs. We develop a controlled experimental framework comprising symbolic reasoning tasks instantiated in both linguistic and genomic forms, enabling direct comparison of ICL across genomic and linguistic models. Our results show that genomic models, like their linguistic counterparts, exhibit log-linear gains in pattern induction as the number of in-context demonstrations increases. To the best of our knowledge, this is the first evidence of organically emergent ICL in genomic sequences, supporting the hypothesis that ICL arises as a consequence of large-scale predictive modeling over rich data. These findings extend emergent meta-learning beyond language, pointing toward a unified, modality-agnostic view of in-context learning.

MLAug 28, 2025
Towards Trustworthy Amortized Bayesian Model Comparison

Šimon Kucharský, Aayush Mishra, Daniel Habermann et al.

Amortized Bayesian model comparison (BMC) enables fast probabilistic ranking of models via simulation-based training of neural surrogates. However, the reliability of neural surrogates deteriorates when simulation models are misspecified - the very case where model comparison is most needed. Thus, we supplement simulation-based training with a self-consistency (SC) loss on unlabeled real data to improve BMC estimates under empirical distribution shifts. Using a numerical experiment and two case studies with real data, we compare amortized evidence estimates with and without SC against analytic or bridge sampling benchmarks. SC improves calibration under model misspecification when having access to analytic likelihoods. However, it offers limited gains with neural surrogate likelihoods, making it most practical for trustworthy BMC when likelihoods are exact.

GAJan 27, 2025
Automatic Machine Learning Framework to Study Morphological Parameters of AGN Host Galaxies within $z < 1.4$ in the Hyper Supreme-Cam Wide Survey

Chuan Tian, C. Megan Urry, Aritra Ghosh et al.

We present a composite machine learning framework to estimate posterior probability distributions of bulge-to-total light ratio, half-light radius, and flux for Active Galactic Nucleus (AGN) host galaxies within $z<1.4$ and $m<23$ in the Hyper Supreme-Cam Wide survey. We divide the data into five redshift bins: low ($0<z<0.25$), mid ($0.25<z<0.5$), high ($0.5<z<0.9$), extra ($0.9<z<1.1$) and extreme ($1.1<z<1.4$), and train our models independently in each bin. We use PSFGAN to decompose the AGN point source light from its host galaxy, and invoke the Galaxy Morphology Posterior Estimation Network (GaMPEN) to estimate morphological parameters of the recovered host galaxy. We first trained our models on simulated data, and then fine-tuned our algorithm via transfer learning using labeled real data. To create training labels for transfer learning, we used GALFIT to fit $\sim 20,000$ real HSC galaxies in each redshift bin. We comprehensively examined that the predicted values from our final models agree well with the GALFIT values for the vast majority of cases. Our PSFGAN + GaMPEN framework runs at least three orders of magnitude faster than traditional light-profile fitting methods, and can be easily retrained for other morphological parameters or on other datasets with diverse ranges of resolutions, seeing conditions, and signal-to-noise ratios, making it an ideal tool for analyzing AGN host galaxies from large surveys coming soon from the Rubin-LSST, Euclid, and Roman telescopes.

SPJan 15, 2025
On the challenges of detecting MCI using EEG in the wild

Aayush Mishra, David Joffe, Sankara Surendra Telidevara et al.

Recent studies have shown promising results in the detection of Mild Cognitive Impairment (MCI) using easily accessible Electroencephalogram (EEG) data which would help administer early and effective treatment for dementia patients. However, the reliability and practicality of such systems remains unclear. In this work, we investigate the potential limitations and challenges in developing a robust MCI detection method using two contrasting datasets: 1) CAUEEG, collected and annotated by expert neurologists in controlled settings and 2) GENEEG, a new dataset collected and annotated in general practice clinics, a setting where routine MCI diagnoses are typically made. We find that training on small datasets, as is done by most previous works, tends to produce high variance models that make overconfident predictions, and are unreliable in practice. Additionally, distribution shifts between datasets make cross-domain generalization challenging. Finally, we show that MCI detection using EEG may suffer from fundamental limitations because of the overlapping nature of feature distributions with control groups. We call for more effort in high-quality data collection in actionable settings (like general practice clinics) to make progress towards this salient goal of non-invasive MCI detection.

CVJan 19, 2024
Source-Free and Image-Only Unsupervised Domain Adaptation for Category Level Object Pose Estimation

Prakhar Kaushik, Aayush Mishra, Adam Kortylewski et al.

We consider the problem of source-free unsupervised category-level pose estimation from only RGB images to a target domain without any access to source domain data or 3D annotations during adaptation. Collecting and annotating real-world 3D data and corresponding images is laborious, expensive, yet unavoidable process, since even 3D pose domain adaptation methods require 3D data in the target domain. We introduce 3DUDA, a method capable of adapting to a nuisance-ridden target domain without 3D or depth data. Our key insight stems from the observation that specific object subparts remain stable across out-of-domain (OOD) scenarios, enabling strategic utilization of these invariant subcomponents for effective model updates. We represent object categories as simple cuboid meshes, and harness a generative model of neural feature activations modeled at each mesh vertex learnt using differential rendering. We focus on individual locally robust mesh vertex features and iteratively update them based on their proximity to corresponding features in the target domain even when the global pose is not correct. Our model is then trained in an EM fashion, alternating between updating the vertex features and the feature extractor. We show that our method simulates fine-tuning on a global pseudo-labeled dataset under mild assumptions, which converges to the target domain asymptotically. Through extensive empirical validation, including a complex extreme UDA setup which combines real nuisances, synthetic noise, and occlusion, we demonstrate the potency of our simple approach in addressing the domain shift challenge and significantly improving pose estimation accuracy.