LGMar 20Code
The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $λ$-CalculusAmartya Roy, Rasul Tutunov, Xiaotong Ji et al.
LLMs are increasingly used as general-purpose reasoners, but long inputs remain bottlenecked by a fixed context window. Recursive Language Models (RLMs) address this by externalising the prompt and recursively solving subproblems. Yet existing RLMs depend on an open-ended read-eval-print loop (REPL) in which the model generates arbitrary control code, making execution difficult to verify, predict, and analyse. We introduce $λ$-RLM, a framework for long-context reasoning that replaces free-form recursive code generation with a typed functional runtime grounded in $λ$-calculus. It executes a compact library of pre-verified combinators and uses neural inference only on bounded leaf subproblems, turning recursive reasoning into a structured functional program with explicit control flow. We show that $λ$-RLM admits formal guarantees absent from standard RLMs, including termination, closed-form cost bounds, controlled accuracy scaling with recursion depth, and an optimal partition rule under a simple cost model. Empirically, across four long-context reasoning tasks and nine base models, $λ$-RLM outperforms standard RLM in 29 of 36 model-task comparisons, improves average accuracy by up to +21.9 points across model tiers, and reduces latency by up to 4.1x. These results show that typed symbolic control yields a more reliable and efficient foundation for long-context reasoning than open-ended recursive code generation. The complete implementation of $λ$-RLM, is open-sourced for the community at: https://github.com/lambda-calculus-LLM/lambda-RLM.
AIMay 26
Why LLMs Fail at Causal Discovery and How Interventional Agents EscapeAmartya Roy, Sonali Parbhoo
Causal discovery is a cornerstone of scientific reasoning, yet whether large language models can perform it reliably remains an open question. Recent benchmarks show that even fine-tuned models plateau on simple causal graphs and degrade as complexity grows, but why they fail has not been established. We prove the failure is fundamental: supervised fine-tuning, direct preference optimization, and in-context learning all produce predictors that cannot distinguish between causal graphs generating similar observational data, and any attempt to do so requires the model's internal representations to grow unboundedly, violating the very conditions under which these methods work. We formalize this as a kernel obstruction theorem, establishing that the limitation is intrinsic to the learning paradigm, \emph{not any particular model or dataset}. We propose Agentic Causal Bayesian Optimization (A-CBO), wherein a frozen language model serves as an interventional oracle answering targeted queries about intervention effects, while an external Bayesian loop concentrates beliefs over candidate graphs in logarithmically many rounds. Because the decision operates outside the space where the obstruction applies, A-CBO provably converges while the underlying model remains unchanged. On Corr2Cause, A-CBO matches fine-tuned baselines without any training. On Extended Corr2Cause, a new benchmark scaling to 24 variables with 18K test samples, A-CBO significantly outperforms both fine-tuning and preference optimization, with the advantage growing
CLDec 11, 2025
Causal Reasoning Favors Encoders: On The Limits of Decoder-Only ModelsAmartya Roy, Elamparithy M, Kripabandhu Ghosh et al.
In context learning (ICL) underpins recent advances in large language models (LLMs), although its role and performance in causal reasoning remains unclear. Causal reasoning demands multihop composition and strict conjunctive control, and reliance on spurious lexical relations of the input could provide misleading results. We hypothesize that, due to their ability to project the input into a latent space, encoder and encoder decoder architectures are better suited for said multihop conjunctive reasoning versus decoder only models. To do this, we compare fine-tuned versions of all the aforementioned architectures with zero and few shot ICL in both natural language and non natural language scenarios. We find that ICL alone is insufficient for reliable causal reasoning, often overfocusing on irrelevant input features. In particular, decoder only models are noticeably brittle to distributional shifts, while finetuned encoder and encoder decoder models can generalize more robustly across our tests, including the non natural language split. Both architectures are only matched or surpassed by decoder only architectures at large scales. We conclude by noting that for cost effective, short horizon robust causal reasoning, encoder or encoder decoder architectures with targeted finetuning are preferable.
LGAug 11, 2023
Size Lowerbounds for Deep Operator NetworksAnirbit Mukherjee, Amartya Roy
Deep Operator Networks are an increasingly popular paradigm for solving regression in infinite dimensions and hence solve families of PDEs in one shot. In this work, we aim to establish a first-of-its-kind data-dependent lowerbound on the size of DeepONets required for them to be able to reduce empirical error on noisy data. In particular, we show that for low training errors to be obtained on $n$ data points it is necessary that the common output dimension of the branch and the trunk net be scaling as $Ω\left ( \sqrt[\leftroot{-1}\uproot{-1}4]{n} \right )$. This inspires our experiments with DeepONets solving the advection-diffusion-reaction PDE, where we demonstrate the possibility that at a fixed model size, to leverage increase in this common output dimension and get monotonic lowering of training error, the size of the training data might necessarily need to scale at least quadratically with it.
CLSep 24, 2024
Do the Right Thing, Just Debias! Multi-Category Bias Mitigation Using LLMsAmartya Roy, Danush Khanna, Devanshu Mahapatra et al.
This paper tackles the challenge of building robust and generalizable bias mitigation models for language. Recognizing the limitations of existing datasets, we introduce ANUBIS, a novel dataset with 1507 carefully curated sentence pairs encompassing nine social bias categories. We evaluate state-of-the-art models like T5, utilizing Supervised Fine-Tuning (SFT), Reinforcement Learning (PPO, DPO), and In-Context Learning (ICL) for effective bias mitigation. Our analysis focuses on multi-class social bias reduction, cross-dataset generalizability, and environmental impact of the trained models. ANUBIS and our findings offer valuable resources for building more equitable AI systems and contribute to the development of responsible and unbiased technologies with broad societal impact.
CLDec 7, 2024
On the effective transfer of knowledge from English to Hindi WikipediaParamita Das, Amartya Roy, Ritabrata Chakraborty et al.
Although Wikipedia is the largest multilingual encyclopedia, it remains inherently incomplete. There is a significant disparity in the quality of content between high-resource languages (HRLs, e.g., English) and low-resource languages (LRLs, e.g., Hindi), with many LRL articles lacking adequate information. To bridge these content gaps, we propose a lightweight framework to enhance knowledge equity between English and Hindi. In case the English Wikipedia page is not up-to-date, our framework extracts relevant information from external resources readily available (such as English books) and adapts it to align with Wikipedia's distinctive style, including its \textit{neutral point of view} (NPOV) policy, using in-context learning capabilities of large language models. The adapted content is then machine-translated into Hindi for integration into the corresponding Wikipedia articles. On the other hand, if the English version is comprehensive and up-to-date, the framework directly transfers knowledge from English to Hindi. Our framework effectively generates new content for Hindi Wikipedia sections, enhancing Hindi Wikipedia articles respectively by 65% and 62% according to automatic and human judgment-based evaluations.
LGOct 23, 2025
Competition is the key: A Game Theoretic Causal Discovery ApproachAmartya Roy, Souvik Chakraborty
Causal discovery remains a central challenge in machine learning, yet existing methods face a fundamental gap: algorithms like GES and GraN-DAG achieve strong empirical performance but lack finite-sample guarantees, while theoretically principled approaches fail to scale. We close this gap by introducing a game-theoretic reinforcement learning framework for causal discovery, where a DDQN agent directly competes against a strong baseline (GES or GraN-DAG), always warm-starting from the opponent's solution. This design yields three provable guarantees: the learned graph is never worse than the opponent, warm-starting strictly accelerates convergence, and most importantly, with high probability the algorithm selects the true best candidate graph. To the best of our knowledge, our result makes a first-of-its-kind progress in explaining such finite-sample guarantees in causal discovery: on synthetic SEMs (30 nodes), the observed error probability decays with n, tightly matching theory. On real-world benchmarks including Sachs, Asia, Alarm, Child, Hepar2, Dream, and Andes, our method consistently improves upon GES and GraN-DAG while remaining theoretically safe. Remarkably, it scales to large graphs such as Hepar2 (70 nodes), Dream (100 nodes), and Andes (220 nodes). Together, these results establish a new class of RL-based causal discovery algorithms that are simultaneously provably consistent, sample-efficient, and practically scalable, marking a decisive step toward unifying empirical performance with rigorous finite-sample theory.
LGSep 28, 2025
Guide: Generalized-Prior and Data Encoders for DAG EstimationAmartya Roy, Devharish N, Shreya Ganguly et al.
Modern causal discovery methods face critical limitations in scalability, computational efficiency, and adaptability to mixed data types, as evidenced by benchmarks on node scalability (30, $\le 50$, $\ge 70$ nodes), computational energy demands, and continuous/non-continuous data handling. While traditional algorithms like PC, GES, and ICA-LiNGAM struggle with these challenges, exhibiting prohibitive energy costs for higher-order nodes and poor scalability beyond 70 nodes, we propose \textbf{GUIDE}, a framework that integrates Large Language Model (LLM)-generated adjacency matrices with observational data through a dual-encoder architecture. GUIDE uniquely optimizes computational efficiency, reducing runtime on average by $\approx 42%$ compared to RL-BIC and KCRL methods, while achieving an average $\approx 117%$ improvement in accuracy over both NOTEARS and GraN-DAG individually. During training, GUIDE's reinforcement learning agent dynamically balances reward maximization (accuracy) and penalty avoidance (DAG constraints), enabling robust performance across mixed data types and scalability to $\ge 70$ nodes -- a setting where baseline methods fail.