CLJul 4, 2023
On Evaluating and Mitigating Gender Biases in Multilingual SettingsAniket Vashishtha, Kabir Ahuja, Sunayana Sitaram
While understanding and removing gender biases in language models has been a long-standing problem in Natural Language Processing, prior research work has primarily been limited to English. In this work, we investigate some of the challenges with evaluating and mitigating biases in multilingual settings which stem from a lack of existing benchmarks and resources for bias evaluation beyond English especially for non-western context. In this paper, we first create a benchmark for evaluating gender biases in pre-trained masked language models by extending DisCo to different Indian languages using human annotations. We extend various debiasing methods to work beyond English and evaluate their effectiveness for SOTA massively multilingual models on our proposed metric. Overall, our work highlights the challenges that arise while studying social biases in multilingual settings and provides resources as well as mitigation techniques to take a step toward scaling to more languages.
19.7CLMar 29
Improving Clinical Diagnosis with Counterfactual Multi-Agent ReasoningZhiwen You, Xi Chen, Aniket Vashishtha et al. · cmu
Clinical diagnosis is a complex reasoning process in which clinicians gather evidence, form hypotheses, and test them against alternative explanations. In medical training, this reasoning is explicitly developed through counterfactual questioning--e.g., asking how a diagnosis would change if a key symptom were absent or altered--to strengthen differential diagnosis skills. As large language model (LLM)-based systems are increasingly used for diagnostic support, ensuring the interpretability of their recommendations becomes critical. However, most existing LLM-based diagnostic agents reason over fixed clinical evidence without explicitly testing how individual findings support or weaken competing diagnoses. In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded. Our framework introduces counterfactual case editing to modify clinical findings and evaluate how these changes affect competing diagnoses. We further define the Counterfactual Probability Gap, a method that quantifies how strongly individual findings support a diagnosis by measuring confidence shifts under these edits. These counterfactual signals guide multi-round specialist discussions, enabling agents to challenge unsupported hypotheses, refine differential diagnoses, and produce more interpretable reasoning trajectories. Across three diagnostic benchmarks and seven LLMs, our method consistently improves diagnostic accuracy over prompting and prior multi-agent baselines, with the largest gains observed in complex and ambiguous cases. Human evaluation further indicates that our framework produces more clinically useful, reliable, and coherent reasoning. These results suggest that incorporating counterfactual evidence verification is an important step toward building reliable AI systems for clinical decision support.
AIOct 23, 2023
Causal Order: The Key to Leveraging Imperfect Experts in Causal InferenceAniket Vashishtha, Abbavaram Gowtham Reddy, Abhinav Kumar et al.
Large Language Models (LLMs) have been used as experts to infer causal graphs, often by repeatedly applying a pairwise prompt that asks about the causal relationship of each variable pair. However, such experts, including human domain experts, cannot distinguish between direct and indirect effects given a pairwise prompt. Therefore, instead of the graph, we propose that causal order be used as a more stable output interface for utilizing expert knowledge. Even when querying a perfect expert with a pairwise prompt, we show that the inferred graph can have significant errors whereas the causal order is always correct. In practice, however, LLMs are imperfect experts and we find that pairwise prompts lead to multiple cycles. Hence, we propose the triplet method, a novel querying strategy that introduces an auxiliary variable for every variable pair and instructs the LLM to avoid cycles within this triplet. It then uses a voting-based ensemble method that results in higher accuracy and fewer cycles while ensuring cost efficiency. Across multiple real-world graphs, such a triplet-based method yields a more accurate order than the pairwise prompt, using both LLMs and human annotators. The triplet method enhances robustness by repeatedly querying an expert with different auxiliary variables, enabling smaller models like Phi-3 and Llama-3 8B Instruct to surpass GPT-4 with pairwise prompting. For practical usage, we show how the expert-provided causal order from the triplet method can be used to reduce error in downstream graph discovery and effect inference tasks.
66.3AIMay 25
CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI ScientistsJunlin Yang, Dylan Zhang, Xiangchen Song et al.
We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is supported by a correct hypothesis about the underlying causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. CausaLab also includes a domain-specific language that records the agent's evolving SCM hypothesis, making trajectories inspectable and comparable with ground truth. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge $F_1$. This observation further motivates our exploration of different interaction strategies: Mixed observation--intervention strategies improve structural fidelity: in the mixed 6-node setting, GPT-5.2-high achieves 80% on both task accuracy and all-edge $F_1$. Yet even strong agents struggle to design informative interventions, as pure intervention strategies perform poorly on both task accuracy and all-edge $F_1$. We identify premature stopping as a major weakness of agents, and show that asking the model to verify the consistency between its hypothesis and past data can help mitigate this issue. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.
LGJul 10, 2024
Teaching Transformers Causal Reasoning through Axiomatic TrainingAniket Vashishtha, Abhinav Kumar, Atharva Pandey et al.
For text-based AI systems to interact in the real world, causal reasoning is an essential skill. Since active interventions are costly, we study to what extent a system can learn causal reasoning from symbolic demonstrations of causal axioms. Specifically, we present an axiomatic training method where the system learns from multiple demonstrations of a causal axiom (or rule), rather than incorporating the axiom as an inductive bias or inferring it from data values. A key question is whether the system would learn to generalize from the axiom demonstrations to more complex scenarios. Our results, based on applying axiomatic training to learn the transitivity axiom and d-separation rule, indicate that such generalization is possible. To avoid data contamination issues, we start with a 67 million parameter transformer model and train it from scratch. On both tasks, we find that a model trained on linear causal chains (along with some noisy variations) can generalize well to complex graphs, including longer causal chains, causal chains with reversed order, and graphs with branching.To handle diverse text inputs, the same method is extended to finetune language models. Finetuning Llama-3-8B-Instruct model on our axiomatic data leads to significant gains on causal benchmarks such as Corr2Cause and CLEAR, in some cases providing state-of-the-art performance surpassing GPT-4.
AISep 14, 2020Code
VacSIM: Learning Effective Strategies for COVID-19 Vaccine Distribution using Reinforcement LearningRaghav Awasthi, Keerat Kaur Guliani, Saif Ahmad Khan et al.
A COVID-19 vaccine is our best bet for mitigating the ongoing onslaught of the pandemic. However, vaccine is also expected to be a limited resource. An optimal allocation strategy, especially in countries with access inequities and temporal separation of hot-spots, might be an effective way of halting the disease spread. We approach this problem by proposing a novel pipeline VacSIM that dovetails Deep Reinforcement Learning models into a Contextual Bandits approach for optimizing the distribution of COVID-19 vaccine. Whereas the Reinforcement Learning models suggest better actions and rewards, Contextual Bandits allow online modifications that may need to be implemented on a day-to-day basis in the real world scenario. We evaluate this framework against a naive allocation approach of distributing vaccine proportional to the incidence of COVID-19 cases in five different States across India (Assam, Delhi, Jharkhand, Maharashtra and Nagaland) and demonstrate up to 9039 potential infections prevented and a significant increase in the efficacy of limiting the spread over a period of 45 days through the VacSIM approach. Our models and the platform are extensible to all states of India and potentially across the globe. We also propose novel evaluation strategies including standard compartmental model-based projections and a causality-preserving evaluation of our model. Since all models carry assumptions that may need to be tested in various contexts, we open source our model VacSIM and contribute a new reinforcement learning environment compatible with OpenAI gym to make it extensible for real-world applications across the globe. (http://vacsim.tavlab.iiitd.edu.in:8000/).
LGOct 18, 2025
Realizing LLMs' Causal Potential Requires Science-Grounded, Novel BenchmarksAshutosh Srivastava, Lokesh Nagalapatti, Gautam Jajoo et al.
Recent claims of strong performance by Large Language Models (LLMs) on causal discovery are undermined by a key flaw: many evaluations rely on benchmarks likely included in pretraining corpora. Thus, apparent success suggests that LLM-only methods, which ignore observational data, outperform classical statistical approaches. We challenge this narrative by asking: Do LLMs truly reason about causal structure, and how can we measure it without memorization concerns? Can they be trusted for real-world scientific discovery? We argue that realizing LLMs' potential for causal analysis requires two shifts: (P.1) developing robust evaluation protocols based on recent scientific studies to guard against dataset leakage, and (P.2) designing hybrid methods that combine LLM-derived knowledge with data-driven statistics. To address P.1, we encourage evaluating discovery methods on novel, real-world scientific studies. We outline a practical recipe for extracting causal graphs from recent publications released after an LLM's training cutoff, ensuring relevance and preventing memorization while capturing both established and novel relations. Compared to benchmarks like BNLearn, where LLMs achieve near-perfect accuracy, they perform far worse on our curated graphs, underscoring the need for statistical grounding. Supporting P.2, we show that using LLM predictions as priors for the classical PC algorithm significantly improves accuracy over both LLM-only and purely statistical methods. We call on the community to adopt science-grounded, leakage-resistant benchmarks and invest in hybrid causal discovery methods suited to real-world inquiry.
LGOct 2, 2025
Executable Counterfactuals: Improving LLMs' Causal Reasoning Through CodeAniket Vashishtha, Qirun Dai, Hongyuan Mei et al.
Counterfactual reasoning, a hallmark of intelligence, consists of three steps: inferring latent variables from observations (abduction), constructing alternatives (interventions), and predicting their outcomes (prediction). This skill is essential for advancing LLMs' causal understanding and expanding their applications in high-stakes domains such as scientific research. However, existing efforts in assessing LLM's counterfactual reasoning capabilities tend to skip the abduction step, effectively reducing to interventional reasoning and leading to overestimation of LLM performance. To address this, we introduce executable counterfactuals, a novel framework that operationalizes causal reasoning through code and math problems. Our framework explicitly requires all three steps of counterfactual reasoning and enables scalable synthetic data creation with varying difficulty, creating a frontier for evaluating and improving LLM's reasoning. Our results reveal substantial drop in accuracy (25-40%) from interventional to counterfactual reasoning for SOTA models like o4-mini and Claude-4-Sonnet. To address this gap, we construct a training set comprising counterfactual code problems having if-else condition and test on out-of-domain code structures (e.g. having while-loop); we also test whether a model trained on code would generalize to counterfactual math word problems. While supervised finetuning on stronger models' reasoning traces improves in-domain performance of Qwen models, it leads to a decrease in accuracy on OOD tasks such as counterfactual math problems. In contrast, reinforcement learning induces the core cognitive behaviors and generalizes to new domains, yielding gains over the base model on both code (improvement of 1.5x-2x) and math problems. Analysis of the reasoning traces reinforces these findings and highlights the promise of RL for improving LLMs' counterfactual reasoning.
CLApr 2, 2021
Mining Trends of COVID-19 Vaccine Beliefs on Twitter with Lexical EmbeddingsHarshita Chopra, Aniket Vashishtha, Ridam Pal et al.
Social media plays a pivotal role in disseminating news globally and acts as a platform for people to express their opinions on various topics. A wide variety of views accompanies COVID-19 vaccination drives across the globe, often colored by emotions, which change along with rising cases, approval of vaccines, and multiple factors discussed online. This study aims at analyzing the temporal evolution of different Emotion categories: Hesitation, Rage, Sorrow, Anticipation, Faith, and Contentment with Influencing Factors: Vaccine Rollout, Misinformation, Health Effects, and Inequities as lexical categories created from Tweets belonging to five countries with vital vaccine roll-out programs, namely, India, United States of America, Brazil, United Kingdom, and Australia. We extracted a corpus of nearly 1.8 million Twitter posts related to COVID-19 vaccination. Using cosine distance from selected seed words, we expanded the vocabulary of each category and tracked the longitudinal change in their strength from June 2020 to April 2021. We used community detection algorithms to find modules in positive correlation networks. Our findings suggest that tweets expressing hesitancy towards vaccines contain the highest mentions of health-related effects in all countries. Our results indicated that the patterns of hesitancy were variable across geographies and can help us learn targeted interventions. We also observed a significant change in the linear trends of categories like hesitation and contentment before and after approval of vaccines. Negative emotions like rage and sorrow gained the highest importance in the alluvial diagram. They formed a significant module with all the influencing factors in April 2021, when India observed the second wave of COVID-19 cases. The relationship between Emotions and Influencing Factors was found to be variable across the countries.