IRMar 20Code
AgentSLR: Automating Systematic Literature Reviews in Epidemiology with Agentic AIShreyansh Padarha, Ryan Othniel Kearns, Tristan Naidoo et al.
Systematic literature reviews are essential for synthesizing scientific evidence but are costly, difficult to scale and time-intensive, creating bottlenecks for evidence-based policy. We study whether large language models can automate the complete systematic review workflow, from article retrieval, article screening, data extraction to report synthesis. Applied to epidemiological reviews of nine WHO-designated priority pathogens and validated against expert-curated ground truth, our open-source agentic pipeline (AgentSLR) achieves performance comparable to human researchers while reducing review time from approximately 7 weeks to 20 hours (a 58x speed-up). Our comparison of five frontier models reveals that performance on SLR is driven less by model size or inference cost than by each model's distinctive capabilities. Through human-in-the-loop validation, we identify key failure modes. Our results demonstrate that agentic AI can substantially accelerate scientific evidence synthesis in specialised domains.
LGDec 3, 2025
Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of ValueJoe Edelman, Tan Zhi-Xuan, Ryan Lowe et al.
Beneficial societal outcomes cannot be guaranteed by aligning individual AI systems with the intentions of their operators or users. Even an AI system that is perfectly aligned to the intentions of its operating organization can lead to bad outcomes if the goals of that organization are misaligned with those of other institutions and individuals. For this reason, we need full-stack alignment, the concurrent alignment of AI systems and the institutions that shape them with what people value. This can be done without imposing a particular vision of individual or collective flourishing. We argue that current approaches for representing values, such as utility functions, preference orderings, or unstructured text, struggle to address these and other issues effectively. They struggle to distinguish values from other signals, to support principled normative reasoning, and to model collective goods. We propose thick models of value will be needed. These structure the way values and norms are represented, enabling systems to distinguish enduring values from fleeting preferences, to model the social embedding of individual choices, and to reason normatively, applying values in new domains. We demonstrate this approach in five areas: AI value stewardship, normatively competent agents, win-win negotiation systems, meaning-preserving economic mechanisms, and democratic regulatory institutions.
CLNov 3, 2025
Measuring what Matters: Construct Validity in Large Language Model BenchmarksAndrew M. Bean, Ryan Othniel Kearns, Angelika Romanou et al.
Evaluating large language models (LLMs) is crucial for both assessing their capabilities and identifying safety or robustness issues prior to deployment. Reliably measuring abstract and complex phenomena such as 'safety' and 'robustness' requires strong construct validity, that is, having measures that represent what matters to the phenomenon. With a team of 29 expert reviewers, we conduct a systematic review of 445 LLM benchmarks from leading conferences in natural language processing and machine learning. Across the reviewed articles, we find patterns related to the measured phenomena, tasks, and scoring metrics which undermine the validity of the resulting claims. To address these shortcomings, we provide eight key recommendations and detailed actionable guidance to researchers and practitioners in developing LLM benchmarks.
CLMar 12
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document CollectionsŁukasz Borchmann, Jordy Van Landeghem, Michał Turski et al.
Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.
LGSep 11, 2025Code
LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual ExplanationsHarry Mayne, Ryan Othniel Kearns, Yushi Yang et al.
To collaborate effectively with humans, language models must be able to explain their decisions in natural language. We study a specific type of self-explanation: self-generated counterfactual explanations (SCEs), where a model explains its prediction by modifying the input such that it would have predicted a different outcome. We evaluate whether LLMs can produce SCEs that are valid, achieving the intended outcome, and minimal, modifying the input no more than necessary. When asked to generate counterfactuals, we find that LLMs typically produce SCEs that are valid, but far from minimal, offering little insight into their decision-making behaviour. Worryingly, when asked to generate minimal counterfactuals, LLMs typically make excessively small edits that fail to change predictions. The observed validity-minimality trade-off is consistent across several LLMs, datasets, and evaluation settings. Our findings suggest that SCEs are, at best, an ineffective explainability tool and, at worst, can provide misleading insights into model behaviour. Proposals to deploy LLMs in high-stakes settings must consider the impact of unreliable self-explanations on downstream decision-making. Our code is available at https://github.com/HarryMayne/SCEs.
CLJun 10, 2025Code
Theory-Grounded Evaluation of Human-Like Fallacy Patterns in LLM ReasoningAndrew Keenan Richardson, Ryan Othniel Kearns, Sean Moss et al.
We study logical reasoning in language models by asking whether their errors follow established human fallacy patterns. Using the Erotetic Theory of Reasoning (ETR) and its open-source implementation, PyETR, we programmatically generate 383 formally specified reasoning problems and evaluate 38 models. For each response, we judge logical correctness and, when incorrect, whether it matches an ETR-predicted fallacy. Two results stand out: (i) as a capability proxy (Chatbot Arena Elo) increases, a larger share of a model's incorrect answers are ETR-predicted fallacies $(ρ=0.360, p=0.0265)$, while overall correctness on this dataset shows no correlation with capability; (ii) reversing premise order significantly reduces fallacy production for many models, mirroring human order effects. Methodologically, PyETR provides an open-source pipeline for unbounded, synthetic, contamination-resistant reasoning tests linked to a cognitive theory, enabling analyses that focus on error composition rather than error rate.
AIMar 15, 2023
Contextual TrustRyan Othniel Kearns
Trust is an important aspect of human life. It provides instrumental value in allowing us to collaborate on and defer actions to others, and intrinsic value in our intimate relationships with romantic partners, family, and friends. In this paper I examine the nature of trust from a philosophical perspective. Specifically I propose to view trust as a context-sensitive state in a manner that will be made precise. The contribution of this paper is threefold. First, I make the simple observation that an individual's trust is typically both action- and context-sensitive. Action-sensitivity means that trust may obtain between a given truster and trustee for only certain actions. Context-sensitivity means that trust may obtain between a given truster and trustee, regarding the same action, in some conditions and not others. I also opine about what kinds of things may play the role of the truster, trustee, and action. Second, I advance a theory for the nature of contextual trust. I propose that the answer to "What does it mean for $A$ to trust $B$ to do $X$ in context $C$?" has two conditions. First, $A$ must take $B$'s doing $X$ as a means towards one of $A$'s ends. Second, $A$ must adopt an unquestioning attitude concerning $B$'s doing $X$ in context $C$. This unquestioning attitude is similar to the attitude introduced in Nguyen 2021. Finally, we explore how contextual trust can help us make sense of trust in general non-interpersonal settings, notably that of artificial intelligence (AI) systems. The field of Explainable Artificial Intelligence (XAI) assigns paramount importance to the problem of user trust in opaque computational models, yet does little to give trust diagnostic or even conceptual criteria. I propose that contextual trust is a natural fit for the task by illustrating that model transparency and explainability map nicely into our construction of the contexts $C$.
AIFeb 17
Quantifying construct validity in large language model evaluationsRyan Othniel Kearns
The LLM community often reports benchmark results as if they are synonymous with general model capabilities. However, benchmarks can have problems that distort performance, like test set contamination and annotator error. How can we know that a benchmark is a reliable indicator of some capability that we want to measure? This question concerns the construct validity of LLM benchmarks, and it requires separating benchmark results from capabilities when we model and predict LLM performance. Both social scientists and computer scientists propose formal models - latent factor models and scaling laws - for identifying the capabilities underlying benchmark scores. However, neither technique is satisfactory for construct validity. Latent factor models ignore scaling laws, and as a result, the capabilities they extract often proxy model size. Scaling laws ignore measurement error, and as a result, the capabilities they extract are both uninterpretable and overfit to the observed benchmarks. This thesis presents the structured capabilities model, the first model to extract interpretable and generalisable capabilities from a large collection of LLM benchmark results. I fit this model and its two alternatives on a large sample of results from the OpenLLM Leaderboard. Structured capabilities outperform latent factor models on parsimonious fit indices, and exhibit better out-of-distribution benchmark prediction than scaling laws. These improvements are possible because neither existing approach separates model scale from capabilities in the appropriate way. Model scale should inform capabilities, as in scaling laws, and these capabilities should inform observed results up to measurement error, as in latent factor models. In combining these two insights, structured capabilities demonstrate better explanatory and predictive power for quantifying construct validity in LLM evaluations.