Gabriella Waters

CY
h-index22
5papers
28citations
Novelty37%
AI Score37

5 Papers

CYFeb 28
Real-World AI Evaluation: How FRAME Generates Systematic Evidence to Resolve the Decision-Maker's Dilemma

Reva Schwartz, Gabriella Waters

The rapid expansion of AI deployments has put organizational leaders in a decision maker's dilemma: they must govern these technologies without systematic evidence of how systems behave in their own environments. Predominant evaluation methods generate scalable, abstract measures of model capabilities but smooth over the heterogeneity of real world use, while user focused testing reveals rich contextual detail yet remains small in scale and loosely coupled to the mechanisms that shape model behavior. The Forum for Real World AI Measurement and Evaluation (FRAME) addresses this gap by combining large scale trials of AI systems with structured observation of how they are used in context, the outcomes they generate, and how those outcomes arise. By tracing the path from an AI system's output through its practical use and downstream effects, FRAME turns the heterogeneity of AI in use into a measurable signal rather than a trade off for achieving scale. FRAME establishes two core assets to accomplish this: a Testing Sandbox that captures AI use under real workflows at scale and a Metrics Hub that translates those traces into actionable indicators.

LGJan 30
A Comparative Simulation Study of the Fairness and Accuracy of Predictive Policing Systems in Baltimore City

Samin Semsar, Kiran Laxmikant Prabhu, Gabriella Waters et al.

There are ongoing discussions about predictive policing systems, such as those deployed in Los Angeles, California and Baltimore, Maryland, being unfair, for example, by exhibiting racial bias. Studies found that unfairness may be due to feedback loops and being trained on historically biased recorded data. However, comparative studies on predictive policing systems are few and are not sufficiently comprehensive. In this work, we perform a comprehensive comparative simulation study on the fairness and accuracy of predictive policing technologies in Baltimore. Our results suggest that the situation around bias in predictive policing is more complex than was previously assumed. While predictive policing exhibited bias due to feedback loops as was previously reported, we found that the traditional alternative, hot spots policing, had similar issues. Predictive policing was found to be more fair and accurate than hot spots policing in the short term, although it amplified bias faster, suggesting the potential for worse long-run behavior. In Baltimore, in some cases the bias in these systems tended toward over-policing in White neighborhoods, unlike in previous studies. Overall, this work demonstrates a methodology for city-specific evaluation and behavioral-tendency comparison of predictive policing systems, showing how such simulations can reveal inequities and long-term tendencies.

CYMay 24, 2025
Reality Check: A New Evaluation Ecosystem Is Necessary to Understand AI's Real World Effects

Reva Schwartz, Rumman Chowdhury, Akash Kundu et al.

Conventional AI evaluation approaches concentrated within the AI stack exhibit systemic limitations for exploring, navigating and resolving the human and societal factors that play out in real world deployment such as in education, finance, healthcare, and employment sectors. AI capability evaluations can capture detail about first-order effects, such as whether immediate system outputs are accurate, or contain toxic, biased or stereotypical content, but AI's second-order effects, i.e. any long-term outcomes and consequences that may result from AI use in the real world, have become a significant area of interest as the technology becomes embedded in our daily lives. These secondary effects can include shifts in user behavior, societal, cultural and economic ramifications, workforce transformations, and long-term downstream impacts that may result from a broad and growing set of risks. This position paper argues that measuring the indirect and secondary effects of AI will require expansion beyond static, single-turn approaches conducted in silico to include testing paradigms that can capture what actually materializes when people use AI technology in context. Specifically, we describe the need for data and methods that can facilitate contextual awareness and enable downstream interpretation and decision making about AI's secondary effects, and recommend requirements for a new ecosystem.

CYMar 10, 2025
AI Biases as Asymmetries: A Review to Guide Practice

Gabriella Waters, Phillip Honenberger

The understanding of bias in AI is currently undergoing a revolution. Initially understood as errors or flaws, biases are increasingly recognized as integral to AI systems and sometimes preferable to less biased alternatives. In this paper, we review the reasons for this changed understanding and provide new guidance on two questions: First, how should we think about and measure biases in AI systems, consistent with the new understanding? Second, what kinds of bias in an AI system should we accept or even amplify, and what kinds should we minimize or eliminate, and why? The key to answering both questions, we argue, is to understand biases as "violations of a symmetry standard" (following Kelly). We distinguish three main types of asymmetry in AI systems-error biases, inequality biases, and process biases-and highlight places in the pipeline of AI development and application where bias of each type is likely to be good, bad, or inevitable.

CLApr 28, 2024
Bias Neutralization Framework: Measuring Fairness in Large Language Models with Bias Intelligence Quotient (BiQ)

Malur Narayan, John Pasmore, Elton Sampaio et al.

The burgeoning influence of Large Language Models (LLMs) in shaping public discourse and decision-making underscores the imperative to address inherent biases within these AI systems. In the wake of AI's expansive integration across sectors, addressing racial bias in LLMs has never been more critical. This paper introduces a novel framework called Comprehensive Bias Neutralization Framework (CBNF) which embodies an innovative approach to quantifying and mitigating biases within LLMs. Our framework combines the Large Language Model Bias Index (LLMBI) [Oketunji, A., Anas, M., Saina, D., (2023)] and Bias removaL with No Demographics (BLIND) [Orgad, H., Belinkov, Y. (2023)] methodologies to create a new metric called Bias Intelligence Quotient (BiQ)which detects, measures, and mitigates racial bias in LLMs without reliance on demographic annotations. By introducing a new metric called BiQ that enhances LLMBI with additional fairness metrics, CBNF offers a multi-dimensional metric for bias assessment, underscoring the necessity of a nuanced approach to fairness in AI [Mehrabi et al., 2021]. This paper presents a detailed analysis of Latimer AI (a language model incrementally trained on black history and culture) in comparison to ChatGPT 3.5, illustrating Latimer AI's efficacy in detecting racial, cultural, and gender biases through targeted training and refined bias mitigation strategies [Latimer & Bender, 2023].