Nari Johnson

CY
h-index23
11papers
345citations
Novelty31%
AI Score40

11 Papers

LGJun 22, 2022Code
OpenXAI: Towards a Transparent Evaluation of Model Explanations

Chirag Agarwal, Dan Ley, Satyapriya Krishna et al. · cmu, harvard

While several types of post hoc explanation methods have been proposed in recent literature, there is very little work on systematically benchmarking these methods. Here, we introduce OpenXAI, a comprehensive and extensible open-source framework for evaluating and benchmarking post hoc explanation methods. OpenXAI comprises of the following key components: (i) a flexible synthetic data generator and a collection of diverse real-world datasets, pre-trained models, and state-of-the-art feature attribution methods, and (ii) open-source implementations of eleven quantitative metrics for evaluating faithfulness, stability (robustness), and fairness of explanation methods, in turn providing comparisons of several explanation methods across a wide variety of metrics, models, and datasets. OpenXAI is easily extensible, as users can readily evaluate custom explanation methods and incorporate them into our leaderboards. Overall, OpenXAI provides an automated end-to-end pipeline that not only simplifies and standardizes the evaluation of post hoc explanation methods, but also promotes transparency and reproducibility in benchmarking these methods. While the first release of OpenXAI supports only tabular datasets, the explanation methods and metrics that we consider are general enough to be applicable to other data modalities. OpenXAI datasets and models, implementations of state-of-the-art explanation methods and evaluation metrics, are publicly available at this GitHub link.

LGMar 14, 2022
Rethinking Stability for Attribution-based Explanations

Chirag Agarwal, Nari Johnson, Martin Pawelczyk et al. · cmu, harvard

As attribution-based explanation methods are increasingly used to establish model trustworthiness in high-stakes situations, it is critical to ensure that these explanations are stable, e.g., robust to infinitesimal perturbations to an input. However, previous works have shown that state-of-the-art explanation methods generate unstable explanations. Here, we introduce metrics to quantify the stability of an explanation and show that several popular explanation methods are unstable. In particular, we propose new Relative Stability metrics that measure the change in output explanation with respect to change in input, model representation, or output of the underlying predictor. Finally, our experimental evaluation with three real-world datasets demonstrates interesting insights for seven explanation methods and different stability metrics.

HCJun 5, 2022
Use-Case-Grounded Simulations for Explanation Evaluation

Valerie Chen, Nari Johnson, Nicholay Topin et al. · cmu

A growing body of research runs human subject evaluations to study whether providing users with explanations of machine learning models can help them with practical real-world use cases. However, running user studies is challenging and costly, and consequently each study typically only evaluates a limited number of different settings, e.g., studies often only evaluate a few arbitrarily selected explanation methods. To address these challenges and aid user study design, we introduce Use-Case-Grounded Simulated Evaluations (SimEvals). SimEvals involve training algorithmic agents that take as input the information content (such as model explanations) that would be presented to each participant in a human subject study, to predict answers to the use case of interest. The algorithmic agent's test set accuracy provides a measure of the predictiveness of the information content for the downstream use case. We run a comprehensive evaluation on three real-world use cases (forward simulation, model debugging, and counterfactual reasoning) to demonstrate that Simevals can effectively identify which explanation methods will help humans for each use case. These results provide evidence that SimEvals can be used to efficiently screen an important set of user study design decisions, e.g. selecting which explanations should be presented to the user, before running a potentially costly user study.

HCJun 13, 2023
Where Does My Model Underperform? A Human Evaluation of Slice Discovery Algorithms

Nari Johnson, Ángel Alexander Cabrera, Gregory Plumb et al. · cmu

Machine learning (ML) models that achieve high average accuracy can still underperform on semantically coherent subsets ("slices") of data. This behavior can have significant societal consequences for the safety or bias of the model in deployment, but identifying these underperforming slices can be difficult in practice, especially in domains where practitioners lack access to group annotations to define coherent subsets of their data. Motivated by these challenges, ML researchers have developed new slice discovery algorithms that aim to group together coherent and high-error subsets of data. However, there has been little evaluation focused on whether these tools help humans form correct hypotheses about where (for which groups) their model underperforms. We conduct a controlled user study (N = 15) where we show 40 slices output by two state-of-the-art slice discovery algorithms to users, and ask them to form hypotheses about an object detection model. Our results provide positive evidence that these tools provide some benefit over a naive baseline, and also shed light on challenges faced by users during the hypothesis formation step. We conclude by discussing design opportunities for ML and HCI researchers. Our findings point to the importance of centering users when creating and evaluating new tools for slice discovery.

LGJul 8, 2022
Towards a More Rigorous Science of Blindspot Discovery in Image Classification Models

Gregory Plumb, Nari Johnson, Ángel Alexander Cabrera et al. · cmu

A growing body of work studies Blindspot Discovery Methods ("BDM"s): methods that use an image embedding to find semantically meaningful (i.e., united by a human-understandable concept) subsets of the data where an image classifier performs significantly worse. Motivated by observed gaps in prior work, we introduce a new framework for evaluating BDMs, SpotCheck, that uses synthetic image datasets to train models with known blindspots and a new BDM, PlaneSpot, that uses a 2D image representation. We use SpotCheck to run controlled experiments that identify factors that influence BDM performance (e.g., the number of blindspots in a model, or features used to define the blindspot) and show that PlaneSpot is competitive with and in many cases outperforms existing BDMs. Importantly, we validate these findings by designing additional experiments that use real image data from MS-COCO, a large image benchmark dataset. Our findings suggest several promising directions for future work on BDM design and evaluation. Overall, we hope that the methodology and analyses presented in this work will help facilitate a more rigorous science of blindspot discovery.

67.8HCApr 1
Disclosure or Marketing? Analyzing the Efficacy of Vendor Self-reports for Vetting Public-sector AI

Blaine Kuehnert, Nari Johnson, Ravit Dotan et al.

Documentation-based disclosure has become a central governance strategy for responsible AI, particularly in public-sector procurement. Tools such as model cards, datasheets, and AI FactSheets are increasingly expected to support accountability, risk assessment, and informed decision-making across organizational boundaries. Yet there is limited empirical evidence about how these artifacts are produced, interpreted, and used in practice. In this paper, we present a qualitative study of the GovAI Coalition FactSheet, a widely adopted transparency document designed to support AI procurement and governance in government contexts. Drawing on semi-structured interviews with vendors and public-sector practitioners, alongside a systematic analysis of completed FactSheets, we examine how FactSheets are used, what information they surface, and where they fall short. We find that FactSheets are asked to serve multiple and conflicting purposes simultaneously: showcasing vendor offerings, supporting evaluation and due diligence, and facilitating early-stage dialogue between vendors and agencies. These competing expectations, combined with the structural constraints of voluntary and public self-disclosure, limit the ability of FactSheets to function as standalone evaluation or risk-assessment tools. At the same time, our findings suggest that when understood as relational artifacts used to establish trust, shared understanding, and ongoing dialogue, FactSheets can help create conditions that support more meaningful disclosure and governance over time.

90.0CYApr 2
Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed Rubrics

Nari Johnson, Deepthi Sudharsan, Hamna et al.

Measurement is essential to improving AI performance and mitigating harms for marginalized groups. As generative AI systems are rapidly deployed across geographies and contexts, AI measurement practices must be designed to support repeatable, automatable application across different models, datasets, and evaluation settings. But the drive to automate measurement can be in tension with the ability for measurement instruments to capture the expertise and perspectives of communities impacted by AI. Recent work advocates for breaking measurement into several key stages: first moving from an abstract concept to be measured into a precise, "systematized" concept; next operationalizing the systematized concept into a concrete measurement instrument; and finally applying the measurement instrument on data to produce measurements. This opens up an opportunity to concentrate community engagement in the systematization phase before operationalizing and applying measurement instruments. In this paper, we explore how to involve communities in systematizing the concept of "cultural appropriateness" in text-to-image models' representation of culturally significant artifacts through case studies with three communities: blind and low vision individuals residing in the UK, residents of Kerala, and residents of Tamil Nadu. Our systematized concepts reflect community members' lived experiences interacting with each artifact and how they want their material culture to be depicted, demonstrating the value of community involvement in defining valid measures. We explore how these systematized concepts can be operationalized into automated measurement instruments that could be applied using a multimodal LLM-as-a-judge approach and challenges that remain. We reflect on the benefits and limitations of such approaches.

CYNov 19, 2023
Assessing AI Impact Assessments: A Classroom Study

Nari Johnson, Hoda Heidari

Artificial Intelligence Impact Assessments ("AIIAs"), a family of tools that provide structured processes to imagine the possible impacts of a proposed AI system, have become an increasingly popular proposal to govern AI systems. Recent efforts from government or private-sector organizations have proposed many diverse instantiations of AIIAs, which take a variety of forms ranging from open-ended questionnaires to graded score-cards. However, to date that has been limited evaluation of existing AIIA instruments. We conduct a classroom study (N = 38) at a large research-intensive university (R1) in an elective course focused on the societal and ethical implications of AI. We assign students to different organizational roles (for example, an ML scientist or product manager) and ask participant teams to complete one of three existing AI impact assessments for one of two imagined generative AI systems. In our thematic analysis of participants' responses to pre- and post-activity questionnaires, we find preliminary evidence that impact assessments can influence participants' perceptions of the potential risks of generative AI systems, and the level of responsibility held by AI experts in addressing potential harm. We also discover a consistent set of limitations shared by several existing AIIA instruments, which we group into concerns about their format and content, as well as the feasibility and effectiveness of the activity in foreseeing and mitigating potential harms. Drawing on the findings of this study, we provide recommendations for future work on developing and validating AIIAs.

CYNov 14, 2024
Provocation: Who benefits from "inclusion" in Generative AI?

Samantha Dalal, Siobhan Mackenzie Hall, Nari Johnson

The demands for accurate and representative generative AI systems means there is an increased demand on participatory evaluation structures. While these participatory structures are paramount to to ensure non-dominant values, knowledge and material culture are also reflected in AI models and the media they generate, we argue that dominant structures of community participation in AI development and evaluation are not explicit enough about the benefits and harms that members of socially marginalized groups may experience as a result of their participation. Without explicit interrogation of these benefits by AI developers, as a community we may remain blind to the immensity of systemic change that is needed as well. To support this provocation, we present a speculative case study, developed from our own collective experiences as AI researchers. We use this speculative context to itemize the barriers that need to be overcome in order for the proposed benefits to marginalized communities to be realized, and harms mitigated.

CYNov 7, 2024
Legacy Procurement Practices Shape How U.S. Cities Govern AI: Understanding Government Employees' Practices, Challenges, and Needs

Nari Johnson, Elise Silva, Harrison Leon et al.

Most AI tools adopted by governments are not developed internally, but instead are acquired from third-party vendors in a process called public procurement. In this paper, we conduct the first empirical study of how United States cities' procurement practices shape critical decisions surrounding public sector AI. We conduct semi-structured interviews with 19 city employees who oversee AI procurement across 7 U.S. cities. We found that cities' legacy procurement practices, which are shaped by decades-old laws and norms, establish infrastructure that determines which AI is purchased, and which actors hold decision-making power over procured AI. We characterize the emerging actions cities have taken to adapt their purchasing practices to address algorithmic harms. From employees' reflections on real-world AI procurements, we identify three key challenges that motivate but are not fully addressed by existing AI procurement reform initiatives. Based on these findings, we discuss implications and opportunities for the FAccT community to support cities in foreseeing and preventing AI harms throughout the public procurement processes.

LGSep 22, 2021
Learning Predictive and Interpretable Timeseries Summaries from ICU Data

Nari Johnson, Sonali Parbhoo, Andrew Slavin Ross et al.

Machine learning models that utilize patient data across time (rather than just the most recent measurements) have increased performance for many risk stratification tasks in the intensive care unit. However, many of these models and their learned representations are complex and therefore difficult for clinicians to interpret, creating challenges for validation. Our work proposes a new procedure to learn summaries of clinical time-series that are both predictive and easily understood by humans. Specifically, our summaries consist of simple and intuitive functions of clinical data (e.g. falling mean arterial pressure). Our learned summaries outperform traditional interpretable model classes and achieve performance comparable to state-of-the-art deep learning models on an in-hospital mortality classification task.