Harini Suresh

h-index21

15papers

1,833citations

Novelty40%

AI Score44

Ranked #47,101 of 194,257 authors (top 24%)#10,822 in LG (top 27%)

15 Papers

12.4LGJun 7, 2022Code

Saliency Cards: A Framework to Characterize and Compare Saliency Methods

Angie Boggust, Harini Suresh, Hendrik Strobelt et al. · mit

Saliency methods are a common class of machine learning interpretability techniques that calculate how important each input feature is to a model's output. We find that, with the rapid pace of development, users struggle to stay informed of the strengths and limitations of new methods and, thus, choose methods for unprincipled reasons (e.g., popularity). Moreover, despite a corresponding rise in evaluation metrics, existing approaches assume universal desiderata for saliency methods (e.g., faithfulness) that do not account for diverse user needs. In response, we introduce saliency cards: structured documentation of how saliency methods operate and their performance across a battery of evaluative metrics. Through a review of 25 saliency method papers and 33 method evaluations, we identify 10 attributes that users should account for when choosing a method. We group these attributes into three categories that span the process of computing and interpreting saliency: methodology, or how the saliency is calculated; sensitivity, or the relationship between the saliency and the underlying model and data; and, perceptibility, or how an end user ultimately interprets the result. By collating this information, saliency cards allow users to more holistically assess and compare the implications of different methods. Through nine semi-structured interviews with users from various backgrounds, including researchers, radiologists, and computational biologists, we find that saliency cards provide a detailed vocabulary for discussing individual methods and allow for a more systematic selection of task-appropriate methods. Moreover, with saliency cards, we are able to analyze the research landscape in a more structured fashion to identify opportunities for new methods and evaluation metrics for unmet user needs.

8.7LGJun 27, 2022

Improved Text Classification via Test-Time Augmentation

Helen Lu, Divya Shanmugam, Harini Suresh et al.

Test-time augmentation -- the aggregation of predictions across transformed examples of test inputs -- is an established technique to improve the performance of image classification models. Importantly, TTA can be used to improve model performance post-hoc, without additional training. Although test-time augmentation (TTA) can be applied to any data modality, it has seen limited adoption in NLP due in part to the difficulty of identifying label-preserving transformations. In this paper, we present augmentation policies that yield significant accuracy improvements with language models. A key finding is that augmentation policy design -- for instance, the number of samples generated from a single, non-deterministic augmentation -- has a considerable impact on the benefit of TTA. Experiments across a binary classification task and dataset show that test-time augmentation can deliver consistent improvements over current state-of-the-art approaches.

9.4CYMay 15

How to Stop Playing Whack-a-Mole: Mapping the Ecosystem of Technologies Facilitating AI-Generated Non-Consensual Intimate Images

Michelle L. Ding, Harini Suresh, Suresh Venkatasubramanian

The last decade has witnessed a rapid advancement of generative AI technology that significantly scaled the accessibility of AI-generated non-consensual intimate images (AIG-NCII), a form of image-based sexual abuse that disproportionately harms and silences women and girls. There is a patchwork of commendable efforts across industry, policy, academia, and civil society to address AIG-NCII. However, these efforts lack a shared, consistent mental model that clearly situates the technologies they target within the context of a large, interconnected, and ever-evolving technological ecosystem. As a result, interventions remain siloed and are difficult to evaluate and compare, leading to a reactive cycle of whack-a-mole. In this paper, we contribute the first comprehensive AIG-NCII technological ecosystem that maps and taxonomizes 11 categories of technologies facilitating the creation, distribution, proliferation and discovery, infrastructural support, and monetization of AIG-NCII. First, we build and visualize the ecosystem through a synthesis of over a hundred primary sources from researchers, journalists, advocates, policymakers, and technologists. Then, we conduct two detailed walkthroughs to demonstrate the usefulness of the ecosystem in 1) making sense of new AIG-NCII harms using a case study of Grok and 2) mapping a clearer tech policy landscape using U.S. federal law and 63 state laws. We conclude with a vision for future AIG-NCII research that refines the edges of the ecosystem, recommending researchers to study critical relationships between technologies and potential ripple effects from different interventions. Our goal is to produce an AIG-NCII technological ecosystem that provides a clear, shared terminology and framework for stakeholders to move into the future of AIG-NCII prevention with clarity and foresight.

15.9HCApr 24, 2025Code

The Malicious Technical Ecosystem: Exposing Limitations in Technical Governance of AI-Generated Non-Consensual Intimate Images of Adults

Michelle L. Ding, Harini Suresh

In this paper, we adopt a survivor-centered approach to locate and dissect the role of sociotechnical AI governance in preventing AI-Generated Non-Consensual Intimate Images (AIG-NCII) of adults, colloquially known as "deep fake pornography." We identify a "malicious technical ecosystem" or "MTE," comprising of open-source face-swapping models and nearly 200 "nudifying" software programs that allow non-technical users to create AIG-NCII within minutes. Then, using the National Institute of Standards and Technology (NIST) AI 100-4 report as a reflection of current synthetic content governance methods, we show how the current landscape of practices fails to effectively regulate the MTE for adult AIG-NCII, as well as flawed assumptions explaining these gaps.

20.7HCJan 28, 2025

"Ownership, Not Just Happy Talk": Co-Designing a Participatory Large Language Model for Journalism

Emily Tseng, Meg Young, Marianne Aubin Le Quéré et al.

Journalism has emerged as an essential domain for understanding the uses, limitations, and impacts of large language models (LLMs) in the workplace. News organizations face divergent financial incentives: LLMs already permeate newswork processes within financially constrained organizations, even as ongoing legal challenges assert that AI companies violate their copyright. At stake are key questions about what LLMs are created to do, and by whom: How might a journalist-led LLM work, and what can participatory design illuminate about the present-day challenges about adapting ``one-size-fits-all'' foundation models to a given context of use? In this paper, we undertake a co-design exploration to understand how a participatory approach to LLMs might address opportunities and challenges around AI in journalism. Our 20 interviews with reporters, data journalists, editors, labor organizers, product leads, and executives highlight macro, meso, and micro tensions that designing for this opportunity space must address. From these desiderata, we describe the result of our co-design work: organizational structures and functionality for a journalist-controlled LLM. In closing, we discuss the limitations of commercial foundation models for workplace use, and the methodological implications of applying participatory methods to LLM co-design.

16.7HCFeb 17, 2021

Intuitively Assessing ML Model Reliability through Example-Based Explanations and Editing Model Inputs

Harini Suresh, Kathleen M. Lewis, John V. Guttag et al.

Interpretability methods aim to help users build trust in and understand the capabilities of machine learning models. However, existing approaches often rely on abstract, complex visualizations that poorly map to the task at hand or require non-trivial ML expertise to interpret. Here, we present two visual analytics modules that facilitate an intuitive assessment of model reliability. To help users better characterize and reason about a model's uncertainty, we visualize raw and aggregate information about a given input's nearest neighbors. Using an interactive editor, users can manipulate this input in semantically-meaningful ways, determine the effect on the output, and compare against their prior expectations. We evaluate our interface using an electrocardiogram beat classification case study. Compared to a baseline feature importance interface, we find that 14 physicians are better able to align the model's uncertainty with domain-relevant factors and build intuition about its capabilities and limitations.

43.5LGNov 6, 2020

Underspecification Presents Challenges for Credibility in Modern Machine Learning

Alexander D'Amour, Katherine Heller, Dan Moldovan et al.

ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.

2.3QMNov 30, 2019

Image segmentation of liver stage malaria infection with spatial uncertainty sampling

Ava P. Soleimany, Harini Suresh, Jose Javier Gonzalez Ortiz et al.

Global eradication of malaria depends on the development of drugs effective against the silent, yet obligate liver stage of the disease. The gold standard in drug development remains microscopic imaging of liver stage parasites in in vitro cell culture models. Image analysis presents a major bottleneck in this pipeline since the parasite has significant variability in size, shape, and density in these models. As with other highly variable datasets, traditional segmentation models have poor generalizability as they rely on hand-crafted features; thus, manual annotation of liver stage malaria images remains standard. To address this need, we develop a convolutional neural network architecture that utilizes spatial dropout sampling for parasite segmentation and epistemic uncertainty estimation in images of liver stage malaria. Our pipeline produces high-precision segmentations nearly identical to expert annotations, generalizes well on a diverse dataset of liver stage malaria parasites, and promotes independence between learned feature maps to model the uncertainty of generated predictions.

35.7LGJan 28, 2019

A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle

Harini Suresh, John V. Guttag

As machine learning (ML) increasingly affects people and society, awareness of its potential unwanted consequences has also grown. To anticipate, prevent, and mitigate undesirable downstream consequences, it is critical that we understand when and how harm might be introduced throughout the ML life cycle. In this paper, we provide a framework that identifies seven distinct potential sources of downstream harm in machine learning, spanning data collection, development, and deployment. In doing so, we aim to facilitate more productive and precise communication around these issues, as well as more direct, application-grounded ways to mitigate them.

4.4AIJun 30, 2018Code

Modeling Mistrust in End-of-Life Care

Willie Boag, Harini Suresh, Leo Anthony Celi et al.

In this work, we characterize the doctor-patient relationship using a machine learning-derived trust score. We show that this score has statistically significant racial associations, and that by modeling trust directly we find stronger disparities in care than by stratifying on race. We further demonstrate that mistrust is indicative of worse outcomes, but is only weakly associated with physiologically-created severity scores. Finally, we describe sentiment analysis experiments indicating patients with higher levels of mistrust have worse experiences and interactions with their caregivers. This work is a step towards measuring fairer machine learning in the healthcare domain.

16.3LGJun 7, 2018Code

Learning Tasks for Multitask Learning: Heterogenous Patient Populations in the ICU

Harini Suresh, Jen J. Gong, John Guttag

Machine learning approaches have been effective in predicting adverse outcomes in different clinical settings. These models are often developed and evaluated on datasets with heterogeneous patient populations. However, good predictive performance on the aggregate population does not imply good performance for specific groups. In this work, we present a two-step framework to 1) learn relevant patient subgroups, and 2) predict an outcome for separate patient populations in a multi-task framework, where each population is a separate task. We demonstrate how to discover relevant groups in an unsupervised way with a sequence-to-sequence autoencoder. We show that using these groups in a multi-task framework leads to better predictive performance of in-hospital mortality both across groups and overall. We also highlight the need for more granular evaluation of performance when dealing with heterogeneous populations.

14.4LGMay 23, 2017

Clinical Intervention Prediction and Understanding using Deep Networks

Harini Suresh, Nathan Hunt, Alistair Johnson et al.

Real-time prediction of clinical interventions remains a challenge within intensive care units (ICUs). This task is complicated by data sources that are noisy, sparse, heterogeneous and outcomes that are imbalanced. In this paper, we integrate data from all available ICU sources (vitals, labs, notes, demographics) and focus on learning rich representations of this data to predict onset and weaning of multiple invasive interventions. In particular, we compare both long short-term memory networks (LSTM) and convolutional neural networks (CNN) for prediction of five intervention tasks: invasive ventilation, non-invasive ventilation, vasopressors, colloid boluses, and crystalloid boluses. Our predictions are done in a forward-facing manner to enable "real-time" performance, and predictions are made with a six hour gap time to support clinically actionable planning. We achieve state-of-the-art results on our predictive tasks using deep architectures. We explore the use of feature occlusion to interpret LSTM models, and compare this to the interpretability gained from examining inputs that maximally activate CNN outputs. We show that our models are able to significantly outperform baselines in intervention prediction, and provide insight into model learning, which is crucial for the adoption of such models in practice.

7.7LGMar 20, 2017

The Use of Autoencoders for Discovering Patient Phenotypes

Harini Suresh, Peter Szolovits, Marzyeh Ghassemi

We use autoencoders to create low-dimensional embeddings of underlying patient phenotypes that we hypothesize are a governing factor in determining how different patients will react to different interventions. We compare the performance of autoencoders that take fixed length sequences of concatenated timesteps as input with a recurrent sequence-to-sequence autoencoder. We evaluate our methods on around 35,500 patients from the latest MIMIC III dataset from Beth Israel Deaconess Hospital.

2.9AIDec 16, 2015

Feature Representation for ICU Mortality

Harini Suresh

Good predictors of ICU Mortality have the potential to identify high-risk patients earlier, improve ICU resource allocation, or create more accurate population-level risk models. Machine learning practitioners typically make choices about how to represent features in a particular model, but these choices are seldom evaluated quantitatively. This study compares the performance of different representations of clinical event data from MIMIC II in a logistic regression model to predict 36-hour ICU mortality. The most common representations are linear (normalized counts) and binary (yes/no). These, along with a new representation termed "hill", are compared using both L1 and L2 regularization. Results indicate that the introduced "hill" representation outperforms both the binary and linear representations, the hill representation thus has the potential to improve existing models of ICU mortality.

2.2CLJan 12, 2015

Autodetection and Classification of Hidden Cultural City Districts from Yelp Reviews

Harini Suresh, Nicholas Locascio

Topic models are a way to discover underlying themes in an otherwise unstructured collection of documents. In this study, we specifically used the Latent Dirichlet Allocation (LDA) topic model on a dataset of Yelp reviews to classify restaurants based off of their reviews. Furthermore, we hypothesize that within a city, restaurants can be grouped into similar "clusters" based on both location and similarity. We used several different clustering methods, including K-means Clustering and a Probabilistic Mixture Model, in order to uncover and classify districts, both well-known and hidden (i.e. cultural areas like Chinatown or hearsay like "the best street for Italian restaurants") within a city. We use these models to display and label different clusters on a map. We also introduce a topic similarity heatmap that displays the similarity distribution in a city to a new restaurant.