LG · Aug 15, 2023
REFORMS: Reporting Standards for Machine Learning Based Science
Sayash Kapoor, Emily Cantrell, Kenny Peng et al. · princeton
Machine learning (ML) methods are proliferating in scientific research. However, the adoption of these methods has been accompanied by failures of validity, reproducibility, and generalizability. These failures can hinder scientific progress, lead to false consensus around invalid claims, and undermine the credibility of ML-based science. ML methods are often applied and fail in similar ways across disciplines. Motivated by this observation, our goal is to provide clear reporting standards for ML-based science. Drawing from an extensive review of past literature, we present the REFORMS checklist ($\textbf{Re}$porting Standards $\textbf{For}$ $\textbf{M}$achine Learning Based $\textbf{S}$cience). It consists of 32 questions and a paired set of guidelines. REFORMS was developed based on a consensus of 19 researchers across computer science, data science, mathematics, social sciences, and biomedical sciences. REFORMS can serve as a resource for researchers when designing and implementing a study, for referees when reviewing papers, and for journals when enforcing standards for transparency and reproducibility.
LG · Mar 12, 2022
The worst of both worlds: A comparative analysis of errors in learning from data in psychology and machine learning
Jessica Hullman, Sayash Kapoor, Priyanka Nanayakkara et al.
Recent arguments that machine learning (ML) is facing a reproducibility and replication crisis suggest that some published claims in ML research cannot be taken at face value. These concerns inspire analogies to the replication crisis affecting the social and medical sciences. They also inspire calls for the integration of statistical approaches to causal inference and predictive modeling. A deeper understanding of what reproducibility concerns in supervised ML research have in common with the replication crisis in experimental science puts the new concerns in perspective, and helps researchers avoid "the worst of both worlds," where ML researchers begin borrowing methodologies from explanatory modeling without understanding their limitations and vice versa. We contribute a comparative analysis of concerns about inductive learning that arise in causal attribution as exemplified in psychology versus predictive modeling as exemplified in ML. We identify themes that re-occur in reform discussions, like overreliance on asymptotic theory and non-credible beliefs about real-world data generating processes. We argue that in both fields, claims from learning are implied to generalize outside the specific environment studied (e.g., the input dataset or subject sample, modeling implementation, etc.) but are often impossible to refute due to undisclosed sources of variance in the learning pipeline. In particular, errors being acknowledged in ML expose cracks in long-held beliefs that optimizing predictive accuracy using huge datasets absolves one from having to consider a true data generating process or formally represent uncertainty in performance claims. We conclude by discussing risks that arise when sources of errors are misdiagnosed and the need to acknowledge the role of human inductive biases in learning and reform.
HC · Aug 10, 2023
Are We Closing the Loop Yet? Gaps in the Generalizability of VIS4ML Research
Hariharan Subramonyam, Jessica Hullman
Visualization for machine learning (VIS4ML) research aims to help experts apply their prior knowledge to develop, understand, and improve the performance of machine learning models. In conceiving VIS4ML systems, researchers characterize the nature of human knowledge to support human-in-the-loop tasks, design interactive visualizations to make ML components interpretable and elicit knowledge, and evaluate the effectiveness of human-model interchange. We survey recent VIS4ML papers to assess the generalizability of research contributions and claims in enabling human-in-the-loop ML. Our results show potential gaps between the current scope of VIS4ML research and aspirations for its use in practice. We find that while papers motivate that VIS4ML systems are applicable beyond the specific conditions studied, conclusions are often overfitted to non-representative scenarios, are based on interactions with a small set of ML experts and well-understood datasets, fail to acknowledge crucial dependencies, and hinge on decisions that lack justification. We discuss approaches to close the gap between aspirations and research claims and suggest documentation practices to report generality constraints that better acknowledge the exploratory nature of VIS4ML research.
LG · Nov 30, 2023
Pre-registration for Predictive Modeling
Jake M. Hofman, Angelos Chatzimparmpas, Amit Sharma et al.
Amid rising concerns of reproducibility and generalizability in predictive modeling, we explore the possibility and potential benefits of introducing pre-registration to the field. Despite notable advancements in predictive modeling, spanning core machine learning tasks to various scientific applications, challenges such as overlooked contextual factors, data-dependent decision-making, and unintentional re-use of test data have raised questions about the integrity of results. To address these issues, we propose adapting pre-registration practices from explanatory modeling to predictive modeling. We discuss current best practices in predictive modeling and their limitations, introduce a lightweight pre-registration template, and present a qualitative study with machine learning researchers to gain insight into the effectiveness of pre-registration in preventing biased estimates and promoting more reliable research outcomes. We conclude by exploring the scope of problems that pre-registration can address in predictive modeling and acknowledging its limitations within this context.
CY · Aug 5, 2024
A Conceptual Framework for Ethical Evaluation of Machine Learning Systems
Neha R. Gupta, Jessica Hullman, Hari Subramonyam
Research in Responsible AI has developed a range of principles and practices to ensure that machine learning systems are used in a manner that is ethical and aligned with human values. However, a critical yet often neglected aspect of ethical ML is the ethical implications that appear when designing evaluations of ML systems. For instance, teams may have to balance a trade-off between highly informative tests to ensure downstream product safety, with potential fairness harms inherent to the implemented testing procedures. We conceptualize ethics-related concerns in standard ML evaluation techniques. Specifically, we present a utility framework, characterizing the key trade-off in ethical evaluation as balancing information gain against potential ethical harms. The framework is then a tool for characterizing challenges teams face, and systematically disentangling competing considerations that teams seek to balance. Differentiating between different types of issues encountered in evaluation allows us to highlight best practices from analogous domains, such as clinical trials and automotive crash testing, which navigate these issues in ways that can offer inspiration to improve evaluation processes in ML. Our analysis underscores the critical need for development teams to deliberately assess and manage ethical complexities that arise during the evaluation of ML systems, and for the industry to move towards designing institutional policies to support ethical evaluations.
CY · Aug 21, 2023
Artificial Intelligence and Aesthetic Judgment
Jessica Hullman, Ari Holtzman, Andrew Gelman
Generative AIs produce creative outputs in the style of human expression. We argue that encounters with the outputs of modern generative AI models are mediated by the same kinds of aesthetic judgments that organize our interactions with artwork. The interpretation procedure we use on art we find in museums is not an innate human faculty, but one developed over history by disciplines such as art history and art criticism to fulfill certain social functions. This gives us pause when considering our reactions to generative AI, how we should approach this new medium, and why generative AI seems to incite so much fear about the future. We naturally inherit a conundrum of causal inference from the history of art: a work can be read as a symptom of the cultural conditions that influenced its creation while simultaneously being framed as a timeless, seemingly acausal distillation of an eternal human condition. In this essay, we focus on an unresolved tension when we bring this dilemma to bear in the context of generative AI: are we looking for proof that generated media reflects something about the conditions that created it or some eternal human essence? Are current modes of interpretation sufficient for this task? Historically, new forms of art have changed how art is interpreted, with such influence used as evidence that a work of art has touched some essential human truth. As generative AI influences contemporary aesthetic judgment we outline some of the pitfalls and traps in attempting to scrutinize what AI generated media means.
LG · Apr 2
Learning from the Right Rollouts: Data Attribution for PPO-based LLM Post-Training
Dong Shu, Denghui Zhang, Jessica Hullman
Traditional RL algorithms like Proximal Policy Optimization (PPO) typically train on the entire rollout buffer, operating under the assumption that all generated episodes provide a beneficial optimization signal. However, these episodes frequently contain noisy or unfaithful reasoning, which can degrade model performance and slow down training. In this paper, we propose Influence-Guided PPO (I-PPO), a novel framework that integrates data attribution into the RL post-training loop. By calculating an influence score for each episode using a gradient-based approximation, I-PPO identifies and eliminates episodes that are anti-aligned with a validation gradient. Our experiments demonstrate that I-PPO consistently outperforms SFT and PPO baselines. We show that our filtering process acts as an intrinsic early stopping mechanism, accelerating training efficiency while effectively reducing unfaithful CoT reasoning.
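The filtering idea in this abstract can be illustrated with a minimal sketch. It assumes the influence score is a plain dot product between a per-episode gradient and a validation gradient, and that "anti-aligned" means a negative score; the abstract only says "gradient-based approximation", so the actual I-PPO scoring rule may differ. The function names and toy gradient vectors are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def influence_scores(episode_grads: List[Sequence[float]],
                     validation_grad: Sequence[float]) -> List[float]:
    """Assumed influence proxy: alignment of each episode gradient
    with the validation gradient (positive = aligned)."""
    return [dot(g, validation_grad) for g in episode_grads]


def filter_rollouts(episode_grads: List[Sequence[float]],
                    validation_grad: Sequence[float],
                    threshold: float = 0.0) -> List[int]:
    """Return indices of episodes whose score is at least `threshold`,
    dropping the anti-aligned ones."""
    scores = influence_scores(episode_grads, validation_grad)
    return [i for i, s in enumerate(scores) if s >= threshold]


# Toy gradients (hypothetical): the second episode points away from
# the validation gradient and is removed from the buffer.
episode_grads = [[1.0, 0.5], [-1.0, -0.2], [0.2, -0.1]]
validation_grad = [1.0, 1.0]
kept = filter_rollouts(episode_grads, validation_grad)
```

In a real post-training loop the gradients would be high-dimensional and typically approximated, and only the kept episodes would enter the PPO update.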
AI · Feb 17
This human study did not involve human subjects: Validating LLM simulations as behavioral evidence
Jessica Hullman, David Broska, Huaman Sun et al.
A growing literature uses large language models (LLMs) as synthetic participants to generate cost-effective and nearly instantaneous responses in social science experiments. However, there is limited guidance on when such simulations support valid inference about human behavior. We contrast two strategies for obtaining valid estimates of causal effects and clarify the assumptions under which each is suitable for exploratory versus confirmatory research. Heuristic approaches seek to establish that simulated and observed human behavior are interchangeable through prompt engineering, model fine-tuning, and other repair strategies designed to reduce LLM-induced inaccuracies. While useful for many exploratory tasks, heuristic approaches lack the formal statistical guarantees typically required for confirmatory research. In contrast, statistical calibration combines auxiliary human data with statistical adjustments to account for discrepancies between observed and simulated responses. Under explicit assumptions, statistical calibration preserves validity and provides more precise estimates of causal effects at lower cost than experiments that rely solely on human participants. Yet the potential of both approaches depends on how well LLMs approximate the relevant populations. We consider what opportunities are overlooked when researchers focus myopically on substituting LLMs for human participants in a study.
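To make "statistical calibration" concrete, here is a minimal sketch of one possible adjustment: a difference estimator in the style of prediction-powered inference. The abstract does not name a specific estimator, so this choice is an assumption, as are the function name and the toy numbers. The sketch estimates a single mean outcome; a causal effect could be handled by applying the same correction within each experimental condition and differencing.

```python
from statistics import mean
from typing import Sequence


def calibrated_mean(sim_all: Sequence[float],
                    sim_paired: Sequence[float],
                    human_paired: Sequence[float]) -> float:
    """Mean of many cheap simulated responses, corrected by the average
    human-minus-simulation gap measured on a small paired subset."""
    if len(sim_paired) != len(human_paired):
        raise ValueError("paired samples must have equal length")
    correction = mean(h - s for h, s in zip(human_paired, sim_paired))
    return mean(sim_all) + correction


# Hypothetical data: the simulation overshoots humans by about 0.1.
sim_all = [0.6, 0.7, 0.8, 0.7]   # large pool of LLM responses
sim_paired = [0.6, 0.8]          # LLM responses on the paired items
human_paired = [0.5, 0.7]        # human responses on the same items
estimate = calibrated_mean(sim_all, sim_paired, human_paired)
```

The correction term is what separates this from a heuristic approach: validity rests on the paired human sample rather than on the simulation being interchangeable with people.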
AI · Feb 23
ComplLLM: Fine-tuning LLMs to Discover Complementary Signals for Decision-making
Ziyang Guo, Yifan Wu, Jason Hartline et al.
Multi-agent decision pipelines can outperform single agent workflows when complementarity holds, i.e., different agents bring unique information to the table to inform a final decision. We propose ComplLLM, a post-training framework based on decision theory that fine-tunes a decision-assistant LLM using complementary information as reward to output signals that complement existing agent decisions. We validate ComplLLM on synthetic and real-world tasks involving domain experts, demonstrating how the approach recovers known complementary information and produces plausible explanations of complementary signals to support downstream decision-makers.
AI · Jan 27, 2024
A Decision Theoretic Framework for Measuring AI Reliance
Ziyang Guo, Yifan Wu, Jason Hartline et al.
Humans frequently make decisions with the aid of artificially intelligent (AI) systems. A common pattern is for the AI to recommend an action to the human who retains control over the final decision. Researchers have identified ensuring that a human has appropriate reliance on an AI as a critical component of achieving complementary performance. We argue that the current definition of appropriate reliance used in such research lacks formal statistical grounding and can lead to contradictions. We propose a formal definition of reliance, based on statistical decision theory, which separates the concepts of reliance as the probability the decision-maker follows the AI's recommendation from challenges a human may face in differentiating the signals and forming accurate beliefs about the situation. Our definition gives rise to a framework that can be used to guide the design and interpretation of studies on human-AI complementarity and reliance. Using recent AI-advised decision making studies from literature, we demonstrate how our framework can be used to separate the loss due to mis-reliance from the loss due to not accurately differentiating the signals. We evaluate these losses by comparing to a baseline and a benchmark for complementary performance defined by the expected payoff achieved by a rational decision-maker facing the same decision task as the behavioral decision-makers.
CL · Oct 17, 2024
Measuring Free-Form Decision-Making Inconsistency of Language Models in Military Crisis Simulations
Aryan Shrivastava, Jessica Hullman, Max Lamparth · stanford
There is an increasing interest in using language models (LMs) for automated decision-making, with multiple countries actively testing LMs to aid in military crisis decision-making. To scrutinize relying on LM decision-making in high-stakes settings, we examine the inconsistency of responses in a crisis simulation ("wargame"), similar to reported tests conducted by the US military. Prior work illustrated escalatory tendencies and varying levels of aggression among LMs but was constrained to simulations with pre-defined actions. This was due to the challenges associated with quantitatively measuring semantic differences and evaluating natural language decision-making without relying on pre-defined actions. In this work, we query LMs for free-form responses and use a metric based on BERTScore to measure response inconsistency quantitatively. Leveraging the benefits of BERTScore, we show that the inconsistency metric is robust to linguistic variations that preserve semantic meaning in a question-answering setting across text lengths. We show that all five tested LMs exhibit levels of inconsistency that indicate semantic differences, even when adjusting the wargame setting, anonymizing involved conflict countries, or adjusting the sampling temperature parameter $T$. Further qualitative evaluation shows that models recommend courses of action that share few to no similarities. We also study the impact of different prompt sensitivity variations on inconsistency at temperature $T = 0$. We find that inconsistency due to semantically equivalent prompt variations can exceed response inconsistency from temperature sampling for most studied models across different levels of ablations. Given the high-stakes nature of military deployment, we recommend further consideration be taken before using LMs to inform military decisions or other cases of high-stakes decision-making.
AI · Feb 10, 2025
The Value of Information in Human-AI Decision-making
Ziyang Guo, Yifan Wu, Jason Hartline et al.
Multiple agents are increasingly combined to make decisions with the expectation of achieving complementary performance, where the decisions they make together outperform those made individually. However, knowing how to improve the performance of collaborating agents requires knowing what information and strategies each agent employs. With a focus on human-AI pairings, we contribute a decision-theoretic framework for characterizing the value of information. By defining complementary information, our approach identifies opportunities for agents to better exploit available information in AI-assisted decision workflows. We present a novel explanation technique (ILIV-SHAP) that adapts SHAP explanations to highlight human-complementing information. We validate the effectiveness of ACIV and ILIV-SHAP through a study of human-AI decision-making, and demonstrate the framework on examples from chest X-ray diagnosis and deepfake detection. We find that presenting ILIV-SHAP with AI predictions leads to reliably greater reductions in error over non-AI assisted decisions than vanilla SHAP.
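The decision-theoretic notion of information value that this abstract builds on can be sketched with the textbook definition: the gain in a rational decision-maker's expected payoff from seeing a signal before acting. This is not the paper's ACIV or ILIV-SHAP definition, which the abstract does not spell out; the joint distribution, payoff table, and function names below are hypothetical.

```python
from typing import Dict, Tuple

Payoff = Dict[str, Dict[str, float]]      # payoff[action][state]
Joint = Dict[Tuple[str, str], float]      # joint[(signal, state)] = prob


def best_expected_payoff(state_probs: Dict[str, float], payoff: Payoff) -> float:
    """Expected payoff of the best action under the given beliefs."""
    return max(sum(p * payoff[a][s] for s, p in state_probs.items())
               for a in payoff)


def value_of_signal(joint: Joint, payoff: Payoff) -> float:
    """Expected payoff when acting on the posterior after each signal,
    minus expected payoff when acting on the prior alone."""
    signals = {v for v, _ in joint}
    states = {s for _, s in joint}
    prior = {s: sum(joint.get((v, s), 0.0) for v in signals) for s in states}
    baseline = best_expected_payoff(prior, payoff)
    informed = 0.0
    for v in signals:
        p_v = sum(joint.get((v, s), 0.0) for s in states)
        if p_v == 0.0:
            continue  # signal never occurs, contributes nothing
        posterior = {s: joint.get((v, s), 0.0) / p_v for s in states}
        informed += p_v * best_expected_payoff(posterior, payoff)
    return informed - baseline


# Hypothetical binary task: the signal matches the state 80% of the time.
joint = {("pos", "sick"): 0.4, ("pos", "healthy"): 0.1,
         ("neg", "sick"): 0.1, ("neg", "healthy"): 0.4}
payoff = {"treat": {"sick": 1.0, "healthy": 0.0},
          "wait": {"sick": 0.0, "healthy": 1.0}}
gain = value_of_signal(joint, payoff)
```

Here the prior alone yields an expected payoff of 0.5 and the signal raises it to 0.8, so the signal is worth 0.3 to a rational agent. A signal that is independent of the state would be worth zero.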
LG · Mar 12, 2025
Conformal Prediction and Human Decision Making
Jessica Hullman, Yifan Wu, Dawei Xie et al.
Methods to quantify uncertainty in predictions from arbitrary models are in demand in high-stakes domains like medicine and finance. Conformal prediction has emerged as a popular method for producing a set of predictions with specified average coverage, in place of a single prediction and confidence value. However, the value of conformal prediction sets to assist human decisions remains elusive due to the murky relationship between coverage guarantees and decision makers' goals and strategies. How should we think about conformal prediction sets as a form of decision support? We outline a decision theoretic framework for evaluating predictive uncertainty as informative signals, then contrast what can be said within this framework about idealized use of calibrated probabilities versus conformal prediction sets. Informed by prior empirical results and theories of human decisions under uncertainty, we formalize a set of possible strategies by which a decision maker might use a prediction set. We identify ways in which conformal prediction sets and posthoc predictive uncertainty quantification more broadly are in tension with common goals and needs in human-AI decision making. We give recommendations for future research in predictive uncertainty quantification to support human decision makers.
HC · Feb 17, 2025
Characterizing Photorealism and Artifacts in Diffusion Model-Generated Images
Negar Kamali, Karyn Nakamura, Aakriti Kumar et al.
Diffusion model-generated images can appear indistinguishable from authentic photographs, but these images often contain artifacts and implausibilities that reveal their AI-generated provenance. Given the challenge to public trust in media posed by photorealistic AI-generated images, we conducted a large-scale experiment measuring human detection accuracy on 450 diffusion-model generated images and 149 real images. Based on collecting 749,828 observations and 34,675 comments from 50,444 participants, we find that scene complexity of an image, artifact types within an image, display time of an image, and human curation of AI-generated images all play significant roles in how accurately people distinguish real from AI-generated images. Additionally, we propose a taxonomy characterizing artifacts often appearing in images generated by diffusion models. Our empirical observations and taxonomy offer nuanced insights into the capabilities and limitations of diffusion models to generate photorealistic images in 2024.
CL · Nov 19, 2024
A Computational Method for Measuring "Open Codes" in Qualitative Analysis
John Chen, Alexandros Lotsos, Sihan Cheng et al.
Qualitative analysis is critical to understanding human datasets in many social science disciplines. A central method in this process is inductive coding, where researchers identify and interpret codes directly from the datasets themselves. Yet, this exploratory approach poses challenges for meeting methodological expectations (such as "depth" and "variation"), especially as researchers increasingly adopt Generative AI (GAI) for support. Ground-truth-based metrics are insufficient because they contradict the exploratory nature of inductive coding, while manual evaluation can be labor-intensive. This paper presents a theory-informed computational method for measuring inductive coding results from humans and GAI. Our method first merges individual codebooks using an LLM-enriched algorithm. It measures each coder's contribution against the merged result using four novel metrics: Coverage, Overlap, Novelty, and Divergence. Through two experiments on a human-coded online conversation dataset, we 1) reveal the merging algorithm's impact on metrics; 2) validate the metrics' stability and robustness across multiple runs and different LLMs; and 3) showcase the metrics' ability to diagnose coding issues, such as excessive or irrelevant (hallucinated) codes. Our work provides a reliable pathway for ensuring methodological rigor in human-AI qualitative analysis.
LG · Jul 7, 2025
Bridging Prediction and Intervention Problems in Social Systems
Lydia T. Liu, Inioluwa Deborah Raji, Angela Zhou et al.
Many automated decision systems (ADS) are designed to solve prediction problems -- where the goal is to learn patterns from a sample of the population and apply them to individuals from the same population. In reality, these prediction systems operationalize holistic policy interventions in deployment. Once deployed, ADS can shape impacted population outcomes through an effective policy change in how decision-makers operate, while also being defined by past and present interactions between stakeholders and the limitations of existing organizational, as well as societal, infrastructure and context. In this work, we consider the ways in which we must shift from a prediction-focused paradigm to an interventionist paradigm when considering the impact of ADS within social systems. We argue this requires a new default problem setup for ADS beyond prediction, to instead consider predictions as decision support, final decisions, and outcomes. We highlight how this perspective unifies modern statistical frameworks and other tools to study the design, implementation, and evaluation of ADS systems, and point to the research directions necessary to operationalize this paradigm shift. Using these tools, we characterize the limitations of focusing on isolated prediction tasks, and lay the foundation for a more intervention-oriented approach to developing and deploying ADS.
AI · Jun 28, 2025
Explanations are a means to an end
Jessica Hullman, Ziyang Guo, Berk Ustun
Modern methods for explainable machine learning are designed to describe how models map inputs to outputs--without deep consideration of how these explanations will be used in practice. This paper argues that explanations should be designed and evaluated with a specific end in mind. We describe how to formalize this end in a framework based in statistical decision theory. We show how this functionally-grounded approach can be applied across diverse use cases, such as clinical decision support, providing recourse, or debugging. We demonstrate its use to characterize the maximum "boost" in performance on a particular task that an explanation could provide an idealized decision-maker, preventing misuse due to ambiguity by forcing researchers to specify concrete use cases that can be analyzed in light of models of expected explanation use. We argue that evaluation should meld theoretical and empirical perspectives on the value of explanation, and contribute definitions that span these perspectives.
HC · Nov 3, 2024
Unexploited Information Value in Human-AI Collaboration
Ziyang Guo, Yifan Wu, Jason Hartline et al.
Humans and AIs are often paired on decision tasks with the expectation of achieving complementary performance -- where the combination of human and AI outperforms either one alone. However, how to improve performance of a human-AI team is often not clear without knowing more about what particular information and strategies each agent employs. In this paper, we propose a model based in statistical decision theory to analyze human-AI collaboration from the perspective of what information could be used to improve a human or AI decision. We demonstrate our model on a deepfake detection task to investigate seven video-level features by their unexploited value of information. We compare the human alone, AI alone and human-AI team and offer insights on how the AI assistance impacts people's usage of the information and what information that the AI exploits well might be useful for improving human decisions.
HC · Feb 28, 2025
Seeing Eye to AI? Applying Deep-Feature-Based Similarity Metrics to Information Visualization
Sheng Long, Angelos Chatzimparmpas, Emma Alexander et al.
Judging the similarity of visualizations is crucial to various applications, such as visualization-based search and visualization recommendation systems. Recent studies show deep-feature-based similarity metrics correlate well with perceptual judgments of image similarity and serve as effective loss functions for tasks like image super-resolution and style transfer. We explore the application of such metrics to judgments of visualization similarity. We extend a similarity metric using five ML architectures and three pre-trained weight sets. We replicate results from previous crowd-sourced studies on scatterplot and visual channel similarity perception. Notably, our metric using pre-trained ImageNet weights outperformed gradient-descent tuned MS-SSIM, a multi-scale similarity metric based on luminance, contrast, and structure. Our work contributes to understanding how deep-feature-based metrics can enhance similarity assessments in visualization, potentially improving visual analysis tools and techniques. Supplementary materials are available at https://osf.io/dj2ms.
HC · Jun 12, 2024
How to Distinguish AI-Generated Images from Authentic Photographs
Negar Kamali, Karyn Nakamura, Angelos Chatzimparmpas et al.
The high level of photorealism in state-of-the-art diffusion models like Midjourney, Stable Diffusion, and Firefly makes it difficult for untrained humans to distinguish between real photographs and AI-generated images. To address this problem, we designed a guide to help readers develop a more critical eye toward identifying artifacts, inconsistencies, and implausibilities that often appear in AI-generated images. The guide is organized into five categories of artifacts and implausibilities: anatomical, stylistic, functional, violations of physics, and sociocultural. For this guide, we generated 138 images with diffusion models, curated 9 images from social media, and curated 42 real photographs. These images showcase the kinds of cues that prompt suspicion towards the possibility an image is AI-generated and why it is often difficult to draw conclusions about an image's provenance without any context beyond the pixels in an image. Human-perceptible artifacts are not always present in AI-generated images, but this guide reveals artifacts and implausibilities that often emerge. By drawing attention to these kinds of artifacts and implausibilities, we aim to better equip people to distinguish AI-generated images from real photographs in the future.
HC · Jan 25, 2024
Underspecified Human Decision Experiments Considered Harmful
Jessica Hullman, Alex Kale, Jason Hartline
Decision-making with information displays is a key focus of research in areas like human-AI collaboration and data visualization. However, what constitutes a decision problem, and what is required for an experiment to conclude that decisions are flawed, remain imprecise. We present a widely applicable definition of a decision problem synthesized from statistical decision theory and information economics. We claim that to attribute loss in human performance to bias, an experiment must provide the information that a rational agent would need to identify the normative decision. We evaluate whether recent empirical research on AI-assisted decisions achieves this standard. We find that only 10 (26%) of 39 studies that claim to identify biased behavior presented participants with sufficient information to make this claim in at least one treatment condition. We motivate the value of studying well-defined decision problems by describing a characterization of performance losses they allow to be conceived.
HC · Jan 16, 2024
Evaluating the Utility of Conformal Prediction Sets for AI-Advised Image Labeling
Dongping Zhang, Angelos Chatzimparmpas, Negar Kamali et al.
As deep neural networks are more commonly deployed in high-stakes domains, their black-box nature makes uncertainty quantification challenging. We investigate the presentation of conformal prediction sets--a distribution-free class of methods for generating prediction sets with specified coverage--to express uncertainty in AI-advised decision-making. Through a large online experiment, we compare the utility of conformal prediction sets to displays of Top-1 and Top-k predictions for AI-advised image labeling. In a pre-registered analysis, we find that the utility of prediction sets for accuracy varies with the difficulty of the task: while they result in accuracy on par with or less than Top-1 and Top-k displays for easy images, prediction sets offer some advantage in assisting humans in labeling out-of-distribution (OOD) images in the setting that we studied, especially when the set size is small. Our results empirically pinpoint practical challenges of conformal prediction sets and provide implications on how to incorporate them for real-world decision-making.
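For readers unfamiliar with how a conformal prediction set is formed, the following is a minimal sketch of standard split conformal prediction for classification. It assumes the nonconformity score is one minus the model's probability for the true label; the abstract does not say which score or procedure the study used. The calibration scores, class probabilities, and function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def conformal_threshold(cal_scores: Sequence[float], alpha: float) -> float:
    """Finite-sample quantile of calibration nonconformity scores:
    the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: include every label
    return sorted(cal_scores)[k - 1]


def prediction_set(probs: Dict[str, float], threshold: float) -> List[str]:
    """All labels whose nonconformity score (1 - probability) is within
    the calibrated threshold."""
    return [label for label, p in probs.items() if 1 - p <= threshold]


# Hypothetical calibration scores, each 1 - p(true label) on held-out data.
cal_scores = [0.2, 0.05, 0.6, 0.3, 0.9, 0.1, 0.5]
threshold = conformal_threshold(cal_scores, alpha=0.25)

# Hypothetical model output for a new image.
labels = prediction_set({"cat": 0.5, "dog": 0.45, "fox": 0.05}, threshold)
```

With exchangeable calibration and test data, sets built this way contain the true label with probability at least 1 - alpha on average. Set size grows when the model is uncertain, which is the property a human reading the display would have to interpret.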
CR · Jan 16, 2022
Visualizing Privacy-Utility Trade-Offs in Differentially Private Data Releases
Priyanka Nanayakkara, Johes Bater, Xi He et al.
Organizations often collect private data and release aggregate statistics for the public's benefit. If no steps toward preserving privacy are taken, adversaries may use released statistics to deduce unauthorized information about the individuals described in the private dataset. Differentially private algorithms address this challenge by slightly perturbing underlying statistics with noise, thereby mathematically limiting the amount of information that may be deduced from each data release. Properly calibrating these algorithms -- and in turn the disclosure risk for people described in the dataset -- requires a data curator to choose a value for a privacy budget parameter, $ε$. However, there is little formal guidance for choosing $ε$, a task that requires reasoning about the probabilistic privacy-utility trade-off. Furthermore, choosing $ε$ in the context of statistical inference requires reasoning about accuracy trade-offs in the presence of both measurement error and differential privacy (DP) noise. We present Visualizing Privacy (ViP), an interactive interface that visualizes relationships between $ε$, accuracy, and disclosure risk to support setting and splitting $ε$ among queries. As a user adjusts $ε$, ViP dynamically updates visualizations depicting expected accuracy and risk. ViP also has an inference setting, allowing a user to reason about the impact of DP noise on statistical inferences. Finally, we present results of a study where 16 research practitioners with little to no DP background completed a set of tasks related to setting $ε$ using both ViP and a control. We find that ViP helps participants more correctly answer questions related to judging the probability of where a DP-noised release is likely to fall and comparing between DP-noised and non-private confidence intervals.
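The role of $ε$ can be illustrated with a minimal sketch of the Laplace mechanism, a standard differentially private algorithm in which the noise scale is sensitivity divided by $ε$. The abstract does not state which mechanism ViP visualizes, so this choice is an assumption, as are the function names and the even budget split. The float-based sampler is for illustration only and is not hardened for production releases.

```python
import random
from typing import List


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Noise scale b = sensitivity / epsilon; smaller epsilon means more noise."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def laplace_release(true_value: float, sensitivity: float,
                    epsilon: float, rng: random.Random) -> float:
    """Add Laplace(0, b) noise, sampled as the difference of two
    independent exponential draws with mean b."""
    b = laplace_scale(sensitivity, epsilon)
    noise = rng.expovariate(1 / b) - rng.expovariate(1 / b)
    return true_value + noise


def split_budget(total_epsilon: float, n_queries: int) -> List[float]:
    """Evenly split a total budget across queries (one simple policy;
    under sequential composition the per-query values add up)."""
    return [total_epsilon / n_queries] * n_queries


rng = random.Random(42)  # fixed seed only to make the sketch repeatable
released = laplace_release(true_value=100.0, sensitivity=1.0,
                           epsilon=0.5, rng=rng)
budgets = split_budget(total_epsilon=1.0, n_queries=4)
```

The expected absolute error of a Laplace release equals the scale b, so halving $ε$ doubles the typical error. That is the privacy-utility trade-off a curator has to reason about when setting and splitting the budget.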
HC Aug 22, 2021
Visualizing Uncertainty in Probabilistic Graphs with Network Hypothetical Outcome Plots (NetHOPs)
Dongping Zhang, Eytan Adar, Jessica Hullman
Probabilistic graphs are challenging to visualize using the traditional node-link diagram. Encoding edge probability using visual variables like width or fuzziness makes it difficult for users of static network visualizations to estimate network statistics like densities, isolates, path lengths, or clustering under uncertainty. We introduce Network Hypothetical Outcome Plots (NetHOPs), a visualization technique that animates a sequence of network realizations sampled from a network distribution defined by probabilistic edges. NetHOPs employ an aggregation and anchoring algorithm used in dynamic and longitudinal graph drawing to parameterize layout stability for uncertainty estimation. We present a community matching algorithm to enable visualizing the uncertainty of cluster membership and community occurrence. We describe the results of a study in which 51 network experts used NetHOPs to complete a set of common visual analysis tasks and reported how they perceived network structures and properties subject to uncertainty. Participants' estimates fell, on average, within 11% of the ground truth statistics, suggesting NetHOPs can be a reasonable approach for enabling network analysts to reason about multiple properties under uncertainty. Participants appeared to articulate the distribution of network statistics slightly more accurately when they could manipulate the layout anchoring and the animation speed. Based on these findings, we synthesize design recommendations for developing and using animated visualizations for probabilistic networks.
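The sampling step behind such an animation can be sketched as follows. This is only an illustration: it assumes each edge is present independently with its stated probability, and it omits the layout anchoring and community matching described above. The names `sample_realizations` and `density` are hypothetical.

```python
import numpy as np

def sample_realizations(edge_probs, n_frames, rng):
    """Draw n_frames network realizations from probabilistic edges.

    edge_probs: dict mapping an edge (u, v) to its probability of existing.
    Assumes edges are independent of one another.
    """
    edges = list(edge_probs)
    probs = np.array([edge_probs[e] for e in edges])
    frames = []
    for _ in range(n_frames):
        keep = rng.random(len(edges)) < probs
        frames.append([e for e, k in zip(edges, keep) if k])
    return frames

def density(edge_list, n_nodes):
    """Fraction of possible undirected edges present in one realization."""
    return len(edge_list) / (n_nodes * (n_nodes - 1) / 2)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    probs = {(0, 1): 0.9, (1, 2): 0.5, (0, 2): 0.1}
    frames = sample_realizations(probs, 200, rng)
    print(np.mean([density(f, 3) for f in frames]))
```

Each frame would be drawn as an ordinary node-link diagram, so a statistic such as density is read from the spread across frames rather than from a visual encoding of edge probability.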
HC Aug 10, 2021
Visualization Equilibrium
Paula Kayongo, Glenn Sun, Jason Hartline et al.
In many real-world strategic settings, people use information displays to make decisions. In these settings, an information provider chooses which information to provide to strategic agents and how to present it, and agents formulate a best response based on the information and their anticipation of how others will behave. We contribute the results of a controlled online experiment to examine how the provision and presentation of information impacts people's decisions in a congestion game. Our experiment compares how different visualization approaches for displaying this information, including bar charts and hypothetical outcome plots, and different information conditions, including where the visualized information is private versus public (i.e., available to all agents), affect decision making and welfare. We characterize the effects of visualization anticipation, referring to changes to behavior when an agent goes from alone having access to a visualization to knowing that others also have access to the visualization to guide their decisions. We also empirically identify the visualization equilibrium, i.e., the visualization for which the visualized outcome of agents' decisions matches the realized decisions of the agents who view it. We reflect on the implications of visualization equilibria and visualization anticipation for designing information displays for real-world strategic settings.
HC Jul 28, 2021
Causal Support: Modeling Causal Inferences with Visualizations
Alex Kale, Yifan Wu, Jessica Hullman
Analysts often make visual causal inferences about possible data-generating models. However, visual analytics (VA) software tends to leave these models implicit in the mind of the analyst, which casts doubt on the statistical validity of informal visual "insights". We formally evaluate the quality of causal inferences from visualizations by adopting causal support -- a Bayesian cognition model that learns the probability of alternative causal explanations given some data -- as a normative benchmark for causal inferences. We contribute two experiments assessing how well crowdworkers can detect (1) a treatment effect and (2) a confounding relationship. We find that chart users' causal inferences tend to be insensitive to sample size such that they deviate from our normative benchmark. While interactively cross-filtering data in visualizations can improve sensitivity, on average users do not perform reliably better with common visualizations than they do with textual contingency tables. These experiments demonstrate the utility of causal support as an evaluation framework for inferences in VA and point to opportunities to make analysts' mental models more explicit in VA software.
HC Jul 16, 2021
An Automated Approach to Reasoning About Task-Oriented Insights in Responsive Visualization
Hyeok Kim, Ryan Rossi, Abhraneel Sarma et al.
Authors often transform a large screen visualization for smaller displays through rescaling, aggregation and other techniques when creating visualizations for both desktop and mobile devices (i.e., responsive visualization). However, transformations can alter relationships or patterns implied by the large screen view, requiring authors to reason carefully about what information to preserve while adjusting their design for the smaller display. We propose an automated approach to approximating the loss of support for task-oriented visualization insights (identification, comparison, and trend) in responsive transformation of a source visualization. We operationalize identification, comparison, and trend loss as objective functions calculated by comparing properties of the rendered source visualization to each realized target (small screen) visualization. To evaluate the utility of our approach, we train machine learning models on human ranked small screen alternative visualizations across a set of source visualizations. We find that our approach achieves an accuracy of 84% (random forest model) in ranking visualizations. We demonstrate this approach in a prototype responsive visualization recommender that enumerates responsive transformations using Answer Set Programming and evaluates the preservation of task-oriented insights using our loss measures. We discuss implications of our approach for the development of automated and semi-automated responsive visualization recommendation.
HC Apr 15, 2021
Design Patterns and Trade-Offs in Responsive Visualization for Communication
Hyeok Kim, Dominik Moritz, Jessica Hullman
Increased access to mobile devices motivates the need to design communicative visualizations that are responsive to varying screen sizes. However, relatively little design guidance or tooling is currently available to authors. We contribute a detailed characterization of responsive visualization strategies in communication-oriented visualizations, identifying 76 total strategies by analyzing 378 pairs of large screen (LS) and small screen (SS) visualizations from online articles and reports. Our analysis distinguishes between the Targets of responsive visualization, referring to what elements of a design are changed, and Actions, representing how targets are changed. We identify key trade-offs related to authors' need to maintain graphical density, referring to the amount of information per pixel, while also maintaining the "message" or intended takeaways for users of a visualization. We discuss implications of our findings for future visualization tool design to support responsive transformation of visualization designs, including requirements for automated recommenders for communication-oriented responsive visualizations.
HC Apr 5, 2021
To design interfaces for exploratory data analysis, we need theories of graphical inference
Jessica Hullman, Andrew Gelman
Research and development in computer science and statistics have produced increasingly sophisticated software interfaces for interactive and exploratory analysis, optimized for easy pattern finding and data exposure. But design philosophies that emphasize exploration over other phases of analysis risk confusing a need for flexibility with a conclusion that exploratory visual analysis is inherently model-free and cannot be formalized. We describe how without a grounding in theories of human statistical inference, research in exploratory visual analysis can lead to contradictory interface objectives and representations of uncertainty that can discourage users from drawing valid inferences. We discuss how the concept of a model check in a Bayesian statistical framework unites exploratory and confirmatory analysis, and how this understanding relates to other proposed theories of graphical inference. Viewing interactive analysis as driven by model checks suggests new directions for software and empirical research around exploratory and visual analysis. For example, systems should enable specifying and explicitly comparing data to null and other reference distributions and better representations of uncertainty. Implications of Bayesian and other theories of graphical inference should be tested against outcomes of interactive analysis by people to drive theory development.
HC Aug 1, 2020
Bayesian-Assisted Inference from Visualized Data
Yea-Seul Kim, Paula Kayongo, Madeleine Grunde-McLaughlin et al.
A Bayesian view of data interpretation suggests that a visualization user should update their existing beliefs about a parameter's value in accordance with the amount of information about the parameter value captured by the new observations. Extending recent work applying Bayesian models to understand and evaluate belief updating from visualizations, we show how the predictions of Bayesian inference can be used to guide more rational belief updating. We design a Bayesian inference-assisted uncertainty analogy that numerically relates uncertainty in observed data to the user's subjective uncertainty, and a posterior visualization that prescribes how a user should update their beliefs given their prior beliefs and the observed data. In a pre-registered experiment on 4,800 people, we find that when a newly observed data sample is relatively small (N=158), both techniques reliably improve people's Bayesian updating on average compared to the current best practice of visualizing uncertainty in the observed data. For large data samples (N=5208), where people's updated beliefs tend to deviate more strongly from the prescriptions of a Bayesian model, we find evidence that the effectiveness of the two forms of Bayesian assistance may depend on people's proclivity toward trusting the source of the data. We discuss how our results provide insight into individual processes of belief updating and subjective uncertainty, and how understanding these aspects of interpretation paves the way for more sophisticated interactive visualizations for analysis and communication.
HC Jul 28, 2020
Visual Reasoning Strategies for Effect Size Judgments and Decisions
Alex Kale, Matthew Kay, Jessica Hullman
Uncertainty visualizations often emphasize point estimates to support magnitude estimates or decisions through visual comparison. However, when design choices emphasize means, users may overlook uncertainty information and misinterpret visual distance as a proxy for effect size. We present findings from a mixed design experiment on Mechanical Turk which tests eight uncertainty visualization designs: 95% containment intervals, hypothetical outcome plots, densities, and quantile dotplots, each with and without means added. We find that adding means to uncertainty visualizations has small biasing effects on both magnitude estimation and decision-making, consistent with discounting uncertainty. We also see that visualization designs that support the least biased effect size estimation do not support the best decision-making, suggesting that a chart user's sense of effect size may not necessarily be identical when they use the same information for different tasks. In a qualitative analysis of users' strategy descriptions, we find that many users switch strategies and do not employ an optimal strategy when one exists. Uncertainty visualizations which are optimally designed in theory may not be the most effective in practice because of the ways that users satisfice with heuristics, suggesting opportunities to better understand visualization effectiveness by modeling sets of potential strategies.
HC Apr 23, 2020
Human Factors in Model Interpretability: Industry Practices, Challenges, and Needs
Sungsoo Ray Hong, Jessica Hullman, Enrico Bertini
As the use of machine learning (ML) models in product development and data-driven decision-making processes has become pervasive in many domains, people's focus on building a well-performing model has increasingly shifted to understanding how their model works. While scholarly interest in model interpretability has grown rapidly in research communities like HCI, ML, and beyond, little is known about how practitioners perceive and aim to provide interpretability in the context of their existing workflows. This lack of understanding of interpretability as practiced may prevent interpretability research from addressing important needs, or lead to unrealistic solutions. To bridge this gap, we conducted 22 semi-structured interviews with industry practitioners to understand how they conceive of and design for interpretability while they plan, build, and use their models. Based on a qualitative analysis of our results, we differentiate interpretability roles, processes, goals and strategies as they exist within organizations making heavy use of ML models. The characterization of interpretability work that emerges from our analysis suggests that model interpretability frequently involves cooperation and mental model comparison between people in different roles, often aimed at building trust not only between people and models but also between people within the organization. We present implications for design that discuss gaps between the interpretability challenges that practitioners face in their practice and approaches proposed in the literature, highlighting possible research directions that can better address real-world needs.
HC Aug 5, 2019
Why Authors Don't Visualize Uncertainty
Jessica Hullman
Clear presentation of uncertainty is the exception rather than the rule in media articles, data-driven reports, and consumer applications, despite proposed techniques for communicating sources of uncertainty in data. This work considers, Why do so many visualization authors choose not to visualize uncertainty? I contribute a detailed characterization of practices, associations, and attitudes related to uncertainty communication among visualization authors, derived from the results of surveying 90 authors who regularly create visualizations for others as part of their work, and interviewing thirteen influential visualization designers. My results highlight challenges that authors face and expose assumptions and inconsistencies in beliefs about the role of uncertainty in visualization. In particular, a clear contradiction arises between authors' acknowledgment of the value of depicting uncertainty and the norm of omitting direct depiction of uncertainty. To help explain this contradiction, I present a rhetorical model of uncertainty omission in visualization-based communication. I also adapt a formal statistical model of how viewers judge the strength of a signal in a visualization to visualization-based communication, to argue that uncertainty communication necessarily reduces degrees of freedom in viewers' statistical inferences. I conclude with recommendations for how visualization research on uncertainty communication could better serve practitioners' current needs and values while deepening understanding of assumptions that reinforce uncertainty omission.
HC Aug 1, 2019
Illusion of Causality in Visualized Data
Cindy Xiong, Joel Shapiro, Jessica Hullman et al.
Students who eat breakfast more frequently tend to have a higher grade point average. From this data, many people might confidently state that a before-school breakfast program would lead to higher grades. This is a reasoning error, because correlation does not necessarily indicate causation -- X and Y can be correlated without one directly causing the other. While this error is pervasive, its prevalence might be amplified or mitigated by the way that the data is presented to a viewer. Across three crowdsourced experiments, we examined whether the way simple data relations are presented would mitigate this reasoning error. The first experiment tested examples similar to the breakfast-GPA relation, varying in the plausibility of the causal link. We asked participants to rate their level of agreement that the relation was correlated, which they rated appropriately as high. However, participants also expressed high agreement with a causal interpretation of the data. Levels of support for the causal interpretation were not equally strong across visualization types: causality ratings were highest for text descriptions and bar graphs, but weaker for scatter plots. But is this effect driven by bar graphs aggregating data into two groups or by the visual encoding type? We isolated data aggregation versus visual encoding type and examined their individual effect on perceived causality. Overall, different visualization designs offer different cognitive reasoning affordances across the same data. High levels of data aggregation by graphs tend to be associated with higher perceived causality in data. Participants perceived line and dot visual encodings as more causal than bar encodings. Our results demonstrate how some visualization designs trigger stronger causal links while choosing others can help mitigate unwarranted perceptions of causality.
HC Jan 9, 2019
Decision-Making Under Uncertainty in Research Synthesis: Designing for the Garden of Forking Paths
Alex Kale, Matthew Kay, Jessica Hullman
To make evidence-based recommendations to decision-makers, researchers conducting systematic reviews and meta-analyses must navigate a garden of forking paths: a series of analytical decision-points, each of which has the potential to influence findings. To identify challenges and opportunities related to designing systems to help researchers manage uncertainty around which of multiple analyses is best, we interviewed 11 professional researchers who conduct research synthesis to inform decision-making within three organizations. We conducted a qualitative analysis identifying 480 analytical decisions made by researchers throughout the scientific process. We present descriptions of current practices in applied research synthesis and corresponding design challenges: making it more feasible for researchers to try and compare analyses, shifting researchers' attention from rationales for decisions to impacts on results, and supporting communication techniques that acknowledge decision-makers' aversions to uncertainty. We identify opportunities to design systems which help researchers explore, reason about, and communicate uncertainty in decision-making about possible analyses in research synthesis.
HC Jan 9, 2019
A Bayesian Cognition Approach to Improve Data Visualization
Yea-Seul Kim, Logan A Walls, Peter Krafft et al.
People naturally bring their prior beliefs to bear on how they interpret new information, yet few formal models exist for accounting for the influence of users' prior beliefs in interactions with data presentations like visualizations. We demonstrate a Bayesian cognitive model for understanding how people interpret visualizations in light of prior beliefs and show how this model provides a guide for improving visualization evaluation. In a first study, we show how applying a Bayesian cognition model to a simple visualization scenario indicates that people's judgments are consistent with a hypothesis that they are doing approximate Bayesian inference. In a second study, we evaluate how sensitive our observations of Bayesian behavior are to different techniques for eliciting people's subjective distributions, and to different datasets. We find that people don't behave consistently with Bayesian predictions for large sample size datasets, and this difference cannot be explained by elicitation technique. In a final study, we show how normative Bayesian inference can be used as an evaluation framework for visualizations, including of uncertainty.
HC Nov 22, 2016
Leveraging Citation Networks to Visualize Scholarly Influence Over Time
Jason Portenoy, Jessica Hullman, Jevin D. West
Assessing the influence of a scholar's work is an important task for funding organizations, academic departments, and researchers. Common methods, such as measures of citation counts, can ignore much of the nuance and multidimensionality of scholarly influence. We present an approach for generating dynamic visualizations of scholars' careers. This approach uses an animated node-link diagram showing the citation network accumulated around the researcher over the course of the career in concert with key indicators, highlighting influence both within and across fields. We developed our design in collaboration with one funding organization -- the Pew Biomedical Scholars program -- but the methods are generalizable to visualizations of scholarly influence. We applied the design method to the Microsoft Academic Graph, which includes more than 120 million publications. We validate our abstractions throughout the process through collaboration with the Pew Biomedical Scholars program officers and summative evaluations with their scholars.