HCMay 13, 2022Code
Exploring How Machine Learning Practitioners (Try To) Use Fairness ToolkitsWesley Hanwen Deng, Manish Nagireddy, Michelle Seng Ah Lee et al. · cmu
Recent years have seen the development of many open-source ML fairness toolkits aimed at helping ML practitioners assess and address unfairness in their systems. However, there has been little research investigating how ML practitioners actually use these toolkits in practice. In this paper, we conducted the first in-depth empirical exploration of how industry practitioners (try to) work with existing fairness toolkits. In particular, we conducted think-aloud interviews to understand how participants learn about and use fairness toolkits, and explored the generality of our findings through an anonymous online survey. We identified several opportunities for fairness toolkits to better address practitioner needs and scaffold them in using toolkits effectively and responsibly. Based on these findings, we highlight implications for the design of future open-source fairness toolkits that can support practitioners in better contextualizing, communicating, and collaborating around ML fairness efforts.
CLJul 19, 2023
LLMs as Workers in Human-Computational Algorithms? Replicating Crowdsourcing Pipelines with LLMsTongshuang Wu, Haiyi Zhu, Maya Albayrak et al. · cmu
LLMs have shown promise in replicating human-like behavior in crowdsourcing tasks that were previously thought to be exclusive to human abilities. However, current efforts focus mainly on simple atomic tasks. We explore whether LLMs can replicate more complex crowdsourcing pipelines. We find that modern LLMs can simulate some of crowdworkers' abilities in these ``human computation algorithms,'' but the level of success is variable and influenced by requesters' understanding of LLM capabilities, the specific skills required for sub-tasks, and the optimal interaction modality for performing these sub-tasks. We reflect on human and LLMs' different sensitivities to instructions, stress the importance of enabling human-facing safeguards for LLMs, and discuss the potential of training humans and LLMs with complementary skill sets. Crucially, we show that replicating crowdsourcing pipelines offers a valuable platform to investigate 1) the relative LLM strengths on different tasks (by cross-comparing their performances on sub-tasks) and 2) LLMs' potential in complex tasks, where they can complete part of the tasks while leaving others to humans.
HCApr 5, 2022
Improving Human-AI Partnerships in Child Welfare: Understanding Worker Practices, Challenges, and Desires for Algorithmic Decision SupportAnna Kawakami, Venkatesh Sivaraman, Hao-Fei Cheng et al. · cmu
AI-based decision support tools (ADS) are increasingly used to augment human decision-making in high-stakes, social contexts. As public sector agencies begin to adopt ADS, it is critical that we understand workers' experiences with these systems in practice. In this paper, we present findings from a series of interviews and contextual inquiries at a child welfare agency, to understand how they currently make AI-assisted child maltreatment screening decisions. Overall, we observe how workers' reliance upon the ADS is guided by (1) their knowledge of rich, contextual information beyond what the AI model captures, (2) their beliefs about the ADS's capabilities and limitations relative to their own, (3) organizational pressures and incentives around the use of the ADS, and (4) awareness of misalignments between algorithmic predictions and their own decision-making objectives. Drawing upon these findings, we discuss design implications towards supporting more effective human-AI decision-making.
LGJun 30, 2022
A Validity Perspective on Evaluating the Justified Use of Data-driven Decision-making AlgorithmsAmanda Coston, Anna Kawakami, Haiyi Zhu et al. · cmu
Recent research increasingly brings to question the appropriateness of using predictive tools in complex, real-world tasks. While a growing body of work has explored ways to improve value alignment in these tools, comparatively less work has centered concerns around the fundamental justifiability of using these tools. This work seeks to center validity considerations in deliberations around whether and how to build data-driven algorithms in high-stakes domains. Toward this end, we translate key concepts from validity theory to predictive algorithms. We apply the lens of validity to re-examine common challenges in problem formulation and data issues that jeopardize the justifiability of using predictive algorithms and connect these challenges to the social science discourse around validity. Our interdisciplinary exposition clarifies how these concepts apply to algorithmic decision making contexts. We demonstrate how these validity considerations could distill into a series of high-level questions intended to promote and document reflections on the legitimacy of the predictive task and the suitability of the data.
HCAug 30, 2023
Training Towards Critical Use: Learning to Situate AI Predictions Relative to Human KnowledgeAnna Kawakami, Luke Guerdan, Yanghuidi Cheng et al. · cmu
A growing body of research has explored how to support humans in making better use of AI-based decision support, including via training and onboarding. Existing research has focused on decision-making tasks where it is possible to evaluate "appropriate reliance" by comparing each decision against a ground truth label that cleanly maps to both the AI's predictive target and the human decision-maker's goals. However, this assumption does not hold in many real-world settings where AI tools are deployed today (e.g., social work, criminal justice, and healthcare). In this paper, we introduce a process-oriented notion of appropriate reliance called critical use that centers the human's ability to situate AI predictions against knowledge that is uniquely available to them but unavailable to the AI model. To explore how training can support critical use, we conduct a randomized online experiment in a complex social decision-making setting: child maltreatment screening. We find that, by providing participants with accelerated, low-stakes opportunities to practice AI-assisted decision-making in this setting, novices came to exhibit patterns of disagreement with AI that resemble those of experienced workers. A qualitative examination of participants' explanations for their AI-assisted decisions revealed that they drew upon qualitative case narratives, to which the AI model did not have access, to learn when (not) to rely on AI predictions. Our findings open new questions for the study and design of training for real-world AI-assisted decision-making.
HCMay 17
Understanding, Challenging, and Demystifying Perceptions of Gig Worker VulnerabilitiesSander de Jong, Jane Hsieh, Tzu-Sheng Kuo et al.
Across service domains, platform-based gig workers often face a wide range of severe yet hidden vulnerabilities, including opaque pay practices, illusions of flexibility, health and safety risks, and privacy violations. To the general public and inexperienced workers such latent vulnerabilities remain largely unknown and concealed by intentional platform design that gives illusions of adequate labor protections, or $\textit{myths}$. This study examines how workers perceive (and shift their beliefs away from) five commonly held misconceptions regarding gig worker vulnerabilities. In $Phase~I$, crowdworkers ($N~=~236$) rated their agreement with five common myths surrounding vulnerabilities in gig work:$~227$ of them believed one or more myth(s). In $Phase~II$, we challenged these workers to defend their views by presenting an expert- or LLM-generated counterargument. Our findings show workers' underexposure to personal and shared vulnerabilities of gig work, revealing a knowledge gap where persuasive interventions can scalably raise awareness around such hidden labor conditions. We reflect on the effectiveness of different persuasion strategies and discuss implications for promoting more accurate public perceptions that support collective bargaining of workers' rights.
HCMar 26, 2023
Recentering Validity Considerations through Early-Stage Deliberations Around AI and Policy DesignAnna Kawakami, Amanda Coston, Haiyi Zhu et al. · cmu
AI-based decision-making tools are rapidly spreading across a range of real-world, complex domains like healthcare, criminal justice, and child welfare. A growing body of research has called for increased scrutiny around the validity of AI system designs. However, in real-world settings, it is often not possible to fully address questions around the validity of an AI tool without also considering the design of associated organizational and public policies. Yet, considerations around how an AI tool may interface with policy are often only discussed retrospectively, after the tool is designed or deployed. In this short position paper, we discuss opportunities to promote multi-stakeholder deliberations around the design of AI-based technologies and associated policies, at the earliest stages of a new project.
HCMay 18, 2022
Imagining new futures beyond predictive systems in child welfare: A qualitative study with impacted stakeholdersLogan Stapleton, Min Hun Lee, Diana Qing et al.
Child welfare agencies across the United States are turning to data-driven predictive technologies (commonly called predictive analytics) which use government administrative data to assist workers' decision-making. While some prior work has explored impacted stakeholders' concerns with current uses of data-driven predictive risk models (PRMs), less work has asked stakeholders whether such tools ought to be used in the first place. In this work, we conducted a set of seven design workshops with 35 stakeholders who have been impacted by the child welfare system or who work in it to understand their beliefs and concerns around PRMs, and to engage them in imagining new uses of data and technologies in the child welfare system. We found that participants worried current PRMs perpetuate or exacerbate existing problems in child welfare. Participants suggested new ways to use data and data-driven tools to better support impacted communities and suggested paths to mitigate possible harms of these tools. Participants also suggested low-tech or no-tech alternatives to PRMs to address problems in child welfare. Our study sheds light on how researchers and designers can work in solidarity with impacted communities, possibly to circumvent or oppose child welfare agencies.
HCMar 17, 2023
Understanding Frontline Workers' and Unhoused Individuals' Perspectives on AI Used in Homeless ServicesTzu-Sheng Kuo, Hong Shen, Jisoo Geum et al.
Recent years have seen growing adoption of AI-based decision-support systems (ADS) in homeless services, yet we know little about stakeholder desires and concerns surrounding their use. In this work, we aim to understand impacted stakeholders' perspectives on a deployed ADS that prioritizes scarce housing resources. We employed AI lifecycle comicboarding, an adapted version of the comicboarding method, to elicit stakeholder feedback and design ideas across various components of an AI system's design. We elicited feedback from county workers who operate the ADS daily, service providers whose work is directly impacted by the ADS, and unhoused individuals in the region. Our participants shared concerns and design suggestions around the AI system's overall objective, specific model design choices, dataset selection, and use in deployment. Our findings demonstrate that stakeholders, even without AI knowledge, can provide specific and critical feedback on an AI system's design and deployment, if empowered to do so.
HCAug 9, 2023
Organizational Bulk Email Systems: Their Role and Performance in Remote WorkRuoyan Kong, Haiyi Zhu, Joseph A. Konstan
The COVID-19 pandemic has forced many employees to work from home. Organizational bulk emails now play a critical role to reach employees with central information in this work-from-home environment. However, we know from our own recent work that organizational bulk email has problems: recipients fail to retain the bulk messages they received from the organization; recipients and senders have different opinions on which bulk messages were important; and communicators lack technology support to better target and design messages. In this position paper, first we review the prior work on evaluating, designing, and prototyping organizational communication systems. Second we review our recent findings and some research techniques we found useful in studying organizational communication. Last we propose a research agenda to study organizational communications in remote work environment and suggest some key questions and potential study directions.
HCApr 15
"I Just Don't Want My Work Being Fed Into The AI Blender": Queer Artists on Refusing and Resisting Generative AIJordan Taylor, Joel Mire, Alicia DeVrio et al. · cmu
Art-making is a collective social activity through which queer people engage in political resistance, develop identities, archive queer memory, and form community. However, in recent years, generative AI has disrupted queer artistic communities. Through 15 semi-structured interviews, we examine how queer artists are making sense of the encroachment of GenAI into their art worlds. Our findings surface significant tensions between the relationality of our participants' queer art practices and the perceived anti-relationality of GenAI development and use. We detail how our participants refuse and resist GenAI use and development in response and highlight the limited role our participants saw for GenAI within art-making, such as the queer aesthetic potential of surreal image models. Drawing on queer theory, we discuss how CSCW researchers might support queer artists by refusing dominant AI imaginaries and supporting queer world-building.
LGApr 21, 2022
A Sandbox Tool to Bias(Stress)-Test Fairness AlgorithmsNil-Jana Akpinar, Manish Nagireddy, Logan Stapleton et al.
Motivated by the growing importance of reducing unfairness in ML predictions, Fair-ML researchers have presented an extensive suite of algorithmic 'fairness-enhancing' remedies. Most existing algorithms, however, are agnostic to the sources of the observed unfairness. As a result, the literature currently lacks guiding frameworks to specify conditions under which each algorithmic intervention can potentially alleviate the underpinning cause of unfairness. To close this gap, we scrutinize the underlying biases (e.g., in the training data or design choices) that cause observational unfairness. We present the conceptual idea and a first implementation of a bias-injection sandbox tool to investigate fairness consequences of various biases and assess the effectiveness of algorithmic remedies in the presence of specific types of bias. We call this process the bias(stress)-testing of algorithmic interventions. Unlike existing toolkits, ours provides a controlled environment to counterfactually inject biases in the ML pipeline. This stylized setup offers the distinct capability of testing fairness interventions beyond observational data and against an unbiased benchmark. In particular, we can test whether a given remedy can alleviate the injected bias by comparing the predictions resulting after the intervention in the biased setting with true labels in the unbiased regime-that is, before any bias injection. We illustrate the utility of our toolkit via a proof-of-concept case study on synthetic data. Our empirical analysis showcases the type of insights that can be obtained through our simulations.
HCMar 20, 2023
Agent-based Simulation for Online Mental Health MatchingYuhan Liu, Anna Fang, Glen Moriarty et al.
Online mental health communities (OMHCs) are an effective and accessible channel to give and receive social support for individuals with mental and emotional issues. However, a key challenge on these platforms is finding suitable partners to interact with given that mechanisms to match users are currently underdeveloped. In this paper, we collaborate with one of the world's largest OMHC to develop an agent-based simulation framework and explore the trade-offs in different matching algorithms. The simulation framework allows us to compare current mechanisms and new algorithmic matching policies on the platform, and observe their differing effects on a variety of outcome metrics. Our findings include that usage of the deferred-acceptance algorithm can significantly better the experiences of support-seekers in one-on-one chats while maintaining low waiting time. We note key design considerations that agent-based modeling reveals in the OMHC context, including the potential benefits of algorithmic matching on marginalized communities.
CYApr 2
Simulating Couple Conflict: Designing A Multi-Agent System for Therapy Training and PracticeCanwen Wang, Angela Chen, Catherine Bao et al.
Couples therapy requires managing complex, evolving emotional dynamics between partners, but traditional training methods for therapists, like role-play, lack realism, consistency, and control. We present a multi-modal simulation that models therapy as a controlled, multi-agent dynamical system with structured interaction stages. Therapists practice with a pair of client-agents who go through six evolving stages that respond to therapist actions. This simulation enables practice with demand-withdraw conflict patterns in a closed-loop environment. The simulation uses a sense-plan-act architecture: it detects the therapist's input, updates agents' interaction states based on psychotherapy theory and transcript analysis, and generates realistic verbal and emotional responses. In an experiment with 21 licensed U.S. therapists, participants more accurately identified state transitions and rated the system as more realistic and responsive than a prompt-based baseline, demonstrating the value of stateful, interpretable simulation for therapist training.
HCFeb 21, 2024
Wikibench: Community-Driven Data Curation for AI Evaluation on WikipediaTzu-Sheng Kuo, Aaron Halfaker, Zirui Cheng et al.
AI tools are increasingly deployed in community contexts. However, datasets used to evaluate AI are typically created by developers and annotators outside a given community, which can yield misleading conclusions about AI performance. How might we empower communities to drive the intentional design and curation of evaluation datasets for AI that impacts them? We investigate this question on Wikipedia, an online community with multiple AI-based content moderation tools deployed. We introduce Wikibench, a system that enables communities to collaboratively curate AI evaluation datasets, while navigating ambiguities and differences in perspective through discussion. A field study on Wikipedia shows that datasets curated using Wikibench can effectively capture community consensus, disagreement, and uncertainty. Furthermore, study participants used Wikibench to shape the overall data curation process, including refining label definitions, determining data inclusion criteria, and authoring data statements. Based on our findings, we propose future directions for systems that support community-driven data curation.
HCMay 21, 2024
Studying Up Public Sector AI: How Networks of Power Relations Shape Agency Decisions Around AI Design and UseAnna Kawakami, Amanda Coston, Hoda Heidari et al. · cmu
As public sector agencies rapidly introduce new AI tools in high-stakes domains like social services, it becomes critical to understand how decisions to adopt these tools are made in practice. We borrow from the anthropological practice to ``study up'' those in positions of power, and reorient our study of public sector AI around those who have the power and responsibility to make decisions about the role that AI tools will play in their agency. Through semi-structured interviews and design activities with 16 agency decision-makers, we examine how decisions about AI design and adoption are influenced by their interactions with and assumptions about other actors within these agencies (e.g., frontline workers and agency leaders), as well as those above (legal systems and contracted companies), and below (impacted communities). By centering these networks of power relations, our findings shed light on how infrastructural, legal, and social factors create barriers and disincentives to the involvement of a broader range of stakeholders in decisions about AI design and adoption. Agency decision-makers desired more practical support for stakeholder involvement around public sector AI to help overcome the knowledge and power differentials they perceived between them and other stakeholders (e.g., frontline workers and impacted community members). Building on these findings, we discuss implications for future research and policy around actualizing participatory AI approaches in public sector contexts.
CLMay 31, 2025
Adaptive-VP: A Framework for LLM-Based Virtual Patients that Adapts to Trainees' Dialogue to Facilitate Nurse Communication TrainingKeyeun Lee, Seolhee Lee, Esther Hehsun Kim et al.
Effective communication training is essential to preparing nurses for high-quality patient care. While standardized patient (SP) simulations provide valuable experiential learning, they are often costly and inflexible. Virtual patient (VP) systems offer a scalable alternative, but most fail to adapt to the varying communication skills of trainees. In particular, when trainees respond ineffectively, VPs should escalate in hostility or become uncooperative--yet this level of adaptive interaction remains largely unsupported. To address this gap, we introduce Adaptive-VP, a VP dialogue generation framework that leverages large language models (LLMs) to dynamically adapt VP behavior based on trainee input. The framework features a pipeline for constructing clinically grounded yet flexible VP scenarios and a modular system for assessing trainee communication and adjusting VP responses in real time, while ensuring learner safety. We validated Adaptive-VP by simulating challenging patient conversations. Automated evaluation using a corpus from practicing nurses showed that our communication skill evaluation mechanism reflected real-world proficiency levels. Expert nurses further confirmed that Adaptive-VP produced more natural and realistic interactions than existing approaches, demonstrating its potential as a scalable and effective tool for nursing communication training.
HCMar 12, 2025
Un-Straightening Generative AI: How Queer Artists Surface and Challenge the Normativity of Generative AI ModelsJordan Taylor, Joel Mire, Franchesca Spektor et al. · allen-ai, cmu
Queer people are often discussed as targets of bias, harm, or discrimination in research on generative AI. However, the specific ways that queer people engage with generative AI, and thus possible uses that support queer people, have yet to be explored. We conducted a workshop study with 13 queer artists, during which we gave participants access to GPT-4 and DALL-E 3 and facilitated group sensemaking activities. We found our participants struggled to use these models due to various normative values embedded in their designs, such as hyper-positivity and anti-sexuality. We describe various strategies our participants developed to overcome these models' limitations and how, nevertheless, our participants found value in these highly-normative technologies. Drawing on queer feminist theory, we discuss implications for the conceptualization of "state-of-the-art" models and consider how FAccT researchers might support queer alternatives.
CVApr 18, 2025
POET: Supporting Prompting Creativity and Personalization with Automated Expansion of Text-to-Image GenerationEvans Xu Han, Alice Qian Zhang, Haiyi Zhu et al.
State-of-the-art visual generative AI tools hold immense potential to assist users in the early ideation stages of creative tasks -- offering the ability to generate (rather than search for) novel and unprecedented (instead of existing) images of considerable quality that also adhere to boundless combinations of user specifications. However, many large-scale text-to-image systems are designed for broad applicability, yielding conventional output that may limit creative exploration. They also employ interaction methods that may be difficult for beginners. Given that creative end users often operate in diverse, context-specific ways that are often unpredictable, more variation and personalization are necessary. We introduce POET, a real-time interactive tool that (1) automatically discovers dimensions of homogeneity in text-to-image generative models, (2) expands these dimensions to diversify the output space of generated images, and (3) learns from user feedback to personalize expansions. An evaluation with 28 users spanning four creative task domains demonstrated POET's ability to generate results with higher perceived diversity and help users reach satisfaction in fewer prompts during creative tasks, thereby prompting them to deliberate and reflect more on a wider range of possible produced results during the co-creative process. Focusing on visual creativity, POET offers a first glimpse of how interaction techniques of future text-to-image generation tools may support and align with more pluralistic values and the needs of end users during the ideation stages of their work.
HCJan 14
The Algorithmic Gaze: An Audit and Ethnography of the LAION-Aesthetics Predictor ModelJordan Taylor, William Agnew, Maarten Sap et al.
Visual generative AI models are trained using a one-size-fits-all measure of aesthetic appeal. However, what is deemed "aesthetic" is inextricably linked to personal taste and cultural values, raising the question of whose taste is represented in visual generative AI models. In this work, we study an aesthetic evaluation model--LAION Aesthetic Predictor (LAP)--that is widely used to curate datasets to train visual generative image models, like Stable Diffusion, and evaluate the quality of AI-generated images. To understand what LAP measures, we audited the model across three datasets. First, we examined the impact of aesthetic filtering on the LAION-Aesthetics Dataset (approximately 1.2B images), which was curated from LAION-5B using LAP. We find that the LAP disproportionally filters in images with captions mentioning women, while filtering out images with captions mentioning men or LGBTQ+ people. Then, we used LAP to score approximately 330k images across two art datasets, finding the model rates realistic images of landscapes, cityscapes, and portraits from western and Japanese artists most highly. In doing so, the algorithmic gaze of this aesthetic evaluation model reinforces the imperial and male gazes found within western art history. In order to understand where these biases may have originated, we performed a digital ethnography of public materials related to the creation of LAP. We find that the development of LAP reflects the biases we found in our audits, such as the aesthetic scores used to train LAP primarily coming from English-speaking photographers and western AI-enthusiasts. In response, we discuss how aesthetic evaluation can perpetuate representational harms and call on AI developers to shift away from prescriptive measures of "aesthetics" toward more pluralistic evaluation.
HCMay 30, 2023
Seeing Seeds Beyond Weeds: Green Teaming Generative AI for Beneficial UsesLogan Stapleton, Jordan Taylor, Sarah Fox et al.
Large generative AI models (GMs) like GPT and DALL-E are trained to generate content for general, wide-ranging purposes. GM content filters are generalized to filter out content which has a risk of harm in many cases, e.g., hate speech. However, prohibited content is not always harmful -- there are instances where generating prohibited content can be beneficial. So, when GMs filter out content, they preclude beneficial use cases along with harmful ones. Which use cases are precluded reflects the values embedded in GM content filtering. Recent work on red teaming proposes methods to bypass GM content filters to generate harmful content. We coin the term green teaming to describe methods of bypassing GM content filters to design for beneficial use cases. We showcase green teaming by: 1) Using ChatGPT as a virtual patient to simulate a person experiencing suicidal ideation, for suicide support training; 2) Using Codex to intentionally generate buggy solutions to train students on debugging; and 3) Examining an Instagram page using Midjourney to generate images of anti-LGBTQ+ politicians in drag. Finally, we discuss how our use cases demonstrate green teaming as both a practical design method and a mode of critique, which problematizes and subverts current understandings of harms and values in generative AI.
HCFeb 1, 2021
Soliciting Stakeholders' Fairness Notions in Child Maltreatment Predictive SystemsHao-Fei Cheng, Logan Stapleton, Ruiqi Wang et al.
Recent work in fair machine learning has proposed dozens of technical definitions of algorithmic fairness and methods for enforcing these definitions. However, we still lack an understanding of how to develop machine learning systems with fairness criteria that reflect relevant stakeholders' nuanced viewpoints in real-world contexts. To address this gap, we propose a framework for eliciting stakeholders' subjective fairness notions. Combining a user interface that allows stakeholders to examine the data and the algorithm's predictions with an interview protocol to probe stakeholders' thoughts while they are interacting with the interface, we can identify stakeholders' fairness beliefs and principles. We conduct a user study to evaluate our framework in the setting of a child maltreatment predictive system. Our evaluations show that the framework allows stakeholders to comprehensively convey their fairness viewpoints. We also discuss how our results can inform the design of predictive systems.
HCJan 4, 2021
"Brilliant AI Doctor" in Rural China: Tensions and Challenges in AI-Powered CDSS DeploymentDakuo Wang, Liuping Wang, Zhan Zhang et al.
Artificial intelligence (AI) technology has been increasingly used in the implementation of advanced Clinical Decision Support Systems (CDSS). Research demonstrated the potential usefulness of AI-powered CDSS (AI-CDSS) in clinical decision making scenarios. However, post-adoption user perception and experience remain understudied, especially in developing countries. Through observations and interviews with 22 clinicians from 6 rural clinics in China, this paper reports the various tensions between the design of an AI-CDSS system ("Brilliant Doctor") and the rural clinical context, such as the misalignment with local context and workflow, the technical limitations and usability barriers, as well as issues related to transparency and trustworthiness of AI-CDSS. Despite these tensions, all participants expressed positive attitudes toward the future of AI-CDSS, especially acting as "a doctor's AI assistant" to realize a Human-AI Collaboration future in clinical settings. Finally we draw on our findings to discuss implications for designing AI-CDSS interventions for rural clinical contexts in developing countries.
CYOct 22, 2020
Value Cards: An Educational Toolkit for Teaching Social Impacts of Machine Learning through DeliberationHong Shen, Wesley Hanwen Deng, Aditi Chattopadhyay et al.
Recently, there have been increasing calls for computer science curricula to complement existing technical training with topics related to Fairness, Accountability, Transparency, and Ethics. In this paper, we present Value Card, an educational toolkit to inform students and practitioners of the social impacts of different machine learning models via deliberation. This paper presents an early use of our approach in a college-level computer science course. Through an in-class activity, we report empirical data for the initial effectiveness of our approach. Our results suggest that the use of the Value Cards toolkit can improve students' understanding of both the technical definitions and trade-offs of performance metrics and apply them in real-world contexts, help them recognize the significance of considering diverse social values in the development of deployment of algorithmic systems, and enable them to communicate, negotiate and synthesize the perspectives of diverse stakeholders. Our study also demonstrates a number of caveats we need to consider when using the different variants of the Value Cards toolkit. Finally, we discuss the challenges as well as future applications of our approach.
HCJun 30, 2020
Learning to Ignore: A Case Study of Organization-Wide Bulk Email EffectivenessRuoyan Kong, Haiyi Zhu, Joseph A. Konstan
Bulk email is a primary communication channel within organizations, with all-company emails and regular newsletters serving as a mechanism for making employees aware of policies and events. Ineffective communication could result in wasted employee time and a lack of compliance or awareness. Previous studies on organizational emails focused mostly on recipients. However, organizational bulk email system is a multi-stakeholder problem including recipients, communicators, and the organization itself. We studied the effectiveness, practice, and assessments of the organizational bulk email system of a large university from multi-stakeholders' perspectives. We conducted a qualitative study with the university's communicators, recipients, and managers. We delved into the organizational bulk email's distributing mechanisms of the communicators, the reading behaviors of recipients, and the perspectives on emails' values of communicators, managers, and recipients. We found that the organizational bulk email system as a whole was strained, and communicators are caught in the middle of this multi-stakeholder problem. First, though the communicators had an interest in preserving the effectiveness of channels in reaching employees, they had high-level clients whose interests might outweigh judgment about whether a message deserves widespread circulation. Second, though communicators thought they were sending important information, recipients viewed most of the organizational bulk emails as not relevant to them. Third, this disagreement was amplified by the success metric used by communicators. They viewed their bulk emails as successful if they had a high open rate. But recipients often opened and then rapidly discarded emails without reading the details. Last, while the communicators in general understood the challenge, they had a limited set of targeting and feedback tools to support their task.
HCJan 27, 2020
Factors Influencing Perceived Fairness in Algorithmic Decision-Making: Algorithm Outcomes, Development Procedures, and Individual DifferencesRuotong Wang, F. Maxwell Harper, Haiyi Zhu
Algorithmic decision-making systems are increasingly used throughout the public and private sectors to make important decisions or assist humans in making these decisions with real social consequences. While there has been substantial research in recent years to build fair decision-making algorithms, there has been less research seeking to understand the factors that affect people's perceptions of fairness in these systems, which we argue is also important for their broader acceptance. In this research, we conduct an online experiment to better understand perceptions of fairness, focusing on three sets of factors: algorithm outcomes, algorithm development and deployment procedures, and individual differences. We find that people rate the algorithm as more fair when the algorithm predicts in their favor, even surpassing the negative effects of describing algorithms that are very biased against particular demographic groups. We find that this effect is moderated by several variables, including participants' education level, gender, and several aspects of the development procedure. Our findings suggest that systems that evaluate algorithmic fairness through users' feedback must consider the possibility of outcome favorability bias.
HCJan 14, 2020
Disseminating Research News in HCI: Perceived Hazards, How-To's, and Opportunities for InnovationC. Estelle Smith, Eduardo Nevarez, Haiyi Zhu
Mass media afford researchers critical opportunities to disseminate research findings and trends to the general public. Yet researchers also perceive that their work can be miscommunicated in mass media, thus generating unintended understandings of HCI research by the general public. We conduct a Grounded Theory analysis of interviews with 12 HCI researchers and find that miscommunication can occur at four origins along the socio-technical infrastructure known as the Media Production Pipeline (MPP) for science news. Results yield researchers' perceived hazards of disseminating their work through mass media, as well as strategies for fostering effective communication of research. We conclude with implications for augmenting or innovating new MPP technologies.
HCJan 14, 2020
Keeping Community in the Loop: Understanding Wikipedia Stakeholder Values for Machine Learning-Based SystemsC. Estelle Smith, Bowen Yu, Anjali Srivastava et al.
On Wikipedia, sophisticated algorithmic tools are used to assess the quality of edits and take corrective actions. However, algorithms can fail to solve the problems they were designed for if they conflict with the values of communities who use them. In this study, we take a Value-Sensitive Algorithm Design approach to understanding a community-created and -maintained machine learning-based algorithm called the Objective Revision Evaluation System (ORES)---a quality prediction system used in numerous Wikipedia applications and contexts. Five major values converged across stakeholder groups that ORES (and its dependent applications) should: (1) reduce the effort of community maintenance, (2) maintain human judgement as the final authority, (3) support differing peoples' differing workflows, (4) encourage positive engagement with diverse editor groups, and (5) establish trustworthiness of people and algorithms within the community. We reveal tensions between these values and discuss implications for future research to improve algorithms like ORES.
HCOct 7, 2019
Keeping Designers in the Loop: Communicating Inherent Algorithmic Trade-offs Across Multiple ObjectivesBowen Yu, Ye Yuan, Loren Terveen et al.
Artificial intelligence algorithms have been used to enhance a wide variety of products and services, including assisting human decision making in high-stakes contexts. However, these algorithms are complex and have trade-offs, notably between prediction accuracy and fairness to population subgroups. This makes it hard for designers to understand algorithms and design products or services in a way that respects users' goals, values, and needs. We proposed a method to help designers and users explore algorithms, visualize their trade-offs, and select algorithms with trade-offs consistent with their goals and needs. We evaluated our method on the problem of predicting criminal defendants' likelihood to re-offend through (i) a large-scale Amazon Mechanical Turk experiment, and (ii) in-depth interviews with domain experts. Our evaluations show that our method can help designers and users of these systems better understand and navigate algorithmic trade-offs. This paper contributes a new way of providing designers the ability to understand and control the outcomes of algorithmic systems they are creating.
HCDec 20, 2018
Towards Value-Sensitive Learning Analytics DesignBodong Chen, Haiyi Zhu
To support ethical considerations and system integrity in learning analytics, this paper introduces two cases of applying the Value Sensitive Design methodology to learning analytics design. The first study applied two methods of Value Sensitive Design, namely stakeholder analysis and value analysis, to a conceptual investigation of an existing learning analytics tool. This investigation uncovered a number of values and value tensions, leading to design trade-offs to be considered in future tool refinements. The second study holistically applied Value Sensitive Design to the design of a recommendation system for the Wikipedia WikiProjects. To proactively consider values among stakeholders, we derived a multi-stage design process that included literature analysis, empirical investigations, prototype development, community engagement, iterative testing and refinement, and continuous evaluation. By reporting on these two cases, this paper responds to a need of practical means to support ethical considerations and human values in learning analytics systems. These two cases demonstrate that Value Sensitive Design could be a viable approach for balancing a wide range of human values, which tend to encompass and surpass ethical issues, in learning analytics design.