Amy X. Zhang

h-index57

29papers

1,304citations

Novelty44%

AI Score55

Ranked #24,894 of 201,326 authors (top 12%)#45 in HC (top 2%)

29 Papers

HCMar 25, 2023

The Semantic Reader Project: Augmenting Scholarly Documents through AI-Powered Interactive Reading Interfaces

Kyle Lo, Joseph Chee Chang, Andrew Head et al. · allen-ai, cmu

Scholarly publications are key to the transfer of knowledge from scholars to others. However, research papers are information-dense, and as the volume of the scientific literature grows, the need for new technology to support the reading process grows. In contrast to the process of finding papers, which has been transformed by Internet technology, the experience of reading research papers has changed little in decades. The PDF format for sharing research papers is widely used due to its portability, but it has significant downsides including: static content, poor accessibility for low-vision readers, and difficulty reading on mobile devices. This paper explores the question "Can recent advances in AI and HCI power intelligent, interactive, and accessible reading interfaces -- even for legacy PDFs?" We describe the Semantic Reader Project, a collaborative effort across multiple institutions to explore automatic creation of dynamic reading interfaces for research papers. Through this project, we've developed ten research prototype interfaces and conducted usability studies with more than 300 participants and real-world users showing improved reading experiences for scholars. We've also released a production reading interface for research papers that will incorporate the best features as they mature. We structure this paper around challenges scholars and the public face when reading research papers -- Discovery, Efficiency, Comprehension, Synthesis, and Accessibility -- and present an overview of our progress and remaining open challenges.

63.3SIApr 21

How Conversational Structure and Style Shape Online Community Experiences

Galen Weld, Carl Pearson, Bradley Spahn et al. · uw

Sense of Community (SOC) is vital to individual and collective well-being. Although social interactions have moved increasingly online, still little is known about the specific relationships between the nature of these interactions and Sense of Virtual Community (SOVC). This study addresses this gap by exploring how conversational structure and linguistic style predict SOVC in online communities, using a large-scale survey of 2,826 Reddit users across 281 varied subreddits. We develop a hierarchical model to predict self-reported SOVC based on automatically quantifiable and highly generalizable features that are agnostic to community topic and that describe both individual users and entire communities. We identify specific interaction patterns (e.g., reciprocal reply chains, use of prosocial language) associated with stronger communities and identify three primary dimensions of SOVC within Reddit -- Membership & Belonging, Cooperation & Shared Values, and Connection & Influence. This study provides the first quantitative evidence linking patterns of social interaction to SOVC and highlights actionable strategies for fostering stronger community attachment, using an approach that can generalize readily across community topics, languages, and platforms. These insights offer theoretical implications for the study of online communities and practical suggestions for the design of features to help more individuals experience the positive benefits of online community participation.

CVOct 22, 2023

Semantic and Expressive Variation in Image Captions Across Languages

Andre Ye, Sebastin Santy, Jena D. Hwang et al. · uw

Computer vision often treats human perception as homogeneous: an implicit assumption that visual stimuli are perceived similarly by everyone. This assumption is reflected in the way researchers collect datasets and train vision models. By contrast, literature in cross-cultural psychology and linguistics has provided evidence that people from different cultural backgrounds observe vastly different concepts even when viewing the same visual stimuli. In this paper, we study how these differences manifest themselves in vision-language datasets and models, using language as a proxy for culture. By comparing textual descriptions generated across 7 languages for the same images, we find significant differences in the semantic content and linguistic expression. When datasets are multilingual as opposed to monolingual, descriptions have higher semantic coverage on average, where coverage is measured using scene graphs, model embeddings, and linguistic taxonomies. For example, multilingual descriptions have on average 29.9% more objects, 24.5% more relations, and 46.0% more attributes than a set of monolingual captions. When prompted to describe images in different languages, popular models (e.g. LLaVA) inherit this bias and describe different parts of the image. Moreover, finetuning models on captions from one language performs best on corresponding test data from that language, while finetuning on multilingual data performs consistently well across all test data compositions. Our work points towards the need to account for and embrace the diversity of human perception in the computer vision community.

AINov 18, 2023

Case Repositories: Towards Case-Based Reasoning for AI Alignment

K. J. Kevin Feng, Quan Ze Chen, Inyoung Cheong et al. · uw

Case studies commonly form the pedagogical backbone in law, ethics, and many other domains that face complex and ambiguous societal questions informed by human values. Similar complexities and ambiguities arise when we consider how AI should be aligned in practice: when faced with vast quantities of diverse (and sometimes conflicting) values from different individuals and communities, with whose values is AI to align, and how should AI do so? We propose a complementary approach to constitutional AI alignment, grounded in ideas from case-based reasoning (CBR), that focuses on the construction of policies through judgments on a set of cases. We present a process to assemble such a case repository by: 1) gathering a set of ``seed'' cases -- questions one may ask an AI system -- in a particular domain, 2) eliciting domain-specific key dimensions for cases through workshops with domain experts, 3) using LLMs to generate variations of cases not seen in the wild, and 4) engaging with the public to judge and improve cases. We then discuss how such a case repository could assist in AI alignment, both through directly acting as precedents to ground acceptable behaviors, and as a medium for individuals and communities to engage in moral reasoning around AI.

CYNov 30, 2025

On the Regulatory Potential of User Interfaces for AI Agent Governance

K. J. Kevin Feng, Tae Soo Kim, Rock Yuren Pang et al. · allen-ai

AI agents that take actions in their environment autonomously over extended time horizons require robust governance interventions to curb their potentially consequential risks. Prior proposals for governing AI agents primarily target system-level safeguards (e.g., prompt injection monitors) or agent infrastructure (e.g., agent IDs). In this work, we explore a complementary approach: regulating user interfaces of AI agents as a way of enforcing transparency and behavioral requirements that then demand changes at the system and/or infrastructure levels. Specifically, we analyze 22 existing agentic systems to identify UI elements that play key roles in human-agent interaction and communication. We then synthesize those elements into six high-level interaction design patterns that hold regulatory potential (e.g., requiring agent memory to be editable). We conclude with policy recommendations based on our analysis. Our work exposes a new surface for regulatory action that supplements previous proposals for practical AI agent governance.

91.2HCMay 12

Hedwig: Dynamic Autonomy for Coding Agents Under Local Oversight

Tanjal Shukla, K. J. Kevin Feng, Leijie Wang et al.

Despite coding agents' advances in handling increasingly complex tasks, their continued tendency to introduce unintended edits, subtle bugs, and scope drift that slip past code review means developers must still decide how much autonomy to grant them. However, existing approaches for setting an agent's level of autonomy, such as static permission settings or instruction files, cannot account for how developers' preferences for agent autonomy can shift across tasks and over time. We conducted a formative survey with 21 software engineers who use coding agents and found that they experience frustration with calibrating autonomy and have evolving preferences for level of oversight. Building on these insights, we present Hedwig, a CLI coding agent that dynamically adjusts its autonomy level based on developer-agent interactions across sessions. Rather than operating on a global, fixed autonomy configuration, Hedwig learns an evolving set of behavioral guidelines from developer decisions and feedback, reducing friction on work for which the agent has earned trust, while tightening oversight when the agent operates outside familiar territory. Hedwig demonstrates the potential of a new paradigm where agents intelligently adapt their level of autonomy based on user trust through active, longitudinal collaboration.

CYFeb 2, 2024

(A)I Am Not a Lawyer, But...: Engaging Legal Experts towards Responsible LLM Policies for Legal Advice

Inyoung Cheong, King Xia, K. J. Kevin Feng et al. · uw

Large language models (LLMs) are increasingly capable of providing users with advice in a wide range of professional domains, including legal advice. However, relying on LLMs for legal queries raises concerns due to the significant expertise required and the potential real-world consequences of the advice. To explore \textit{when} and \textit{why} LLMs should or should not provide advice to users, we conducted workshops with 20 legal experts using methods inspired by case-based reasoning. The provided realistic queries ("cases") allowed experts to examine granular, situation-specific concerns and overarching technical and legal constraints, producing a concrete set of contextual considerations for LLM developers. By synthesizing the factors that impacted LLM response appropriateness, we present a 4-dimension framework: (1) User attributes and behaviors, (2) Nature of queries, (3) AI capabilities, and (4) Social impacts. We share experts' recommendations for LLM response strategies, which center around helping users identify `right questions to ask' and relevant information rather than providing definitive legal judgments. Our findings reveal novel legal considerations, such as unauthorized practice of law, confidentiality, and liability for inaccurate advice, that have been overlooked in the literature. The case-based deliberation method enabled us to elicit fine-grained, practice-informed insights that surpass those from de-contextualized surveys or speculative principles. These findings underscore the applicability of our method for translating domain-specific professional knowledge and practices into policies that can guide LLM behavior in a more responsible direction.

CLOct 30, 2024

LLMs as Research Tools: A Large Scale Survey of Researchers' Usage and Perceptions

Zhehui Liao, Maria Antoniak, Inyoung Cheong et al.

The rise of large language models (LLMs) has led many researchers to consider their usage for scientific work. Some have found benefits using LLMs to augment or automate aspects of their research pipeline, while others have urged caution due to risks and ethical concerns. Yet little work has sought to quantify and characterize how researchers use LLMs and why. We present the first large-scale survey of 816 verified research article authors to understand how the research community leverages and perceives LLMs as research tools. We examine participants' self-reported LLM usage, finding that 81% of researchers have already incorporated LLMs into different aspects of their research workflow. We also find that traditionally disadvantaged groups in academia (non-White, junior, and non-native English speaking researchers) report higher LLM usage and perceived benefits, suggesting potential for improved research equity. However, women, non-binary, and senior researchers have greater ethical concerns, potentially hindering adoption.

HCApr 6, 2024

Language Models as Critical Thinking Tools: A Case Study of Philosophers

Andre Ye, Jared Moore, Rose Novick et al. · uw

Current work in language models (LMs) helps us speed up or even skip thinking by accelerating and automating cognitive work. But can LMs help us with critical thinking -- thinking in deeper, more reflective ways which challenge assumptions, clarify ideas, and engineer new concepts? We treat philosophy as a case study in critical thinking, and interview 21 professional philosophers about how they engage in critical thinking and on their experiences with LMs. We find that philosophers do not find LMs to be useful because they lack a sense of selfhood (memory, beliefs, consistency) and initiative (curiosity, proactivity). We propose the selfhood-initiative model for critical thinking tools to characterize this gap. Using the model, we formulate three roles LMs could play as critical thinking tools: the Interlocutor, the Monitor, and the Respondent. We hope that our work inspires LM researchers to further develop LMs as critical thinking tools and philosophers and other 'critical thinkers' to imagine intellectually substantive uses of LMs.

HCJun 14, 2025

Levels of Autonomy for AI Agents

K. J. Kevin Feng, David W. McDonald, Amy X. Zhang

Autonomy is a double-edged sword for AI agents, simultaneously unlocking transformative possibilities and serious risks. How can agent developers calibrate the appropriate levels of autonomy at which their agents should operate? We argue that an agent's level of autonomy can be treated as a deliberate design decision, separate from its capability and operational environment. In this work, we define five levels of escalating agent autonomy, characterized by the roles a user can take when interacting with an agent: operator, collaborator, consultant, approver, and observer. Within each level, we describe the ways by which a user can exert control over the agent and open questions for how to design the nature of user-agent interaction. We then highlight a potential application of our framework towards AI autonomy certificates to govern agent behavior in single- and multi-agent systems. We conclude by proposing early ideas for evaluating agents' autonomy. Our work aims to contribute meaningful, practical steps towards responsibly deployed and useful AI agents in the real world.

CLMar 17, 2024

Correcting misinformation on social media with a large language model

Xinyi Zhou, Ashish Sharma, Amy X. Zhang et al. · uw

Real-world misinformation, often multimodal, can be partially or fully factual but misleading using diverse tactics like conflating correlation with causation. Such misinformation is severely understudied, challenging to address, and harms various social domains, particularly on social media, where it can spread rapidly. High-quality and timely correction of misinformation that identifies and explains its (in)accuracies effectively reduces false beliefs. Despite the wide acceptance of manual correction, it is difficult to be timely and scalable. While LLMs have versatile capabilities that could accelerate misinformation correction, they struggle due to a lack of recent information, a tendency to produce false content, and limitations in addressing multimodal information. We propose MUSE, an LLM augmented with access to and credibility evaluation of up-to-date information. By retrieving evidence as refutations or supporting context, MUSE identifies and explains content (in)accuracies with references. It conducts multimodal retrieval and interprets visual content to verify and correct multimodal content. Given the absence of a comprehensive evaluation approach, we propose 13 dimensions of misinformation correction quality. Then, fact-checking experts evaluate responses to social media content that are not presupposed to be misinformation but broadly include (partially) incorrect and correct posts that may (not) be misleading. Results demonstrate MUSE's ability to write high-quality responses to potential misinformation--across modalities, tactics, domains, political leanings, and for information that has not previously been fact-checked online--within minutes of its appearance on social media. Overall, MUSE outperforms GPT-4 by 37% and even high-quality responses from laypeople by 29%. Our work provides a general methodological and evaluative framework to correct misinformation at scale.

CLFeb 10

Are Language Models Sensitive to Morally Irrelevant Distractors?

Andrew Shaw, Christina Hahn, Catherine Rasgaitis et al.

With the rapid development and uptake of large language models (LLMs) across high-stakes settings, it is increasingly important to ensure that LLMs behave in ways that align with human values. Existing moral benchmarks prompt LLMs with value statements, moral scenarios, or psychological questionnaires, with the implicit underlying assumption that LLMs report somewhat stable moral preferences. However, moral psychology research has shown that human moral judgements are sensitive to morally irrelevant situational factors, such as smelling cinnamon rolls or the level of ambient noise, thereby challenging moral theories that assume the stability of human moral judgements. Here, we draw inspiration from this "situationist" view of moral psychology to evaluate whether LLMs exhibit similar cognitive moral biases to humans. We curate a novel multimodal dataset of 60 "moral distractors" from existing psychological datasets of emotionally-valenced images and narratives which have no moral relevance to the situation presented. After injecting these distractors into existing moral benchmarks to measure their effects on LLM responses, we find that moral distractors can shift the moral judgements of LLMs by over 30% even in low-ambiguity scenarios, highlighting the need for more contextual moral evaluations and more nuanced cognitive moral modeling of LLMs.

CLNov 16, 2024

SPICA: Retrieving Scenarios for Pluralistic In-Context Alignment

Quan Ze Chen, K. J. Kevin Feng, Chan Young Park et al.

When different groups' values differ, one approach to model alignment is to steer models at inference time towards each group's preferences. However, techniques like in-context learning only consider similarity when drawing few-shot examples and not cross-group differences in values. We propose SPICA, a framework that accounts for group-level differences during in-context example retrieval. SPICA introduces three designs: scenario banks, group-informed retrieval metrics, and in-context alignment prompts. From an evaluation of SPICA on an alignment task collecting inputs from four demographic groups ($n = 544$), our metrics retrieve in-context examples that more closely match observed preferences, with the best prompt configuration using multiple contrastive responses to demonstrate examples. In an end-to-end evaluation ($n = 120$), we observe that SPICA is higher rated than similarity-based retrieval, with groups seeing up to a +0.16 point improvement on a 5 point scale. Additionally, gains from SPICA were more uniform, with all groups benefiting from alignment rather than only some. Finally, we find that while a group-agnostic approach can align to aggregated values, it is not most suited for divergent groups.

HCNov 15, 2024

Chain of Alignment: Integrating Public Will with Expert Intelligence for Language Model Alignment

Andrew Konya, Aviv Ovadya, Kevin Feng et al.

We introduce a method to measure the alignment between public will and language model (LM) behavior that can be applied to fine-tuning, online oversight, and pre-release safety checks. Our `chain of alignment' (CoA) approach produces a rule based reward (RBR) by creating model behavior $\textit{rules}$ aligned to normative $\textit{objectives}$ aligned to $\textit{public will}$. This factoring enables a nonexpert public to directly specify their will through the normative objectives, while expert intelligence is used to figure out rules entailing model behavior that best achieves those objectives. We validate our approach by applying it across three different domains of LM prompts related to mental health. We demonstrate a public input process built on collective dialogues and bridging-based ranking that reliably produces normative objectives supported by at least $96\% \pm 2\%$ of the US public. We then show that rules developed by mental health experts to achieve those objectives enable a RBR that evaluates an LM response's alignment with the objectives similarly to human experts (Pearson's $r=0.841$, $AUC=0.964$). By measuring alignment with objectives that have near unanimous public support, these CoA RBRs provide an approximate measure of alignment between LM behavior and public will.

HCSep 24, 2025

PolicyPad: Collaborative Prototyping of LLM Policies

K. J. Kevin Feng, Tzu-Sheng Kuo, Quan Ze et al.

As LLMs gain adoption in high-stakes domains like mental health, domain experts are increasingly consulted to provide input into policies governing their behavior. From an observation of 19 policymaking workshops with 9 experts over 15 weeks, we identified opportunities to better support rapid experimentation, feedback, and iteration for collaborative policy design processes. We present PolicyPad, an interactive system that facilitates the emerging practice of LLM policy prototyping by drawing from established UX prototyping practices, including heuristic evaluation and storyboarding. Using PolicyPad, policy designers can collaborate on drafting a policy in real time while independently testing policy-informed model behavior with usage scenarios. We evaluate PolicyPad through workshops with 8 groups of 22 domain experts in mental health and law, finding that PolicyPad enhanced collaborative dynamics during policy design, enabled tight feedback loops, and led to novel policy contributions. Overall, our work paves participatory paths for advancing AI alignment and safety.

CYJul 2, 2025

Penalizing Transparency? How AI Disclosure and Author Demographics Shape Human and AI Judgments About Writing

Inyoung Cheong, Alicia Guo, Mina Lee et al.

As AI integrates in various types of human writing, calls for transparency around AI assistance are growing. However, if transparency operates on uneven ground and certain identity groups bear a heavier cost for being honest, then the burden of openness becomes asymmetrical. This study investigates how AI disclosure statement affects perceptions of writing quality, and whether these effects vary by the author's race and gender. Through a large-scale controlled experiment, both human raters (n = 1,970) and LLM raters (n = 2,520) evaluated a single human-written news article while disclosure statements and author demographics were systematically varied. This approach reflects how both human and algorithmic decisions now influence access to opportunities (e.g., hiring, promotion) and social recognition (e.g., content recommendation algorithms). We find that both human and LLM raters consistently penalize disclosed AI use. However, only LLM raters exhibit demographic interaction effects: they favor articles attributed to women or Black authors when no disclosure is present. But these advantages disappear when AI assistance is revealed. These findings illuminate the complex relationships between AI disclosure and author identity, highlighting disparities between machine and human evaluation patterns.

CVMay 15, 2023

Skin Deep: Investigating Subjectivity in Skin Tone Annotations for Computer Vision Benchmark Datasets

Teanna Barrett, Quan Ze Chen, Amy X. Zhang

To investigate the well-observed racial disparities in computer vision systems that analyze images of humans, researchers have turned to skin tone as more objective annotation than race metadata for fairness performance evaluations. However, the current state of skin tone annotation procedures is highly varied. For instance, researchers use a range of untested scales and skin tone categories, have unclear annotation procedures, and provide inadequate analyses of uncertainty. In addition, little attention is paid to the positionality of the humans involved in the annotation process--both designers and annotators alike--and the historical and sociological context of skin tone in the United States. Our work is the first to investigate the skin tone annotation process as a sociotechnical project. We surveyed recent skin tone annotation procedures and conducted annotation experiments to examine how subjective understandings of skin tone are embedded in skin tone annotation procedures. Our systematic literature review revealed the uninterrogated association between skin tone and race and the limited effort to analyze annotator uncertainty in current procedures for skin tone annotation in computer vision evaluation. Our experiments demonstrated that design decisions in the annotation procedure such as the order in which the skin tone scale is presented or additional context in the image (i.e., presence of a face) significantly affected the resulting inter-annotator agreement and individual uncertainty of skin tone annotations. We call for greater reflexivity in the design, analysis, and documentation of procedures for evaluation using skin tone.

HCFeb 13, 2022

Comparing the Perceived Legitimacy of Content Moderation Processes: Contractors, Algorithms, Expert Panels, and Digital Juries

Christina A. Pan, Sahil Yakhmi, Tara P. Iyer et al.

While research continues to investigate and improve the accuracy, fairness, and normative appropriateness of content moderation processes on large social media platforms, even the best process cannot be effective if users reject its authority as illegitimate. We present a survey experiment comparing the perceived institutional legitimacy of four popular content moderation processes. We conducted a within-subjects experiment in which we showed US Facebook users moderation decisions and randomized the description of whether those decisions were made by paid contractors, algorithms, expert panels, or juries of users. Prior work suggests that juries will have the highest perceived legitimacy due to the benefits of judicial independence and democratic representation. However, expert panels had greater perceived legitimacy than algorithms or juries. Moreover, outcome alignment - agreement with the decision - played a larger role than process in determining perceived legitimacy. These results suggest benefits to incorporating expert oversight in content moderation and underscore that any process will face legitimacy challenges derived from disagreement about outcomes.

SINov 10, 2021

What Makes Online Communities 'Better'? Measuring Values, Consensus, and Conflict across Thousands of Subreddits

Galen Weld, Amy X. Zhang, Tim Althoff

Making online social communities 'better' is a challenging undertaking, as online communities are extraordinarily varied in their size, topical focus, and governance. As such, what is valued by one community may not be valued by another. However, community values are challenging to measure as they are rarely explicitly stated. In this work, we measure community values through the first large-scale survey of community values, including 2,769 reddit users in 2,151 unique subreddits. Through a combination of survey responses and a quantitative analysis of public reddit data, we characterize how these values vary within and across communities. Amongst other findings, we show that community members disagree about how safe their communities are, that longstanding communities place 30.1% more importance on trustworthiness than newer communities, and that community moderators want their communities to be 56.7% less democratic than non-moderator community members. These findings have important implications, including suggesting that care must be taken to protect vulnerable community members, and that participatory governance strategies may be difficult to implement. Accurate and scalable modeling of community values enables research and governance which is tuned to each community's different values. To this end, we demonstrate that a small number of automatically quantifiable features capture a significant yet limited amount of the variation in values between communities with a ROC AUC of 0.667 on a binary classification task. However, substantial variation remains, and modeling community values remains an important topic for future work. We make our models and data public to inform community design and governance.

HCSep 11, 2021

Making Online Communities 'Better': A Taxonomy of Community Values on Reddit

Galen Weld, Amy X. Zhang, Tim Althoff

Many researchers studying online communities seek to make them better. However, beyond a small set of widely-held values, such as combating misinformation and abuse, determining what 'better' means can be challenging, as community members may disagree, values may be in conflict, and different communities may have differing preferences as a whole. In this work, we present the first study that elicits values directly from members across a diverse set of communities. We survey 212 members of 627 unique subreddits and ask them to describe their values for their communities in their own words. Through iterative categorization of 1,481 responses, we develop and validate a comprehensive taxonomy of community values, consisting of 29 subcategories within nine top-level categories, enabling principled, quantitative study of community values by researchers. Using our taxonomy, we reframe existing research problems, such as managing influxes of new members, as tensions between different values, and we identify understudied values, such as those regarding content quality and community size. We call for greater attention to vulnerable community members' values, and we make our codebook public for use in future research.

HCAug 4, 2021

Goldilocks: Consistent Crowdsourced Scalar Annotations with Relative Uncertainty

Quanze Chen, Daniel S. Weld, Amy X. Zhang

Human ratings have become a crucial resource for training and evaluating machine learning systems. However, traditional elicitation methods for absolute and comparative rating suffer from issues with consistency and often do not distinguish between uncertainty due to disagreement between annotators and ambiguity inherent to the item being rated. In this work, we present Goldilocks, a novel crowd rating elicitation technique for collecting calibrated scalar annotations that also distinguishes inherent ambiguity from inter-annotator disagreement. We introduce two main ideas: grounding absolute rating scales with examples and using a two-step bounding process to establish a range for an item's placement. We test our designs in three domains: judging toxicity of online comments, estimating satiety of food depicted in images, and estimating age based on portraits. We show that (1) Goldilocks can improve consistency in domains where interpretation of the scale is not universal, and that (2) representing items with ranges lets us simultaneously capture different sources of uncertainty leading to better estimates of pairwise relationship distributions.

HCJan 28, 2021

Exploring Lightweight Interventions at Posting Time to Reduce the Sharing of Misinformation on Social Media

Farnaz Jahanbakhsh, Amy X. Zhang, Adam J. Berinsky et al.

When users on social media share content without considering its veracity, they may unwittingly be spreading misinformation. In this work, we investigate the design of lightweight interventions that nudge users to assess the accuracy of information as they share it. Such assessment may deter users from posting misinformation in the first place, and their assessments may also provide useful guidance to friends aiming to assess those posts themselves. In support of lightweight assessment, we first develop a taxonomy of the reasons why people believe a news claim is or is not true; this taxonomy yields a checklist that can be used at posting time. We conduct evaluations to demonstrate that the checklist is an accurate and comprehensive encapsulation of people's free-response rationales. In a second experiment, we study the effects of three behavioral nudges -- 1) checkboxes indicating whether headings are accurate, 2) tagging reasons (from our taxonomy) that a post is accurate via a checklist and 3) providing free-text rationales for why a headline is or is not accurate -- on people's intention of sharing the headline on social media. From an experiment with 1668 participants, we find that both providing accuracy assessment and rationale reduce the sharing of false content. They also reduce the sharing of true content, but to a lesser degree that yields an overall decrease in the fraction of shared content that is false. Our findings have implications for designing social media and news sharing platforms that draw from richer signals of content credibility contributed by users. In addition, our validated taxonomy can be used by platforms and researchers as a way to gather rationales in an easier fashion than free-response.

MLNov 29, 2020

Approximate Cross-validated Mean Estimates for Bayesian Hierarchical Regression Models

Amy X. Zhang, Le Bao, Changcheng Li et al.

We introduce a novel procedure for obtaining cross-validated predictive estimates for Bayesian hierarchical regression models (BHRMs). Bayesian hierarchical models are popular for their ability to model complex dependence structures and provide probabilistic uncertainty estimates, but can be computationally expensive to run. Cross-validation (CV) is therefore not a common practice to evaluate the predictive performance of BHRMs. Our method circumvents the need to re-run computationally costly estimation methods for each cross-validation fold and makes CV more feasible for large BHRMs. By conditioning on the variance-covariance parameters, we shift the CV problem from probability-based sampling to a simple and familiar optimization problem. In many cases, this produces estimates which are equivalent to full CV. We provide theoretical results and demonstrate its efficacy on publicly available data and in simulations.

HCSep 18, 2020

CommunityClick: Capturing and Reporting Community Feedback from Town Halls to Improve Inclusivity

Mahmood Jasim, Pooya Khaloo, Somin Wadhwa et al.

Local governments still depend on traditional town halls for community consultation, despite problems such as a lack of inclusive participation for attendees and difficulty for civic organizers to capture attendees' feedback in reports. Building on a formative study with 66 town hall attendees and 20 organizers, we designed and developed CommunityClick, a communitysourcing system that captures attendees' feedback in an inclusive manner and enables organizers to author more comprehensive reports. During the meeting, in addition to recording meeting audio to capture vocal attendees' feedback, we modify iClickers to give voice to reticent attendees by allowing them to provide real-time feedback beyond a binary signal. This information then automatically feeds into a meeting transcript augmented with attendees' feedback and organizers' tags. The augmented transcript along with a feedback-weighted summary of the transcript generated from text analysis methods is incorporated into an interactive authoring tool for organizers to write reports. From a field experiment at a town hall meeting, we demonstrate how CommunityClick can improve inclusivity by providing multiple avenues for attendees to share opinions. Additionally, interviews with eight expert organizers demonstrate CommunityClick's utility in creating more comprehensive and accurate reports to inform critical civic decision-making. We discuss the possibility of integrating CommunityClick with town hall meetings in the future as well as expanding to other domains.

HCSep 16, 2020

A System for Interleaving Discussion and Summarization in Online Collaboration

Sunny Tian, Amy X. Zhang, David Karger

In many instances of online collaboration, ideation and deliberation about what to write happen separately from the synthesis of the deliberation into a cohesive document. However, this may result in a final document that has little connection to the discussion that came before. In this work, we present interleaved discussion and summarization, a process where discussion and summarization are woven together in a single space, and collaborators can switch back and forth between discussing ideas and summarizing discussion until it results in a final document that incorporates and references all discussion points. We implement this process into a tool called Wikum+ that allows groups working together on a project to create living summaries-artifacts that can grow as new collaborators, ideas, and feedback arise and shrink as collaborators come to consensus. We conducted studies where groups of six people each collaboratively wrote a proposal using Wikum+ and a proposal using a messaging platform along with Google Docs. We found that Wikum+'s integration of discussion and summarization helped users be more organized, allowing for light-weight coordination and iterative improvements throughout the collaboration process. A second study demonstrated that in larger groups, Wikum+ is more inclusive of all participants and more comprehensive in the final document compared to traditional tools.

HCAug 21, 2020

Investigating Differences in Crowdsourced News Credibility Assessment: Raters, Tasks, and Expert Criteria

Md Momen Bhuiyan, Amy X. Zhang, Connie Moon Sehat et al.

Misinformation about critical issues such as climate change and vaccine safety is oftentimes amplified on online social and search platforms. The crowdsourcing of content credibility assessment by laypeople has been proposed as one strategy to combat misinformation by attempting to replicate the assessments of experts at scale. In this work, we investigate news credibility assessments by crowds versus experts to understand when and how ratings between them differ. We gather a dataset of over 4,000 credibility assessments taken from 2 crowd groups---journalism students and Upwork workers---as well as 2 expert groups---journalists and scientists---on a varied set of 50 news articles related to climate science, a topic with widespread disconnect between public opinion and expert consensus. Examining the ratings, we find differences in performance due to the makeup of the crowd, such as rater demographics and political leaning, as well as the scope of the tasks that the crowd is assigned to rate, such as the genre of the article and partisanship of the publication. Finally, we find differences between expert assessments due to differing expert criteria that journalism versus science experts use---differences that may contribute to crowd discrepancies, but that also suggest a way to reduce the gap by designing crowd tasks tailored to specific expert criteria. From these findings, we outline future research directions to better design crowd processes that are tailored to specific crowds and types of content.

CYAug 10, 2020

PolicyKit: Building Governance in Online Communities

Amy X. Zhang, Grant Hugh, Michael S. Bernstein

The software behind online community platforms encodes a governance model that represents a strikingly narrow set of governance possibilities focused on moderators and administrators. When online communities desire other forms of government, such as ones that take many members' opinions into account or that distribute power in non-trivial ways, communities must resort to laborious manual effort. In this paper, we present PolicyKit, a software infrastructure that empowers online community members to concisely author a wide range of governance procedures and automatically carry out those procedures on their home platforms. We draw on political science theory to encode community governance into policies, or short imperative functions that specify a procedure for determining whether a user-initiated action can execute. Actions that can be governed by policies encompass everyday activities such as posting or moderating a message, but actions can also encompass changes to the policies themselves, enabling the evolution of governance over time. We demonstrate the expressivity of PolicyKit through implementations of governance models such as a random jury deliberation, a multi-stage caucus, a reputation system, and a promotion procedure inspired by Wikipedia's Request for Adminship (RfA) process.

HCJan 18, 2020

How do Data Science Workers Collaborate? Roles, Workflows, and Tools

Amy X. Zhang, Michael Muller, Dakuo Wang

Today, the prominence of data science within organizations has given rise to teams of data science workers collaborating on extracting insights from data, as opposed to individual data scientists working alone. However, we still lack a deep understanding of how data science workers collaborate in practice. In this work, we conducted an online survey with 183 participants who work in various aspects of data science. We focused on their reported interactions with each other (e.g., managers with engineers) and with different tools (e.g., Jupyter Notebook). We found that data science teams are extremely collaborative and work with a variety of stakeholders and tools during the six common steps of a data science workflow (e.g., clean data and train model). We also found that the collaborative practices workers employ, such as documentation, vary according to the kinds of tools they use. Based on these findings, we discuss design implications for supporting data science team collaborations and future research directions.

CYSep 29, 2014

Controversy and Sentiment in Online News

Yelena Mejova, Amy X. Zhang, Nicholas Diakopoulos et al.

How do news sources tackle controversial issues? In this work, we take a data-driven approach to understand how controversy interplays with emotional expression and biased language in the news. We begin by introducing a new dataset of controversial and non-controversial terms collected using crowdsourcing. Then, focusing on 15 major U.S. news outlets, we compare millions of articles discussing controversial and non-controversial issues over a span of 7 months. We find that in general, when it comes to controversial issues, the use of negative affect and biased language is prevalent, while the use of strong emotion is tempered. We also observe many differences across news sources. Using these findings, we show that we can indicate to what extent an issue is controversial, by comparing it with other issues in terms of how they are portrayed across different media.