SIMar 20
The Prosocial Ranking Challenge: Reducing Polarization on Social Media without Sacrificing EngagementJonathan Stray, Ian Baker, George Beknazar-Yuzbashev et al. · uw
We report the first direct comparisons of multiple alternative social media algorithms on multiple platforms on outcomes of societal interest. We used a browser extension to modify which posts were shown to desktop social media users, randomly assigning 9,386 users to a control group or one of five alternative ranking algorithms which simultaneously altered content across three platforms for six months during the US 2024 presidential election. This reduced our preregistered index of affective polarization by an average of 0.03 standard deviations (p < 0.05), including a 1.5 degree decrease in differences between the 100 point inparty and outparty feeling thermometers. We saw reductions in active use time for Facebook (-0.37 min/day) and Reddit (-0.2 min/day), but an increase of 0.32 min/day (p < 0.01) for X/Twitter. We saw an increase in reports of negative social media experiences but found no effects on well-being, news knowledge, outgroup empathy, perceptions of and support for partisan violence. This implies that bridging content can improve some societal outcomes without necessarily conflicting with the engagement-driven business model of social media.
LGFeb 21, 2024
Spot Check Equivalence: an Interpretable Metric for Information Elicitation MechanismsShengwei Xu, Yichi Zhang, Paul Resnick et al.
Because high-quality data is like oxygen for AI systems, effectively eliciting information from crowdsourcing workers has become a first-order problem for developing high-performance machine learning algorithms. Two prevalent paradigms, spot-checking and peer prediction, enable the design of mechanisms to evaluate and incentivize high-quality data from human labelers. So far, at least three metrics have been proposed to compare the performances of these techniques [33, 8, 3]. However, different metrics lead to divergent and even contradictory results in various contexts. In this paper, we harmonize these divergent stories, showing that two of these metrics are actually the same within certain contexts and explain the divergence of the third. Moreover, we unify these different contexts by introducing \textit{Spot Check Equivalence}, which offers an interpretable metric for the effectiveness of a peer prediction mechanism. Finally, we present two approaches to compute spot check equivalence in various contexts, where simulation results verify the effectiveness of our proposed metric.
HCJun 13, 2024
Position: Towards Bidirectional Human-AI AlignmentHua Shen, Tiffany Knearem, Reshmi Ghosh et al.
Recent advances in general-purpose AI underscore the urgent need to align AI systems with human goals and values. Yet, the lack of a clear, shared understanding of what constitutes "alignment" limits meaningful progress and cross-disciplinary collaboration. In this position paper, we argue that the research community should explicitly define and critically reflect on "alignment" to account for the bidirectional and dynamic relationship between humans and AI. Through a systematic review of over 400 papers spanning HCI, NLP, ML, and more, we examine how alignment is currently defined and operationalized. Building on this analysis, we introduce the Bidirectional Human-AI Alignment framework, which not only incorporates traditional efforts to align AI with human values but also introduces the critical, underexplored dimension of aligning humans with AI -- supporting cognitive, behavioral, and societal adaptation to rapidly advancing AI technologies. Our findings reveal significant gaps in current literature, especially in long-term interaction design, human value modeling, and mutual understanding. We conclude with three central challenges and actionable recommendations to guide future research toward more nuanced, reciprocal, and human-AI alignment approaches.
CYFeb 1, 2022
Remove, Reduce, Inform: What Actions do People Want Social Media Platforms to Take on Potentially Misleading Content?Shubham Atreja, Libby Hemphill, Paul Resnick
To reduce the spread of misinformation, social media platforms may take enforcement actions against offending content, such as adding informational warning labels, reducing distribution, or removing content entirely. However, both their actions and their inactions have been controversial and plagued by allegations of partisan bias. When it comes to specific content items, surprisingly little is known about what ordinary people want the platforms to do. We provide empirical evidence about a politically balanced panel of lay raters' preferences for three potential platform actions on 368 news articles. Our results confirm that on many articles there is a lack of consensus on which actions to take. We find a clear hierarchy of perceived severity of actions with a majority of raters wanting informational labels on the most articles and removal on the fewest. There was no partisan difference in terms of how many articles deserve platform actions but conservatives did prefer somewhat more action on content from liberal sources, and vice versa. We also find that judgments about two holistic properties, misleadingness and harm, could serve as an effective proxy to determine what actions would be approved by a majority of raters.
HCAug 17, 2021
Searching For or Reviewing Evidence Improves Crowdworkers' Misinformation Judgments and Reduces Partisan BiasPaul Resnick, Aljohara Alfayez, Jane Im et al.
Can crowd workers be trusted to judge whether news-like articles circulating on the Internet are misleading, or does partisanship and inexperience get in the way? And can the task be structured in a way that reduces partisanship? We assembled pools of both liberal and conservative crowd raters and tested three ways of asking them to make judgments about 374 articles. In a no research condition, they were just asked to view the article and then render a judgment. In an individual research condition, they were also asked to search for corroborating evidence and provide a link to the best evidence they found. In a collective research condition, they were not asked to search, but instead to review links collected from workers in the individual research condition. Both research conditions reduced partisan disagreement in judgments. The individual research condition was most effective at producing alignment with journalists' assessments. In this condition, the judgments of a panel of sixteen or more crowd workers were better than that of a panel of three expert journalists, as measured by alignment with a held out journalist's ratings.
LGJun 2, 2021
Rater Equivalence: Evaluating Classifiers in Human Judgment SettingsPaul Resnick, Yuqing Kong, Grant Schoenebeck et al.
In many decision settings, the definitive ground truth is either non-existent or inaccessible. We introduce a framework for evaluating classifiers based solely on human judgments. In such cases, it is helpful to compare automated classifiers to human judgment. We quantify a classifier's performance by its rater equivalence: the smallest number of human raters whose combined judgment matches the classifier's performance. Our framework uses human-generated labels both to construct benchmark panels and to evaluate performance. We distinguish between two models of utility: one based on agreement with the assumed but inaccessible ground truth, and one based on matching individual human judgments. Using case studies and formal analysis, we demonstrate how this framework can inform the evaluation and deployment of AI systems in practice.
HCSep 3, 2018
GuessTheKarma: A Game to Assess Social Rating SystemsMaria Glenski, Greg Stoddard, Paul Resnick et al.
Popularity systems, like Twitter retweets, Reddit upvotes, and Pinterest pins have the potential to guide people toward posts that others liked. That, however, creates a feedback loop that reduces their informativeness: items marked as more popular get more attention, so that additional upvotes and retweets may simply reflect the increased attention and not independent information about the fraction of people that like the items. How much information remains? For example, how confident can we be that more people prefer item A to item B if item A had hundreds of upvotes on Reddit and item B had only a few? We investigate using an Internet game called GuessTheKarma that collects independent preference judgments (N=20,674) for 400 pairs of images, approximately 50 per pair. Unlike the rating systems that dominate social media services, GuessTheKarma is devoid of social and ranking effects that influence ratings. Overall, Reddit scores were not very good predictors of the true population preferences for items as measured by GuessTheKarma: the image with higher score was preferred by a majority of independent raters only 68% of the time. However, when one image had a low score and the other was one of the highest scoring in its subreddit, the higher scoring image was preferred nearly 90% of the time by the majority of independent raters. Similarly, Imgur view counts for the images were poor predictors except when there were orders of magnitude differences between the pairs. We conclude that popularity systems marked by feedback loops may convey a strong signal about population preferences, but only when comparing items that received vastly different popularity scores.
CLSep 1, 2018
Extractive Adversarial Networks: High-Recall Explanations for Identifying Personal Attacks in Social Media PostsSamuel Carton, Qiaozhu Mei, Paul Resnick
We introduce an adversarial method for producing high-recall explanations of neural text classifier decisions. Building on an existing architecture for extractive explanations via hard attention, we add an adversarial layer which scans the residual of the attention for remaining predictive signal. Motivated by the important domain of detecting personal attacks in social media comments, we additionally demonstrate the importance of manually setting a semantically appropriate `default' behavior for the model by explicitly manipulating its bias term. We develop a validation set of human-annotated personal attacks to evaluate the impact of these changes.