HC Sep 19, 2023
Writer-Defined AI Personas for On-Demand Feedback Generation
Karim Benharrak, Tim Zindulka, Florian Lehmann et al.
Compelling writing is tailored to its audience. This is challenging, as writers may struggle to empathize with readers, get feedback in time, or gain access to the target group. We propose a concept that generates on-demand feedback, based on writer-defined AI personas of any target audience. We explore this concept with a prototype (using GPT-3.5) in two user studies (N=5 and N=11): Writers appreciated the concept and strategically used personas for getting different perspectives. The feedback was seen as helpful and inspired revisions of text and personas, although it was often verbose and unspecific. We discuss the impact of on-demand feedback, the limited representativity of contemporary AI systems, and further ideas for defining AI personas. This work contributes to the vision of supporting writers with AI by expanding the socio-technical perspective in AI tool design: To empower creators, we also need to keep in mind their relationship to an audience.
HC Mar 30
Deception by Design: A Temporal Dark Patterns Audit of McDonald's Self-Ordering Kiosk Flow
Aditya Kumar Purohit, Yuwei Liu, Manon Berney et al.
Self-ordering kiosks (SOKs) are widely deployed in fast food restaurants, transforming food ordering into digitally mediated, self-navigated interactions. While these systems enhance efficiency and average order value, they also create opportunities for manipulative interface design practices known as dark patterns. This paper presents a structured audit of the McDonald's self-ordering kiosk in Germany using the Temporal Analysis of Dark Patterns (TADP) framework. Through a scenario-based walkthrough simulating a time-pressured user, we reconstructed and analyzed 12 interface steps across intra-page, inter-page, and system levels. We identify recurring high-level strategies implemented through meso-level patterns such as adding steps, false hierarchy, bad defaults, hiding information, and pressured selling, and low-level patterns including visual prominence, confirmshaming, scarcity framing, feedforward ambiguity, emotional sensory manipulation, and partitioned pricing. Our findings demonstrate how these patterns accumulate across the interaction flow and may be amplified by the kiosk's linear task structure and physical context. These findings suggest that hybrid physical-digital consumer interfaces warrant closer scrutiny within emerging regulatory discussions on dark patterns.
HC Jan 30
A Conditional Companion: Lived Experiences of People with Mental Health Disorders Using LLMs
Aditya Kumar Purohit, Hendrik Heuer
Large Language Models (LLMs) are increasingly used for mental health support, yet little is known about how people with mental health challenges engage with them, how they evaluate their usefulness, and what design opportunities they envision. We conducted 20 semi-structured interviews with people in the UK who live with mental health conditions and have used LLMs for mental health support. Through reflexive thematic analysis, we found that participants engaged with LLMs in conditional and situational ways: for immediacy, the desire for non-judgement, self-paced disclosure, cognitive reframing, and relational engagement. Simultaneously, participants articulated clear boundaries informed by prior therapeutic experience: LLMs were effective for mild-to-moderate distress but inadequate for crises, trauma, and complex social-emotional situations. We contribute empirical insights into the lived use of LLMs for mental health, highlight boundary-setting as central to their safe role, and propose design and governance directions for embedding them responsibly within the care ecosystem.
HC Mar 3, 2025
Lost in Moderation: How Commercial Content Moderation APIs Over- and Under-Moderate Group-Targeted Hate Speech and Linguistic Variations
David Hartmann, Amin Oueslati, Dimitri Staufer et al.
Commercial content moderation APIs are marketed as scalable solutions to combat online hate speech. However, the reliance on these APIs risks both silencing legitimate speech, called over-moderation, and failing to protect online platforms from harmful speech, known as under-moderation. To assess such risks, this paper introduces a framework for auditing black-box NLP systems. Using the framework, we systematically evaluate five widely used commercial content moderation APIs. Analyzing five million queries based on four datasets, we find that APIs frequently rely on group identity terms, such as "black", to predict hate speech. While OpenAI's and Amazon's services perform slightly better, all providers under-moderate implicit hate speech, which uses codified messages, especially against LGBTQIA+ individuals. Simultaneously, they over-moderate counter-speech, reclaimed slurs and content related to Black, LGBTQIA+, Jewish, and Muslim people. We recommend that API providers offer better guidance on API implementation and threshold setting and more transparency on their APIs' limitations. Warning: This paper contains offensive and hateful terms and concepts. We have chosen to reproduce these terms for reasons of transparency.
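As a rough illustration of how over- and under-moderation depend on threshold setting, the sketch below computes both rates from moderation scores. It assumes that over-moderation is measured as the share of benign texts that get flagged and under-moderation as the share of hateful texts that do not; the function name `moderation_rates`, the 0/1 label encoding, and the example scores are hypothetical and not taken from the paper.

```python
def moderation_rates(scores, labels, threshold=0.5):
    """Return (over_moderation_rate, under_moderation_rate).

    scores: hate-speech scores in [0, 1] returned by some moderation API
            (hypothetical values here, no real API is called).
    labels: 1 for hateful text, 0 for benign text (assumed encoding).
    """
    flagged = [s >= threshold for s in scores]
    benign = [f for f, y in zip(flagged, labels) if y == 0]
    hateful = [f for f, y in zip(flagged, labels) if y == 1]
    # Over-moderation: benign texts that were flagged anyway.
    over = sum(benign) / len(benign) if benign else 0.0
    # Under-moderation: hateful texts that slipped through.
    under = sum(1 for f in hateful if not f) / len(hateful) if hateful else 0.0
    return over, under


# Toy data: two benign texts followed by two hateful texts.
example_scores = [0.9, 0.2, 0.7, 0.4]
example_labels = [0, 0, 1, 1]
over_rate, under_rate = moderation_rates(example_scores, example_labels)
print(over_rate, under_rate)
```

Lowering the threshold in this sketch trades under-moderation for over-moderation, which is one reason guidance on threshold setting matters.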
HC Nov 8, 2021
Beyond Participation: A Review of Co-Creation in Computing
Juliane Jarke, Gabriela Molina León, Irina Zakharova et al.
New methods and technologies for engaging future users and other stakeholders in participatory (design) processes are being developed and proposed. Increasingly, researchers refer to co-creation in order to capture such approaches. However, how co-creation is being framed and understood across domains differs substantially. To better understand co-creation in computing, we conducted a literature review of all papers in the ACM Digital Library with co-creation or co-create in their abstracts. After an initial screening, we retained 62 for further analysis. We introduce a framework to analyze different notions of co-creation, distinguishing between co-creation target audiences, the roles of co-creators, the role of technology (as means or objective) and its results. We discuss the adoption of co-creation in domains such as learning, business, arts & culture, health, and the public sector. This paper contributes to the understanding of different approaches and conceptualizations of co-creation in computing and puts forward an agenda for future research.
HC Sep 30, 2021
The Explanatory Gap in Algorithmic News Curation
Hendrik Heuer
Considering the large amount of available content, social media platforms increasingly employ machine learning (ML) systems to curate news. This paper examines how well different explanations help expert users understand why certain news stories are recommended to them. The expert users were journalists, who are trained to judge the relevance of news. Surprisingly, none of the explanations are perceived as helpful. Our investigation provides a first indication of a gap between what is available to explain ML-based curation systems and what users need to understand such systems. We call this the Explanatory Gap in Machine Learning-based Curation Systems.
HC Jul 21, 2021
Auditing the Biases Enacted by YouTube for Political Topics in Germany
Hendrik Heuer, Hendrik Hoch, Andreas Breiter et al.
With YouTube's growing importance as a news platform, its recommendation system has come under increased scrutiny. Recognizing YouTube's recommendation system as a broadcaster of media, we explore the applicability of laws that require broadcasters to give important political, ideological, and social groups adequate opportunity to express themselves in the broadcasted program of the service. We present audits as an important tool to enforce such laws and to ensure that a system operates in the public's interest. To examine whether YouTube is enacting certain biases, we collected video recommendations about political topics by following chains of ten recommendations per video. Our findings suggest that YouTube's recommendation system is enacting important biases. We find that YouTube is recommending increasingly popular but topically unrelated videos. The sadness evoked by the recommended videos decreases while the happiness increases. We discuss the strong popularity bias we identified and analyze the link between the popularity of content and emotions. We also discuss how audits empower researchers and civic hackers to monitor complex machine learning (ML)-based systems like YouTube's recommendation system.
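The idea of following a chain of recommendations from a seed video can be sketched as below. This is only an illustration under stated assumptions: the abstract does not say how the next video in a chain was chosen, so the sketch simply takes the first recommendation each time, and `follow_chain`, `get_recommendations`, and the toy graph are hypothetical stand-ins rather than the authors' crawler or any real YouTube API.

```python
def follow_chain(seed_video, get_recommendations, depth=10):
    """Follow up to `depth` recommendations starting from `seed_video`.

    get_recommendations: a caller-supplied function mapping a video id to an
    ordered list of recommended video ids (hypothetical; no real API is used).
    Assumption: the chain always continues with the first recommendation.
    """
    chain = [seed_video]
    current = seed_video
    for _ in range(depth):
        recommendations = get_recommendations(current)
        if not recommendations:
            break  # dead end: nothing recommended for this video
        current = recommendations[0]
        chain.append(current)
    return chain


# Toy recommendation graph standing in for collected data.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(follow_chain("a", lambda video: toy_graph.get(video, []), depth=4))
```

An audit along these lines would then compare properties of the videos at each chain position, such as popularity or topic.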
HC Jul 21, 2021
Audit, Don't Explain -- Recommendations Based on a Socio-Technical Understanding of ML-Based Systems
Hendrik Heuer
In this position paper, I provide a socio-technical perspective on machine learning-based systems. I also explain why systematic audits may be preferable to explainable AI systems. I make concrete recommendations for how institutions governed by public law akin to the German TÜV and Stiftung Warentest can ensure that ML systems operate in the interest of the public.
HC Apr 9, 2021
Helping People Deal With Disinformation -- A Socio-Technical Perspective
Hendrik Heuer
Since the advent of the Internet at the latest, disinformation and conspiracy theories have become ubiquitous. Recent examples like QAnon and Pizzagate show that false information can lead to real violence. In this motivation statement for the Workshop on Human Aspects of Misinformation at CHI 2021, I explain my research agenda, which focuses on (1) why people believe in disinformation, (2) how people can best be supported in recognizing disinformation, and (3) what the potentials and risks of different tools designed to fight disinformation are.
CL Feb 26, 2021
Methods for the Design and Evaluation of HCI+NLP Systems
Hendrik Heuer, Daniel Buschek
HCI and NLP traditionally focus on different evaluation methods. While HCI involves a small number of people directly and deeply, NLP traditionally relies on standardized benchmark evaluations that involve a larger number of people indirectly. We present five methodological proposals at the intersection of HCI and NLP and situate them in the context of ML-based NLP models. Our goal is to foster interdisciplinary collaboration and progress in both fields by emphasizing what the fields can learn from each other.
HC Aug 7, 2020
Middle-Aged Video Consumers' Beliefs About Algorithmic Recommendations on YouTube
Oscar Alvarado, Hendrik Heuer, Vero Vanden Abeele et al.
User beliefs about algorithmic systems are constantly co-produced through user interaction and the complex socio-technical systems that generate recommendations. Identifying these beliefs is crucial because they influence how users interact with recommendation algorithms. With no prior work on user beliefs of algorithmic video recommendations, practitioners lack relevant knowledge to improve the user experience of such systems. To address this problem, we conducted semi-structured interviews with middle-aged YouTube video consumers to analyze their user beliefs about the video recommendation system. Our analysis revealed different factors that users believe influence their recommendations. Based on these factors, we identified four groups of user beliefs: Previous Actions, Social Media, Recommender System, and Company Policy. Additionally, we propose a framework to distinguish the four main actors that users believe influence their video recommendations: the current user, other users, the algorithm, and the organization. This framework provides a new lens to explore design suggestions based on the agency of these four actors. It also exposes a novel aspect previously unexplored: the effect of corporate decisions on the interaction with algorithmic recommendations. While we found that users are aware of the existence of the recommendation system on YouTube, we show that their understanding of this system is limited.
HC Aug 5, 2020
How Fake News Affect Trust in the Output of a Machine Learning System for News Curation
Hendrik Heuer, Andreas Breiter
People are increasingly consuming news curated by machine learning (ML) systems. Motivated by studies on algorithmic bias, this paper explores which recommendations of an algorithmic news curation system users trust and how this trust is affected by untrustworthy news stories like fake news. In a study with 82 vocational school students with a background in IT, we found that users are able to provide trust ratings that distinguish trustworthy recommendations of quality news stories from untrustworthy recommendations. However, a single untrustworthy news story combined with four trustworthy news stories is rated similarly to five trustworthy news stories. The results could be a first indication that untrustworthy news stories benefit from appearing in a trustworthy context. The results also show the limitations of users' abilities to rate the recommendations of a news curation system. We discuss the implications of this for the user experience of interactive machine learning systems.
HC Aug 5, 2020
More Than Accuracy: Towards Trustworthy Machine Learning Interfaces for Object Recognition
Hendrik Heuer, Andreas Breiter
This paper investigates the user experience of visualizations of a machine learning (ML) system that recognizes objects in images. This is important since even good systems can fail in unexpected ways, as misclassifications on photo-sharing websites have shown. In our study, we exposed users with a background in ML to three visualizations of three systems with different levels of accuracy. In interviews, we explored how the visualization helped users assess the accuracy of systems in use and how the visualization and the accuracy of the system affected trust and reliance. We found that participants do not focus only on accuracy when assessing ML systems. They also take the perceived plausibility and severity of misclassification into account and prefer seeing the probability of predictions. Semantically plausible errors are judged as less severe than errors that are implausible, which means that system accuracy could be communicated through the types of errors.
CL Jun 24, 2019
Is It Worth the Attention? A Comparative Evaluation of Attention Layers for Argument Unit Segmentation
Maximilian Spliethöver, Jonas Klaff, Hendrik Heuer
Attention mechanisms have seen some success for natural language processing downstream tasks in recent years and generated new state-of-the-art results. A thorough evaluation of the attention mechanism for the task of Argumentation Mining is missing, though. With this paper, we report a comparative evaluation of attention layers in combination with a bidirectional long short-term memory network, which is the current state-of-the-art approach to the unit segmentation task. We also compare sentence-level contextualized word embeddings to pre-generated ones. Our findings suggest that for this task the additional attention layer does not improve upon a less complex approach. In most cases, the contextualized embeddings also do not show an improvement on the baseline score.
CV Oct 12, 2016
Generating captions without looking beyond objects
Hendrik Heuer, Christof Monz, Arnold W. M. Smeulders
This paper explores new evaluation perspectives for image captioning and introduces a noun translation task that achieves comparable image caption generation performance by translating from a set of nouns to captions. This implies that in image captioning, all word categories other than nouns can be evoked by a powerful language model without sacrificing performance on n-gram precision. The paper also investigates lower and upper bounds of how much individual word categories in the captions contribute to the final BLEU score. A large possible improvement exists for nouns, verbs, and prepositions.
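For readers unfamiliar with the n-gram precision that underlies BLEU, the following minimal sketch computes the clipped (modified) n-gram precision of a candidate caption against reference captions. It omits BLEU's brevity penalty and the geometric mean over n-gram orders, and the function names and example captions are illustrative assumptions, not material from the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    often as it appears in any single reference."""
    candidate_counts = Counter(ngrams(candidate, n))
    if not candidate_counts:
        return 0.0
    max_reference_counts = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference, n)).items():
            max_reference_counts[gram] = max(max_reference_counts[gram], count)
    clipped = sum(min(count, max_reference_counts[gram])
                  for gram, count in candidate_counts.items())
    return clipped / sum(candidate_counts.values())


# Hypothetical captions: one generated candidate and one human reference.
candidate = "a dog sits on the mat".split()
reference = "a dog is on the mat".split()
print(modified_precision(candidate, [reference], 1))  # 5 of 6 unigrams match
print(modified_precision(candidate, [reference], 2))  # 3 of 5 bigrams match
```

Because the score only counts matching word sequences, replacing a single word category in the captions changes how many n-grams can match, which is the kind of contribution the bounds in the abstract refer to.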
CL Jul 2, 2016
Text comparison using word vector representations and dimensionality reduction
Hendrik Heuer
This paper describes a technique to compare large text sources using word vector representations (word2vec) and dimensionality reduction (t-SNE) and how it can be implemented using Python. The technique provides a bird's-eye view of text sources, e.g. text summaries and their source material, and enables users to explore text sources like a geographical map. Word vector representations capture many linguistic properties such as gender, tense, plurality and even semantic concepts like "capital city of". Using dimensionality reduction, a 2D map can be computed where semantically similar words are close to each other. The technique uses the word2vec model from the gensim Python library and t-SNE from scikit-learn.
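The claim that word vectors capture relations like "capital city of" is usually illustrated with vector offsets. The sketch below shows the idea with hand-made two-dimensional toy vectors and cosine similarity. It deliberately does not use gensim or scikit-learn, the vectors are invented for illustration rather than learned by word2vec, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(vectors, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c),
    excluding the three query words themselves."""
    target = tuple(y - x + z
                   for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine(vectors[word], target))


# Invented toy vectors: the second coordinate separates capitals from countries.
toy_vectors = {
    "france": (1.0, 1.0), "paris": (1.0, 3.0),
    "germany": (3.0, 1.0), "berlin": (3.0, 3.0),
    "italy": (5.0, 1.0), "rome": (5.0, 3.0),
}

# "france is to paris as germany is to ?"
print(analogy(toy_vectors, "france", "paris", "germany"))  # berlin
```

With real embeddings the vectors have hundreds of dimensions, which is why a dimensionality reduction step is needed before they can be drawn as a 2D map.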