Robin Welsch

HC
h-index3
6papers
98citations
Novelty48%
AI Score44

6 Papers

LGJan 24, 2023
Investigating Labeler Bias in Face Annotation for Machine Learning

Luke Haliburton, Sinksar Ghebremedhin, Robin Welsch et al. · cmu

In a world increasingly reliant on artificial intelligence, it is more important than ever to consider the ethical implications of artificial intelligence on humanity. One key under-explored challenge is labeler bias, which can create inherently biased datasets for training and subsequently lead to inaccurate or unfair decisions in healthcare, employment, education, and law enforcement. Hence, we conducted a study to investigate and measure the existence of labeler bias using images of people from different ethnicities and sexes in a labeling task. Our results show that participants possess stereotypes that influence their decision-making process and that labeler demographics impact assigned labels. We also discuss how labeler bias influences datasets and, subsequently, the models trained on them. Overall, a high degree of transparency must be maintained throughout the entire artificial intelligence training process to identify and correct biases in the data as early as possible.

HCMar 6, 2023
The AI Ghostwriter Effect: When Users Do Not Perceive Ownership of AI-Generated Text But Self-Declare as Authors

Fiona Draxler, Anna Werner, Florian Lehmann et al.

Human-AI interaction in text production increases complexity in authorship. In two empirical studies (n1 = 30 & n2 = 96), we investigate authorship and ownership in human-AI collaboration for personalized language generation. We show an AI Ghostwriter Effect: Users do not consider themselves the owners and authors of AI-generated text but refrain from publicly declaring AI authorship. Personalization of AI-generated texts did not impact the AI Ghostwriter Effect, and higher levels of participants' influence on texts increased their sense of ownership. Participants were more likely to attribute ownership to supposedly human ghostwriters than AI ghostwriters, resulting in a higher ownership-authorship discrepancy for human ghostwriters. Rationalizations for authorship in AI ghostwriters and human ghostwriters were similar. We discuss how our findings relate to psychological ownership and human-AI interaction to lay the foundations for adapting authorship frameworks and user interfaces in AI in text-generation tasks.

HCSep 28, 2023
"AI enhances our performance, I have no doubt this one will do the same": The Placebo effect is robust to negative descriptions of AI

Agnes M. Kloft, Robin Welsch, Thomas Kosch et al.

Heightened AI expectations facilitate performance in human-AI interactions through placebo effects. While lowering expectations to control for placebo effects is advisable, overly negative expectations could induce nocebo effects. In a letter discrimination task, we informed participants that an AI would either increase or decrease their performance by adapting the interface, but in reality, no AI was present in any condition. A Bayesian analysis showed that participants had high expectations and performed descriptively better irrespective of the AI description when a sham-AI was present. Using cognitive modeling, we could trace this advantage back to participants gathering more information. A replication study verified that negative AI descriptions do not alter expectations, suggesting that performance expectations with AI are biased and robust to negative verbal descriptions. We discuss the impact of user expectations on AI interactions and evaluation and provide a behavioral placebo marker for human-AI interaction

64.8HCMay 25
Explaining Too Much? Understanding How Large Language Model Reasoning Traces Influence Performance and Metacognition

Daniela Fernandes, Daniel Buschek, Lev Tankelevitch et al.

Large Language Model interfaces are increasingly verbose, exposing intermediate reasoning traces alongside final answers. Traces are framed as transparency mechanisms, yet it is unclear how people use them to solve problems. We report a preregistered between-subjects study (N = 559) in which participants solved ten LSAT-style reasoning problems under one of three conditions: an Answer-only baseline, a Full-trace revealed before the answer, and a Summary-trace presented alongside the answer. Summaries preserved task performance at the no-trace baseline while significantly elevating trust and hedonic appeal, establishing that trace exposure shifts subjective appraisal of the interaction without bringing performance benefits. Under an open-weight reasoning model exposing verbose intermediate output, full traces additionally impaired performance relative to the answer-only baseline. Across all conditions, participants substantially overestimated their performance, and no trace format supported calibrated self-evaluation. Further analysis indicates that hedonic appeal, not trust, carries the indirect path to overestimation, consistent with a processing-fluency account. Reasoning traces are best understood as user-facing interface artifacts rather than transparent windows into model cognition, and calibration is unlikely to emerge from the traces themselves and may best be scaffolded by interactions that elicit users' own reasoning first.

HCJul 30, 2024
Questionnaires for Everyone: Streamlining Cross-Cultural Questionnaire Adaptation with GPT-Based Translation Quality Evaluation

Otso Haavisto, Robin Welsch

Adapting questionnaires to new languages is a resource-intensive process often requiring the hiring of multiple independent translators, which limits the ability of researchers to conduct cross-cultural research and effectively creates inequalities in research and society. This work presents a prototype tool that can expedite the questionnaire translation process. The tool incorporates forward-backward translation using DeepL alongside GPT-4-generated translation quality evaluations and improvement suggestions. We conducted two online studies in which participants translated questionnaires from English to either German (Study 1; n=10) or Portuguese (Study 2; n=20) using our prototype. To evaluate the quality of the translations created using the tool, evaluation scores between conventionally translated and tool-supported versions were compared. Our results indicate that integrating LLM-generated translation quality evaluations and suggestions for improvement can help users independently attain results similar to those provided by conventional, non-NLP-supported translation methods. This is the first step towards more equitable questionnaire-based research, powered by AI.

HCSep 15, 2025
The AI Memory Gap: Users Misremember What They Created With AI or Without

Tim Zindulka, Sven Goller, Daniela Fernandes et al.

As large language models (LLMs) become embedded in interactive text generation, disclosure of AI as a source depends on people remembering which ideas or texts came from themselves and which were created with AI. We investigate how accurately people remember the source of content when using AI. In a pre-registered experiment, 184 participants generated and elaborated on ideas both unaided and with an LLM-based chatbot. One week later, they were asked to identify the source (noAI vs withAI) of these ideas and texts. Our findings reveal a significant gap in memory: After AI use, the odds of correct attribution dropped, with the steepest decline in mixed human-AI workflows, where either the idea or elaboration was created with AI. We validated our results using a computational model of source memory. Discussing broader implications, we highlight the importance of considering source confusion in the design and use of interactive text generation technologies.