Allison Koenecke

h-index: 12

13 papers

239 citations

Novelty: 35%

AI Score: 51

Ranked #18,903 of 194,257 authors (top 10%) · #3,998 in CL (top 13%)

13 Papers

10.0 · HC · May 15, 2022

Trucks Don't Mean Trump: Diagnosing Human Error in Image Analysis

J. D. Zamfirescu-Pereira, Jerry Chen, Emily Wen et al. · berkeley

Algorithms provide powerful tools for detecting and dissecting human bias and error. Here, we develop machine learning methods to analyze how humans err in a particular high-stakes task: image interpretation. We leverage a unique dataset of 16,135,392 human predictions of whether a neighborhood voted for Donald Trump or Joe Biden in the 2020 US election, based on a Google Street View image. We show that by training a machine learning estimator of the Bayes optimal decision for each image, we can provide an actionable decomposition of human error into bias, variance, and noise terms, and further identify specific features (like pickup trucks) which lead humans astray. Our methods can be applied to ensure that human-in-the-loop decision-making is accurate and fair and are also applicable to black-box algorithmic systems.
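A bias, variance, and noise split of this kind can be made concrete for a single image with a binary outcome. The sketch below uses the textbook squared-error decomposition and assumes that human guesses and the true outcome are independent given the image. The names `decompose_error`, `human_votes`, and `p_true` are hypothetical, and this is not necessarily the decomposition the paper uses.

```python
def decompose_error(human_votes, p_true):
    """Split the expected 0-1 error of human guesses on one image.

    human_votes: list of 0/1 guesses from different people (hypothetical input).
    p_true: estimated probability that the true label is 1 for this image,
            standing in for the Bayes-optimal estimator (an assumption).
    """
    q = sum(human_votes) / len(human_votes)  # share of humans guessing 1
    noise = p_true * (1 - p_true)            # irreducible label uncertainty
    bias_sq = (q - p_true) ** 2              # systematic gap between humans and truth
    variance = q * (1 - q)                   # disagreement among humans
    return {
        "bias_sq": bias_sq,
        "variance": variance,
        "noise": noise,
        "total": bias_sq + variance + noise,
    }


# Three of four people guess 1 on an image whose true label is a coin flip.
parts = decompose_error([1, 1, 1, 0], p_true=0.5)
print(parts)
```

Under the independence assumption, the three terms sum to the probability that a random human guess disagrees with the true label, `q + p_true - 2 * q * p_true`.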

2.3 · ME · Jun 20, 2023

Should I Stop or Should I Go: Early Stopping with Heterogeneous Populations

Hammaad Adam, Fan Yin, Huibin et al.

Randomized experiments often need to be stopped prematurely due to the treatment having an unintended harmful effect. Existing methods that determine when to stop an experiment early are typically applied to the data in aggregate and do not account for treatment effect heterogeneity. In this paper, we study the early stopping of experiments for harm on heterogeneous populations. We first establish that current methods often fail to stop experiments when the treatment harms a minority group of participants. We then use causal machine learning to develop CLASH, the first broadly-applicable method for heterogeneous early stopping. We demonstrate CLASH's performance on simulated and real data and show that it yields effective early stopping for both clinical trials and A/B tests.

7.6 · CY · May 12

Into the Unknown: Accounting for Missing Demographic Data when Mitigating Ad Delivery Skew

Isabel Corpus, Allison Koenecke

Online advertising platforms use algorithmic systems to power the process of matching ads to users, termed ad delivery. Prior audits have demonstrated that ad delivery can be skewed by demographic attributes, such that ads are systematically under-delivered to certain groups despite advertiser intent to reach groups proportionally. This under-delivery raises a serious concern in the context of ads promoting public services, which might prevent certain groups of individuals from accessing information about resources on the basis of their demographic identity. In the absence of platform-provided solutions to skewed ad delivery, advertisers can counteract skew by targeting demographic groups directly. However, direct targeting excludes users whose demographics the platform cannot infer ("unknown users") if advertising platforms do not provide a way to target unknown users directly, as is the case on Google Ads. We collaborate with a state-level government agency to reduce gender-based skew in ad delivery with an intervention that accounts for unknown users while incorporating gender-based targeting. In particular, we design a budget split intervention that directly incorporates unknown users and targets users with Google-inferred gender labels (i.e., male, female). We find that this intervention is a valuable approach to addressing ad delivery skew without excluding unknown users, and serves as a middle ground in the trade-off between higher costs (from more granular demographic targeting) and skew (from ignoring demographics entirely). This approach is responsive to the needs of real-world, resource-constrained advertisers who are committed to the equitable distribution of public service outreach via online advertising. We conclude with recommendations for government advertisers, online advertising platforms, and researchers.

8.6 · CY · May 4

A Critical Pragmatism Approach for Algorithmic Fairness: Lessons from Urban Planning Theory

Jennah Gosciak, Karen Levy, Allison Koenecke

As data scientists grapple with increasingly complex ethical decisions in machine learning (ML) and data science, the field of algorithmic fairness has offered multiple solutions, from formal mathematical definitions to holistic notions of fairness drawn from various academic disciplines. However, navigating and implementing these fairness approaches in practice remains an ongoing challenge. In this paper, we draw a parallel between the types of problems arising in algorithmic fairness and urban planning. We frame algorithmic fairness problems as "wicked problems," a term originating from the planning and policy space to describe the intractable, value-laden, and complex nature of this work. As such, we argue that the field of algorithmic fairness can learn from theoretical work in urban planning in ameliorating its own set of wicked problems. Urban planning is typically concerned with practical issues of governance, resource allocation, stakeholder engagement, and conflicts involving deep-seated differences. These are challenges that existing fairness frameworks can easily overlook. We present a flexible framework for designing fairer algorithms based on the urban planning theory approach of critical pragmatism -- a reflective and deliberative approach to addressing wicked problems that considers what practitioners actually do in the face of conflict and power. We provide specific recommendations and apply them to several case studies in ML and algorithm design: automated mortgage lending, school choice, and feminicide counterdata collection. Researchers and practitioners can incorporate these recommendations derived from urban planning into their ongoing work to more holistically address practical problems arising in fair algorithm design.

14.7 · CL · May 28, 2025 · Code

Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese

Hanjia Lyu, Jiebo Luo, Jian Kang et al.

While the capabilities of Large Language Models (LLMs) have been studied in both Simplified and Traditional Chinese, it is yet unclear whether LLMs exhibit differential performance when prompted in these two variants of written Chinese. This understanding is critical, as disparities in the quality of LLM responses can perpetuate representational harms by ignoring the different cultural contexts underlying Simplified versus Traditional Chinese, and can exacerbate downstream harms in LLM-facilitated decision-making in domains such as education or hiring. To investigate potential LLM performance disparities, we design two benchmark tasks that reflect real-world scenarios: regional term choice (prompting the LLM to name a described item which is referred to differently in Mainland China and Taiwan), and regional name choice (prompting the LLM to choose who to hire from a list of names in both Simplified and Traditional Chinese). For both tasks, we audit the performance of 11 leading commercial LLM services and open-sourced models -- spanning those primarily trained on English, Simplified Chinese, or Traditional Chinese. Our analyses indicate that biases in LLM responses are dependent on both the task and prompting language: while most LLMs disproportionately favored Simplified Chinese responses in the regional term choice task, they surprisingly favored Traditional Chinese names in the regional name choice task. We find that these disparities may arise from differences in training data representation, written character preferences, and tokenization of Simplified and Traditional Chinese. These findings highlight the need for further analysis of LLM biases; as such, we provide an open-sourced benchmark dataset to foster reproducible evaluations of future LLM behavior across Chinese language variants (https://github.com/brucelyu17/SC-TC-Bench).

4.8 · CL · Jul 17, 2024 · Code

Automate or Assist? The Role of Computational Models in Identifying Gendered Discourse in US Capital Trial Transcripts

Andrea W Wen-Yi, Kathryn Adamson, Nathalie Greenfield et al.

The language used by US courtroom actors in criminal trials has long been studied for biases. However, systematic studies for bias in high-stakes court trials have been difficult, due to the nuanced nature of bias and the legal expertise required. Large language models offer the possibility to automate annotation. But validating the computational approach requires both an understanding of how automated methods fit in existing annotation workflows and what they really offer. We present a case study of adding a computational model to a complex and high-stakes problem: identifying gender-biased language in US capital trials for women defendants. Our team of experienced death-penalty lawyers and NLP technologists pursues a three-phase study: first annotating manually, then training and evaluating computational models, and finally comparing expert annotations to model predictions. Unlike many typical NLP tasks, annotating for gender bias in months-long capital trials is complicated, with many individual judgment calls. Contrary to standard arguments for automation that are based on efficiency and scalability, legal experts find the computational models most useful in providing opportunities to reflect on their own bias in annotation and to build consensus on annotation rules. This experience suggests that seeking to replace experts with computational models for complex annotation is both unrealistic and undesirable. Rather, computational models offer valuable opportunities to assist the legal experts in annotation-based studies.

10.5 · HC · Mar 11

LLMs in social services: How does chatbot accuracy affect human accuracy?

Jennah Gosciak, Eric Giannella, Zhaowen Guo et al.

Social service programs like the Supplemental Nutrition Assistance Program (SNAP, or food stamps) have eligibility rules that can be challenging to understand. For nonprofit caseworkers who often support clients in navigating a dozen or more complex programs, LLM-based chatbots may offer a means to provide better, faster help to clients whose situations may be less common. In this paper, we measure the potential effects of LLM-based chatbot suggestions on caseworkers' ability to provide accurate guidance. We first created a 770-question multiple-choice benchmark dataset of difficult but realistic questions that a caseworker might receive. Next, using these benchmark questions and corresponding expert-verified answers, we conducted a randomized experiment with caseworkers recruited from nonprofit outreach organizations in Los Angeles. Caseworkers in the control condition did not see chatbot suggestions and had a mean accuracy of 49%. Caseworkers in the treatment condition saw chatbot suggestions that we artificially varied to range in aggregate accuracy from low (53%) to high (100%). Caseworker performance significantly improves as chatbot quality improves: high-quality chatbots (96-100% accurate) improved caseworker accuracy by 27 percentage points. At the question level, incorrect chatbot suggestions substantially reduce caseworker accuracy, with a two-thirds reduction on easy questions where the control group performed best (without chatbot suggestions). Finally, improvements in caseworker accuracy level off as chatbot accuracy increases, a phenomenon that we call the "AI underreliance plateau," which is a concern for real-world deployment and highlights the importance of evaluating human-in-the-loop tools with their users.

18.1 · CL · Feb 12, 2024 · Code

Careless Whisper: Speech-to-Text Hallucination Harms

Allison Koenecke, Anna Seo Gyeong Choi, Katelyn X. Mei et al.

Speech-to-text services aim to transcribe input audio as accurately as possible. They increasingly play a role in everyday life, for example in personal voice assistants or in customer-company interactions. We evaluate OpenAI's Whisper, a state-of-the-art automated speech recognition service outperforming industry competitors, as of 2023. While many of Whisper's transcriptions were highly accurate, we find that roughly 1% of audio transcriptions contained entire hallucinated phrases or sentences which did not exist in any form in the underlying audio. We thematically analyze the Whisper-hallucinated content, finding that 38% of hallucinations include explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority. We then study why hallucinations occur by observing the disparities in hallucination rates between speakers with aphasia (who have a lowered ability to express themselves using speech and voice) and a control group. We find that hallucinations disproportionately occur for individuals who speak with longer shares of non-vocal durations -- a common symptom of aphasia. We call on industry practitioners to ameliorate these language-model-based hallucinations in Whisper, and to raise awareness of potential biases amplified by hallucinations in downstream applications of speech-to-text models.

8.3 · CL · Oct 1, 2025

Analyzing Dialectical Biases in LLMs for Knowledge and Reasoning Benchmarks

Eileen Pan, Anna Seo Gyeong Choi, Maartje ter Hoeve et al.

Large language models (LLMs) are ubiquitous in modern-day natural language processing. However, previous work has shown degraded LLM performance for under-represented English dialects. We analyze the effects of typifying "standard" American English language questions as non-"standard" dialectal variants on multiple choice question answering tasks and find up to a 20% reduction in accuracy. Additionally, we investigate the grammatical basis of under-performance in non-"standard" English questions. We find that individual grammatical rules have varied effects on performance, but some are more consequential than others: three specific grammar rules (existential "it", zero copula, and y'all) can explain the majority of performance degradation observed in multiple dialects. We call for future work to investigate bias mitigation methods focused on individual, high-impact grammatical structures.

7.2 · HC · May 26, 2025

Fairness-in-the-Workflow: How Machine Learning Practitioners at Big Tech Companies Approach Fairness in Recommender Systems

Jing Nathan Yan, Emma Harvey, Junxiong Wang et al.

Recommender systems (RS), which are widely deployed across high-stakes domains, are susceptible to biases that can cause large-scale societal impacts. Researchers have proposed methods to measure and mitigate such biases -- but translating academic theory into practice is inherently challenging. RS practitioners must balance the competing interests of diverse stakeholders, including providers and users, and operate in dynamic environments. Through a semi-structured interview study (N=11), we map the RS practitioner workflow within large technology companies, focusing on how technical teams consider fairness internally and in collaboration with other (legal, data, and fairness) teams. We identify key challenges to incorporating fairness into existing RS workflows: defining fairness in RS contexts, particularly when navigating multi-stakeholder and dynamic fairness considerations. We also identify key organization-wide challenges: making time for fairness work and facilitating cross-team communication. Finally, we offer actionable recommendations for the RS community, including HCI researchers and practitioners.

14.6 · LG · Jul 25, 2021 · Code

Federated Causal Inference in Heterogeneous Observational Data

Ruoxuan Xiong, Allison Koenecke, Michael Powell et al.

We are interested in estimating the effect of a treatment applied to individuals at multiple sites, where data is stored locally for each site. Due to privacy constraints, individual-level data cannot be shared across sites; the sites may also have heterogeneous populations and treatment assignment mechanisms. Motivated by these considerations, we develop federated methods to draw inference on the average treatment effects of combined data across sites. Our methods first compute summary statistics locally using propensity scores and then aggregate these statistics across sites to obtain point and variance estimators of average treatment effects. We show that these estimators are consistent and asymptotically normal. To achieve these asymptotic properties, we find that the aggregation schemes need to account for the heterogeneity in treatment assignments and in outcomes across sites. We demonstrate the validity of our federated methods through a comparative study of two large medical claims databases.
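The "summarize locally, then aggregate" pattern described above can be illustrated with a deliberately simple stand-in. Each site computes an inverse-propensity-weighted (IPW) estimate of the average treatment effect and its variance, and a coordinator combines them by inverse-variance weighting. This is a generic meta-analysis sketch, not the paper's estimators. It assumes known propensity scores and a common effect across sites, whereas the abstract stresses that the paper's aggregation schemes must account for heterogeneity. The function names `site_summary` and `aggregate` are hypothetical.

```python
def site_summary(outcomes, treatments, propensities):
    """Compute a local IPW estimate of the ATE and its estimated variance.

    Only these two numbers leave the site; individual-level rows stay local.
    """
    n = len(outcomes)
    # Per-individual IPW score: treated outcomes weighted up by 1/e,
    # control outcomes weighted by 1/(1-e) and subtracted.
    scores = [
        t * y / e - (1 - t) * y / (1 - e)
        for y, t, e in zip(outcomes, treatments, propensities)
    ]
    ate = sum(scores) / n
    # Variance of the mean of the scores (sample variance divided by n).
    var = sum((s - ate) ** 2 for s in scores) / (n - 1) / n
    return ate, var


def aggregate(summaries):
    """Combine per-site (ate, var) pairs by inverse-variance weighting."""
    weights = [1.0 / var for _, var in summaries]
    total = sum(weights)
    ate = sum(w * a for w, (a, _) in zip(weights, summaries)) / total
    return ate, 1.0 / total


# Two toy sites with made-up data and a known propensity of 0.5.
site_a = site_summary([3, 1, 5, 1], [1, 0, 1, 0], [0.5] * 4)
site_b = site_summary([4, 2, 4, 0], [1, 0, 1, 0], [0.5] * 4)
print(aggregate([site_a, site_b]))
```

Inverse-variance weighting gives more precise sites more influence, which is only appropriate when every site estimates the same underlying effect.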

4.3 · GN · Nov 2, 2020

Synthetic Data Generation for Economists

Allison Koenecke, Hal Varian

As more tech companies engage in rigorous economic analyses, we are confronted with a data problem: in-house papers cannot be replicated due to use of sensitive, proprietary, or private data. Readers are left to assume that the obscured true data (e.g., internal Google information) indeed produced the results given, or they must seek out comparable public-facing data (e.g., Google Trends) that yield similar results. One way to ameliorate this reproducibility issue is to have researchers release synthetic datasets based on their true data; this allows external parties to replicate an internal researcher's methodology. In this brief overview, we explore synthetic data generation at a high level for economic analyses.

0.7 · CL · Apr 15, 2019

Learning Twitter User Sentiments on Climate Change with Limited Labeled Data

Allison Koenecke, Jordi Feliu-Fabà

While it is well-documented that climate change accepters and deniers have become increasingly polarized in the United States over time, there has been no large-scale examination of whether these individuals are prone to changing their opinions as a result of natural external occurrences. On the sub-population of Twitter users, we examine whether climate change sentiment changes in response to five separate natural disasters occurring in the U.S. in 2018. We begin by showing that relevant tweets can be classified with over 75% accuracy as either accepting or denying climate change when using our methodology to compensate for limited labeled data; results are robust across several machine learning models and yield geographic-level results in line with prior research. We then apply RNNs to conduct a cohort-level analysis showing that the 2018 hurricanes yielded a statistically significant increase in average tweet sentiment affirming climate change. However, this effect does not hold for the 2018 blizzard and wildfires studied, implying that Twitter users' opinions on climate change are fairly ingrained on this subset of natural disasters.