Jeffrey Bigham

CL
h-index: 27
Papers: 12
Citations: 1,981
Novelty: 38%
AI Score: 40

12 Papers

CL · Jul 25, 2022
DialCrowd 2.0: A Quality-Focused Dialog System Crowdsourcing Toolkit

Jessica Huynh, Ting-Rui Chiang, Jeffrey Bigham et al.

Dialog system developers need high-quality data to train, fine-tune and assess their systems. They often use crowdsourcing for this since it provides large quantities of data from many workers. However, the data may not be of sufficiently good quality. This can be due to the way that the requester presents a task and how they interact with the workers. This paper introduces DialCrowd 2.0 to help requesters obtain higher quality data by, for example, presenting tasks more clearly and facilitating effective communication with workers. DialCrowd 2.0 guides developers in creating improved Human Intelligence Tasks (HITs) and is directly applicable to the workflows used currently by developers and researchers.

AI · Jul 29, 2024
Apple Intelligence Foundation Language Models

Tom Gunter, Zirui Wang, Chong Wang et al.

We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used to train the model, the training process, how the models are optimized for inference, and the evaluation results. We highlight our focus on Responsible AI and how the principles are applied throughout the model development.

CV · Sep 30, 2024
DreamStruct: Understanding Slides and User Interfaces via Synthetic Data Generation

Yi-Hao Peng, Faria Huq, Yue Jiang et al.

Enabling machines to understand structured visuals like slides and user interfaces is essential for making them accessible to people with disabilities. However, achieving such understanding computationally has required manual data collection and annotation, which is time-consuming and labor-intensive. To overcome this challenge, we present a method to generate synthetic, structured visuals with target labels using code generation. Our method allows people to create datasets with built-in labels and train models with a small number of human-annotated examples. We demonstrate performance improvements in three tasks for understanding slides and UIs: recognizing visual elements, describing visual content, and classifying visual content types.

LG · Jul 17, 2025
Apple Intelligence Foundation Language Models: Tech Report 2025

Ethan Li, Anders Boesen Lindbo Larsen, Chen Zhang et al. · apple-ml, cmu

We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: (i) a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and (ii) a scalable server model built on a novel Parallel-Track Mixture-of-Experts (PT-MoE) transformer that combines track parallelism, mixture-of-experts sparse computation, and interleaved global-local attention to deliver high quality with competitive cost on Apple's Private Cloud Compute platform. Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling, licensed corpora, and high-quality synthetic data, then further refined with supervised fine-tuning and reinforcement learning on a new asynchronous platform. The resulting models support several additional languages while understanding images and executing tool calls. In public benchmarks and human evaluations, both the server model and the on-device model match or surpass comparably sized open baselines. A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning, allowing developers to integrate these capabilities with a few lines of code. The latest advancements in Apple Intelligence models are grounded in our Responsible AI approach with safeguards like content filtering and locale-specific evaluation, as well as our commitment to protecting our users' privacy with innovations like Private Cloud Compute.

CV · Oct 21, 2025
VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety

Shruti Palaskar, Leon Gatys, Mona Abdelrahman et al.

Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal of genuinely harmful content. We present Vision Language Safety Understanding (VLSU), a comprehensive framework to systematically evaluate multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we construct a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures: while models achieve 90%-plus accuracy on clear unimodal safety signals, performance degrades substantially to 20-55% when joint image-text reasoning is required to determine the safety label. Most critically, 34% of errors in joint image-text safety classification occur despite correct classification of the individual modalities, further demonstrating absent compositional reasoning capabilities. Additionally, we find that models struggle to balance refusing unsafe content while still responding to borderline cases that deserve engagement. For example, we find that instruction framing can reduce the over-blocking rate on borderline content from 62.4% to 10.4% in Gemini-1.5, but only at the cost of under-refusing on unsafe content with refusal rate dropping from 90.8% to 53.9%. Overall, our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, and provides a critical test bed to enable the next milestones in research on robust vision-language safety.

CL · Nov 9, 2021
A Survey of NLP-Related Crowdsourcing HITs: what works and what does not

Jessica Huynh, Jeffrey Bigham, Maxine Eskenazi

Crowdsourcing requesters on Amazon Mechanical Turk (AMT) have raised questions about the reliability of the workers. The AMT workforce is very diverse and it is not possible to make blanket assumptions about them as a group. Some requesters now reject work en masse when they do not get the results they expect. This has the effect of giving each worker (good or bad) a lower Human Intelligence Task (HIT) approval score, which is unfair to the good workers. It also has the effect of giving the requester a bad reputation on the workers' forums. Some of the issues causing the mass rejections stem from the requesters not taking the time to create a well-formed task with complete instructions and/or not paying a fair wage. To explore this assumption, this paper describes a study that looks at the crowdsourcing HITs on AMT that were available over a given span of time and records information about those HITs. This study also records information from a crowdsourcing forum on the worker perspective on both those HITs and on their corresponding requesters. Results reveal issues in worker payment and presentation issues such as missing instructions or HITs that are not doable.

CL · Sep 10, 2021
Does Pretraining for Summarization Require Knowledge Transfer?

Kundan Krishna, Jeffrey Bigham, Zachary C. Lipton

Pretraining techniques leveraging enormous datasets have driven recent advances in text summarization. While folk explanations suggest that knowledge transfer accounts for pretraining's benefits, little is known about why it works or what makes a pretraining task or dataset suitable. In this paper, we challenge the knowledge transfer story, showing that, by pretraining on documents consisting of character n-grams selected at random, we can nearly match the performance of models pretrained on real corpora. This work holds the promise of eliminating upstream corpora, which may alleviate some concerns over offensive language, bias, and copyright issues. To see whether the small residual benefit of using real data could be accounted for by the structure of the pretraining task, we design several tasks motivated by a qualitative study of summarization corpora. However, these tasks confer no appreciable benefit, leaving open the possibility of a small role for knowledge transfer.

HC · Dec 30, 2020
The Challenges of Crowd Workers in Rural and Urban America

Claudia Flores-Saviaga, Yuwen Li, Benjamin V. Hanrahan et al.

Crowd work has the potential of helping the financial recovery of regions traditionally plagued by a lack of economic opportunities, e.g., rural areas. However, we currently have limited information about the challenges facing crowd workers from rural and super rural areas as they struggle to make a living through crowd work sites. This paper examines the challenges and advantages of rural and super rural Amazon Mechanical Turk (MTurk) crowd workers and contrasts them with those of workers from urban areas. Based on a survey of 421 crowd workers from differing geographic regions in the U.S., we identified how across regions, people struggled with being onboarded into crowd work. We uncovered that despite the inequalities and barriers, rural workers tended to be striving more in micro-tasking than their urban counterparts. We also identified cultural traits, relating to time dimension and individualism, that offer us an insight into crowd workers and the necessary qualities for them to succeed on gig platforms. We finish by providing design implications based on our findings to create more inclusive crowd work platforms and tools.

LG · Dec 4, 2020
Challenging common interpretability assumptions in feature attribution explanations

Jonathan Dinu, Jeffrey Bigham, J. Zico Kolter

As machine learning and algorithmic decision making systems are increasingly being leveraged in high-stakes human-in-the-loop settings, there is a pressing need to understand the rationale of their predictions. Researchers have responded to this need with explainable AI (XAI), but often proclaim interpretability axiomatically without evaluation. When these systems are evaluated, they are often tested through offline simulations with proxy metrics of interpretability (such as model complexity). We empirically evaluate the veracity of three common interpretability assumptions through a large scale human-subjects experiment with a simple "placebo explanation" control. We find that feature attribution explanations provide marginal utility in our task for a human decision maker and in certain cases result in worse decisions due to cognitive and contextual confounders. This result challenges the assumed universal benefit of applying these methods and we hope this work will underscore the importance of human evaluation in XAI research. Supplemental materials -- including anonymized data from the experiment, code to replicate the study, an interactive demo of the experiment, and the models used in the analysis -- can be found at: https://doi.pizza/challenging-xai.

HC · May 8, 2020
Becoming the Super Turker: Increasing Wages via a Strategy from High Earning Workers

Saiph Savage, Chun-Wei Chiang, Susumu Saito et al.

Crowd markets have traditionally limited workers by not providing transparency information concerning which tasks pay fairly or which requesters are unreliable. Researchers believe that a key reason why crowd workers earn low wages is due to this lack of transparency. As a result, tools have been developed to provide more transparency within crowd markets to help workers. However, while most workers use these tools, they still earn less than minimum wage. We argue that the missing element is guidance on how to use transparency information. In this paper, we explore how novice workers can improve their earnings by following the transparency criteria of Super Turkers, i.e., crowd workers who earn higher salaries on Amazon Mechanical Turk (MTurk). We believe that Super Turkers have developed effective processes for using transparency information. Therefore, by having novices follow a Super Turker criteria (one that is simple and popular among Super Turkers), we can help novices increase their wages. For this purpose, we: (i) conducted a survey and data analysis to computationally identify a simple yet common criteria that Super Turkers use for handling transparency tools; (ii) deployed a two-week field experiment with novices who followed this Super Turker criteria to find better work on MTurk. Novices in our study viewed over 25,000 tasks by 1,394 requesters. We found that novices who utilized this Super Turkers' criteria earned better wages than other novices. Our results highlight that tool development to support crowd workers should be paired with educational opportunities that teach workers how to effectively use the tools and their related metrics (e.g., transparency values). We finish with design recommendations for empowering crowd workers to earn higher salaries.

HC · Mar 17, 2019
TurkScanner: Predicting the Hourly Wage of Microtasks

Susumu Saito, Chun-Wei Chiang, Saiph Savage et al.

Workers in crowd markets struggle to earn a living. One reason for this is that it is difficult for workers to accurately gauge the hourly wages of microtasks, and they consequently end up performing labor with little pay. In general, workers are provided with little information about tasks, and are left to rely on noisy signals, such as textual description of the task or rating of the requester. This study explores various computational methods for predicting the working times (and thus hourly wages) required for tasks based on data collected from other workers completing crowd work. We provide the following contributions. (i) A data collection method for gathering real-world training data on crowd-work tasks and the times required for workers to complete them; (ii) TurkScanner: a machine learning approach that predicts the necessary working time to complete a task (and can thus implicitly provide the expected hourly wage). We collected 9,155 data records using a web browser extension installed by 84 Amazon Mechanical Turk workers, and explored the challenge of accurately recording working times both automatically and by asking workers. TurkScanner was created using ~150 derived features, and was able to predict the hourly wages of 69.6% of all the tested microtasks within a 75% error. Directions for future research include observing the effects of tools on people's working practices, adapting this approach to a requester tool for better price setting, and predicting other elements of work (e.g., the acceptance likelihood and worker task preferences).
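
The abstract above notes that a predicted working time implicitly gives an expected hourly wage. The arithmetic can be sketched as follows. This is only an illustration, not the TurkScanner model: the function names are hypothetical, and reading "within a 75% error" as a relative error against the actual wage is an assumption, since the abstract does not define the metric.

```python
def hourly_wage(reward_usd: float, working_seconds: float) -> float:
    """Convert a task reward and a working time into dollars per hour."""
    if working_seconds <= 0:
        raise ValueError("working_seconds must be positive")
    return reward_usd * 3600.0 / working_seconds


def within_relative_error(predicted: float, actual: float,
                          tolerance: float = 0.75) -> bool:
    """Assumed reading of 'within a 75% error': the absolute difference
    is at most `tolerance` times the actual value."""
    return abs(predicted - actual) <= tolerance * abs(actual)


if __name__ == "__main__":
    # Hypothetical task: $0.50 reward, 300 s actual time, 400 s predicted time.
    actual = hourly_wage(0.50, 300)
    predicted = hourly_wage(0.50, 400)
    print(actual, predicted, within_relative_error(predicted, actual))
```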

CY · Dec 14, 2017
A Data-Driven Analysis of Workers' Earnings on Amazon Mechanical Turk

Kotaro Hara, Abi Adams, Kristy Milland et al.

A growing number of people are working as part of on-line crowd work, which has been characterized by its low wages; yet, we know little about wage distribution and causes of low/high earnings. We recorded 2,676 workers performing 3.8 million tasks on Amazon Mechanical Turk. Our task-level analysis revealed that workers earned a median hourly wage of only ~$2/h, and only 4% earned more than $7.25/h. The average requester pays more than $11/h, although lower-paying requesters post much more work. Our wage calculations are influenced by how unpaid work is included in our wage calculations, e.g., time spent searching for tasks, working on tasks that are rejected, and working on tasks that are ultimately not submitted. We further explore the characteristics of tasks and working patterns that yield higher hourly wages. Our analysis informs future platform design and worker tools to create a more positive future for crowd work.
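
The abstract above states that the wage figures depend on how unpaid work is counted. A minimal sketch of that effect follows. The record layout, the function name, and the sample numbers are hypothetical and chosen only for illustration; they are not the paper's data or method.

```python
from statistics import median


def hourly_wages(records, include_unpaid=True):
    """Compute dollars per hour for each record.

    Each record is (earned_usd, paid_seconds, unpaid_seconds), where
    unpaid_seconds stands for time such as searching for tasks or work
    that was rejected or never submitted (hypothetical layout).
    """
    wages = []
    for earned_usd, paid_seconds, unpaid_seconds in records:
        seconds = paid_seconds + (unpaid_seconds if include_unpaid else 0)
        if seconds <= 0:
            raise ValueError("total time must be positive")
        wages.append(earned_usd * 3600.0 / seconds)
    return wages


if __name__ == "__main__":
    # Made-up sample records, not data from the study.
    sample = [(1.0, 600, 600), (2.0, 1200, 0), (0.5, 900, 300)]
    print(median(hourly_wages(sample, include_unpaid=True)))
    print(median(hourly_wages(sample, include_unpaid=False)))
```

Counting unpaid time can only lower (or leave unchanged) each wage, so the median with unpaid time included is never higher than the median without it.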