Mark Díaz

CL
h-index18
14papers
3,123citations
Novelty38%
AI Score45

14 Papers

CLNov 9, 2023
GRASP: A Disagreement Analysis Framework to Assess Group Associations in Perspectives

Vinodkumar Prabhakaran, Christopher Homan, Lora Aroyo et al.

Human annotation plays a core role in machine learning -- annotations for supervised models, safety guardrails for generative models, and human feedback for reinforcement learning, to cite a few avenues. However, the fact that many of these human annotations are inherently subjective is often overlooked. Recent work has demonstrated that ignoring rater subjectivity (typically resulting in rater disagreement) is problematic within specific tasks and for specific subgroups. Generalizable methods to harness rater disagreement and thus understand the socio-cultural leanings of subjective tasks remain elusive. In this paper, we propose GRASP, a comprehensive disagreement analysis framework to measure group association in perspectives among different rater sub-groups, and demonstrate its utility in assessing the extent of systematic disagreements in two datasets: (1) safety annotations of human-chatbot conversations, and (2) offensiveness annotations of social media posts, both annotated by diverse rater pools across different socio-demographic axes. Our framework (based on disagreement metrics) reveals specific rater groups that have significantly different perspectives than others on certain tasks, and helps identify demographic axes that are crucial to consider in specific task contexts.

LGNov 28, 2023
SoUnD Framework: Analyzing (So)cial Representation in (Un)structured (D)ata

Mark Díaz, Sunipa Dev, Emily Reif et al.

The unstructured nature of data used in foundation model development is a challenge to systematic analyses for making data use and documentation decisions. From a Responsible AI perspective, these decisions often rely upon understanding how people are represented in data. We propose a framework designed to guide analysis of human representation in unstructured data and identify downstream risks. We apply the framework in two toy examples using the Common Crawl web text corpus (C4) and LAION-400M. We also propose a set of hypothetical action steps in service of dataset use, development, and documentation.

LGFeb 8, 2023
(Re)Defining Expertise in Machine Learning Development

Mark Díaz, Angela D. R. Smith

Domain experts are often engaged in the development of machine learning systems in a variety of ways, such as in data collection and evaluation of system performance. At the same time, who counts as an 'expert' and what constitutes 'expertise' is not always explicitly defined. In this project, we conduct a systematic literature review of machine learning research to understand 1) the bases on which expertise is defined and recognized and 2) the roles experts play in ML development. Our goal is to produce a high-level taxonomy to highlight limits and opportunities in how experts are identified and engaged in ML research.

AIDec 19, 2025
Humanlike AI Design Increases Anthropomorphism but Yields Divergent Outcomes on Engagement and Trust Globally

Robin Schimmelpfennig, Mark Díaz, Vinodkumar Prabhakaran et al.

Over a billion users globally interact with AI systems engineered to mimic human traits. This development raises concerns that anthropomorphism, the attribution of human characteristics to AI, may foster over-reliance and misplaced trust. Yet, causal effects of humanlike AI design on users remain untested in ecologically valid, cross-cultural settings, leaving policy discussions to rely on theoretical assumptions derived largely from Western populations. Here we conducted two experiments (N=3,500) across ten countries representing a wide cultural spectrum, involving real-time, open-ended interactions with a state-of-the-art chatbot. We found users evaluate human-likeness based on pragmatic interactional cues (conversation flow, response speed, perspective-taking) rather than abstract theory-driven attributes emphasized in academic discourse (e.g., sentience, consciousness). Furthermore, while experimentally increasing chatbot's human-likeness reliably increased anthropomorphism across all sampled countries, it did not universally increase trust or engagement. Instead, effects were culturally contingent; design choices fostering engagement or trust in one country may reduce them in another. These findings challenge prevailing assumptions that humanlike AI poses uniform psychological risks and necessarily increases trust. Risk is not inherent to humanlike design but emerges from its interplay with cultural context. Consequently, governance frameworks must move beyond universalist approaches to account for this global heterogeneity.

CYDec 11, 2023
Disentangling Perceptions of Offensiveness: Cultural and Moral Correlates

Aida Davani, Mark Díaz, Dylan Baker et al.

Perception of offensiveness is inherently subjective, shaped by the lived experiences and socio-cultural values of the perceivers. Recent years have seen substantial efforts to build AI-based tools that can detect offensive language at scale, as a means to moderate social media platforms, and to ensure safety of conversational AI technologies such as ChatGPT and Bard. However, existing approaches treat this task as a technical endeavor, built on top of data annotated for offensiveness by a global crowd workforce without any attention to the crowd workers' provenance or the values their perceptions reflect. We argue that cultural and psychological factors play a vital role in the cognitive processing of offensiveness, which is critical to consider in this context. We re-frame the task of determining offensiveness as essentially a matter of moral judgment -- deciding the boundaries of ethically wrong vs. right language within an implied set of socio-cultural norms. Through a large-scale cross-cultural study based on 4309 participants from 21 countries across 8 cultural regions, we demonstrate substantial cross-cultural differences in perceptions of offensiveness. More importantly, we find that individual moral values play a crucial role in shaping these variations: moral concerns about Care and Purity are significant mediating factors driving cross-cultural differences. These insights are of crucial importance as we build AI models for the pluralistic world, where the values they espouse should aim to respect and account for moral values in diverse geo-cultural contexts.

CLApr 16, 2024
D3CODE: Disentangling Disagreements in Data across Cultures on Offensiveness Detection and Evaluation

Aida Mostafazadeh Davani, Mark Díaz, Dylan Baker et al.

While human annotations play a crucial role in language technologies, annotator subjectivity has long been overlooked in data collection. Recent studies that have critically examined this issue are often situated in the Western context, and solely document differences across age, gender, or racial groups. As a result, NLP research on subjectivity have overlooked the fact that individuals within demographic groups may hold diverse values, which can influence their perceptions beyond their group norms. To effectively incorporate these considerations into NLP pipelines, we need datasets with extensive parallel annotations from various social and cultural groups. In this paper we introduce the \dataset dataset: a large-scale cross-cultural dataset of parallel annotations for offensive language in over 4.5K sentences annotated by a pool of over 4k annotators, balanced across gender and age, from across 21 countries, representing eight geo-cultural regions. The dataset contains annotators' moral values captured along six moral foundations: care, equality, proportionality, authority, loyalty, and purity. Our analyses reveal substantial regional variations in annotators' perceptions that are shaped by individual moral values, offering crucial insights for building pluralistic, culturally sensitive NLP models.

AIFeb 9, 2024
Discipline and Label: A WEIRD Genealogy and Social Theory of Data Annotation

Andrew Smart, Ding Wang, Ellis Monk et al.

Data annotation remains the sine qua non of machine learning and AI. Recent empirical work on data annotation has begun to highlight the importance of rater diversity for fairness, model performance, and new lines of research have begun to examine the working conditions for data annotation workers, the impacts and role of annotator subjectivity on labels, and the potential psychological harms from aspects of annotation work. This paper outlines a critical genealogy of data annotation; starting with its psychological and perceptual aspects. We draw on similarities with critiques of the rise of computerized lab-based psychological experiments in the 1970's which question whether these experiments permit the generalization of results beyond the laboratory settings within which these results are typically obtained. Do data annotations permit the generalization of results beyond the settings, or locations, in which they were obtained? Psychology is overly reliant on participants from Western, Educated, Industrialized, Rich, and Democratic societies (WEIRD). Many of the people who work as data annotation platform workers, however, are not from WEIRD countries; most data annotation workers are based in Global South countries. Social categorizations and classifications from WEIRD countries are imposed on non-WEIRD annotators through instructions and tasks, and through them, on data, which is then used to train or evaluate AI models in WEIRD countries. We synthesize evidence from several recent lines of research and argue that data annotation is a form of automated social categorization that risks entrenching outdated and static social categories that are in reality dynamic and changing. We propose a framework for understanding the interplay of the global social conditions of data annotation with the subjective phenomenological experience of data annotation work.

LGOct 31, 2024
What Makes An Expert? Reviewing How ML Researchers Define "Expert"

Mark Díaz, Angela DR Smith

Human experts are often engaged in the development of machine learning systems to collect and validate data, consult on algorithm development, and evaluate system performance. At the same time, who counts as an 'expert' and what constitutes 'expertise' is not always explicitly defined. In this work, we review 112 academic publications that explicitly reference 'expert' and 'expertise' and that describe the development of machine learning (ML) systems to survey how expertise is characterized and the role experts play. We find that expertise is often undefined and forms of knowledge outside of formal education and professional certification are rarely sought, which has implications for the kinds of knowledge that are recognized and legitimized in ML development. Moreover, we find that expert knowledge tends to be utilized in ways focused on mining textbook knowledge, such as through data annotation. We discuss the ways experts are engaged in ML development in relation to deskilling, the social construction of expertise, and implications for responsible AI development. We point to a need for reflection and specificity in justifications of domain expert engagement, both as a matter of documentation and reproducibility, as well as a matter of broadening the range of recognized expertise.

LGJul 15, 2025
Whose View of Safety? A Deep DIVE Dataset for Pluralistic Alignment of Text-to-Image Models

Charvi Rastogi, Tian Huey Teh, Pushkar Mishra et al.

Current text-to-image (T2I) models often fail to account for diverse human experiences, leading to misaligned systems. We advocate for pluralistic alignment, where an AI understands and is steerable towards diverse, and often conflicting, human values. Our work provides three core contributions to achieve this in T2I models. First, we introduce a novel dataset for Diverse Intersectional Visual Evaluation (DIVE) -- the first multimodal dataset for pluralistic alignment. It enable deep alignment to diverse safety perspectives through a large pool of demographically intersectional human raters who provided extensive feedback across 1000 prompts, with high replication, capturing nuanced safety perceptions. Second, we empirically confirm demographics as a crucial proxy for diverse viewpoints in this domain, revealing significant, context-dependent differences in harm perception that diverge from conventional evaluations. Finally, we discuss implications for building aligned T2I models, including efficient data collection strategies, LLM judgment capabilities, and model steerability towards diverse perspectives. This research offers foundational tools for more equitable and aligned T2I systems. Content Warning: The paper includes sensitive content that may be harmful.

HCJul 21, 2025
"Just a strange pic": Evaluating 'safety' in GenAI Image safety annotation tasks from diverse annotators' perspectives

Ding Wang, Mark Díaz, Charvi Rastogi et al.

Understanding what constitutes safety in AI-generated content is complex. While developers often rely on predefined taxonomies, real-world safety judgments also involve personal, social, and cultural perceptions of harm. This paper examines how annotators evaluate the safety of AI-generated images, focusing on the qualitative reasoning behind their judgments. Analyzing 5,372 open-ended comments, we find that annotators consistently invoke moral, emotional, and contextual reasoning that extends beyond structured safety categories. Many reflect on potential harm to others more than to themselves, grounding their judgments in lived experience, collective risk, and sociocultural awareness. Beyond individual perceptions, we also find that the structure of the task itself -- including annotation guidelines -- shapes how annotators interpret and express harm. Guidelines influence not only which images are flagged, but also the moral judgment behind the justifications. Annotators frequently cite factors such as image quality, visual distortion, and mismatches between prompt and output as contributing to perceived harm dimensions, which are often overlooked in standard evaluation frameworks. Our findings reveal that existing safety pipelines miss critical forms of reasoning that annotators bring to the task. We argue for evaluation designs that scaffold moral reflection, differentiate types of harm, and make space for subjective, context-sensitive interpretations of AI-generated content.

CLMay 17, 2023
PaLM 2 Technical Report

Rohan Anil, Andrew M. Dai, Orhan Firat et al.

We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction. PaLM 2 demonstrates robust reasoning capabilities exemplified by large improvements over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable performance on a suite of responsible AI evaluations, and enables inference-time control over toxicity without additional overhead or impact on other capabilities. Overall, PaLM 2 achieves state-of-the-art performance across a diverse set of tasks and capabilities. When discussing the PaLM 2 family, it is important to distinguish between pre-trained models (of various sizes), fine-tuned variants of these models, and the user-facing products that use these models. In particular, user-facing products typically include additional pre- and post-processing steps. Additionally, the underlying models may evolve over time. Therefore, one should not expect the performance of user-facing products to exactly match the results reported in this report.

LGDec 8, 2021
Whose Ground Truth? Accounting for Individual and Collective Identities Underlying Dataset Annotation

Remi Denton, Mark Díaz, Ian Kivlichan et al.

Human annotations play a crucial role in machine learning (ML) research and development. However, the ethical considerations around the processes and decisions that go into building ML datasets has not received nearly enough attention. In this paper, we survey an array of literature that provides insights into ethical considerations around crowdsourced dataset annotation. We synthesize these insights, and lay out the challenges in this space along two layers: (1) who the annotator is, and how the annotators' lived experiences can impact their annotations, and (2) the relationship between the annotators and the crowdsourcing platforms and what that relationship affords them. Finally, we put forth a concrete set of recommendations and considerations for dataset developers at various stages of the ML data pipeline: task formulation, selection of annotators, platform and infrastructure choices, dataset analysis and evaluation, and dataset documentation and release.

CLOct 12, 2021
Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations

Aida Mostafazadeh Davani, Mark Díaz, Vinodkumar Prabhakaran

Majority voting and averaging are common approaches employed to resolve annotator disagreements and derive single ground truth labels from multiple annotations. However, annotators may systematically disagree with one another, often reflecting their individual biases and values, especially in the case of subjective tasks such as detecting affect, aggression, and hate speech. Annotator disagreements may capture important nuances in such tasks that are often ignored while aggregating annotations to a single ground truth. In order to address this, we investigate the efficacy of multi-annotator models. In particular, our multi-task based approach treats predicting each annotators' judgements as separate subtasks, while sharing a common learned representation of the task. We show that this approach yields same or better performance than aggregating labels in the data prior to training across seven different binary classification tasks. Our approach also provides a way to estimate uncertainty in predictions, which we demonstrate better correlate with annotation disagreements than traditional methods. Being able to model uncertainty is especially useful in deployment scenarios where knowing when not to make a prediction is important.

CLOct 12, 2021
On Releasing Annotator-Level Labels and Information in Datasets

Vinodkumar Prabhakaran, Aida Mostafazadeh Davani, Mark Díaz

A common practice in building NLP datasets, especially using crowd-sourced annotations, involves obtaining multiple annotator judgements on the same data instances, which are then flattened to produce a single "ground truth" label or score, through majority voting, averaging, or adjudication. While these approaches may be appropriate in certain annotation tasks, such aggregations overlook the socially constructed nature of human perceptions that annotations for relatively more subjective tasks are meant to capture. In particular, systematic disagreements between annotators owing to their socio-cultural backgrounds and/or lived experiences are often obfuscated through such aggregations. In this paper, we empirically demonstrate that label aggregation may introduce representational biases of individual and group perspectives. Based on this finding, we propose a set of recommendations for increased utility and transparency of datasets for downstream use cases.