CLNov 16, 2023Code
MOKA: Moral Knowledge Augmentation for Moral Event ExtractionXinliang Frederick Zhang, Winston Wu, Nick Beauchamp et al.
News media often strive to minimize explicit moral language in news articles, yet most articles are dense with moral values as expressed through the reported events themselves. However, values that are reflected in the intricate dynamics among participating entities and moral events are far more challenging for most NLP systems to detect, including LLMs. To study this phenomenon, we annotate a new dataset, MORAL EVENTS, consisting of 5,494 structured event annotations on 474 news articles by diverse US media across the political spectrum. We further propose MOKA, a moral event extraction framework with MOral Knowledge Augmentation, which leverages knowledge derived from moral words and moral scenarios to produce structural representations of morality-bearing events. Experiments show that MOKA outperforms competitive baselines across three moral event understanding tasks. Further analysis shows even ostensibly nonpartisan media engage in the selective reporting of moral events. Our data and codebase are available at https://github.com/launchnlp/MOKA.
CLOct 28, 2023Code
Crossing the Aisle: Unveiling Partisan and Counter-Partisan Events in News ReportingKaijian Zou, Xinliang Frederick Zhang, Winston Wu et al.
News media is expected to uphold unbiased reporting. Yet they may still affect public opinion by selectively including or omitting events that support or contradict their ideological positions. Prior work in NLP has only studied media bias via linguistic style and word usage. In this paper, we study to which degree media balances news reporting and affects consumers through event inclusion or omission. We first introduce the task of detecting both partisan and counter-partisan events: events that support or oppose the author's political ideology. To conduct our study, we annotate a high-quality dataset, PAC, containing 8,511 (counter-)partisan event annotations in 304 news articles from ideologically diverse media outlets. We benchmark PAC to highlight the challenges of this task. Our findings highlight both the ways in which the news subtly shapes opinion and the need for large language models that better understand events within a broader context. Our dataset can be found at https://github.com/launchnlp/Partisan-Event-Dataset.
CLNov 4, 2022
Late Fusion with Triplet Margin Objective for Multimodal Ideology Prediction and AnalysisChangyuan Qiu, Winston Wu, Xinliang Frederick Zhang et al.
Prior work on ideology prediction has largely focused on single modalities, i.e., text or images. In this work, we introduce the task of multimodal ideology prediction, where a model predicts binary or five-point scale ideological leanings, given a text-image pair with political content. We first collect five new large-scale datasets with English documents and images along with their ideological leanings, covering news articles from a wide range of US mainstream media and social media posts from Reddit and Twitter. We conduct in-depth analyses of news articles and reveal differences in image content and usage across the political spectrum. Furthermore, we perform extensive experiments and ablation studies, demonstrating the effectiveness of targeted pretraining objectives on different model components. Our best-performing model, a late-fusion architecture pretrained with a triplet objective over multimodal content, outperforms the state-of-the-art text-only model by almost 4% and a strong multimodal baseline with no pretraining by over 3%.
CLOct 28, 2025
Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and CulturesTyler A. Chang, Catherine Arnett, Abdelrahman Eldesokey et al. · uw
To date, there exist almost no culturally-specific evaluation benchmarks for large language models (LLMs) that cover a large number of languages and cultures. In this paper, we present Global PIQA, a participatory commonsense reasoning benchmark for over 100 languages, constructed by hand by 335 researchers from 65 countries around the world. The 116 language varieties in Global PIQA cover five continents, 14 language families, and 23 writing systems. In the non-parallel split of Global PIQA, over 50% of examples reference local foods, customs, traditions, or other culturally-specific elements. We find that state-of-the-art LLMs perform well on Global PIQA in aggregate, but they exhibit weaker performance in lower-resource languages (up to a 37% accuracy gap, despite random chance at 50%). Open models generally perform worse than proprietary models. Global PIQA highlights that in many languages and cultures, everyday knowledge remains an area for improvement, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge. Beyond its uses for LLM evaluation, we hope that Global PIQA provides a glimpse into the wide diversity of cultures in which human language is embedded.
CLMay 24, 2023
You Are What You Annotate: Towards Better Models through Annotator RepresentationsNaihao Deng, Xinliang Frederick Zhang, Siyang Liu et al.
Annotator disagreement is ubiquitous in natural language processing (NLP) tasks. There are multiple reasons for such disagreements, including the subjectivity of the task, difficult cases, unclear guidelines, and so on. Rather than simply aggregating labels to obtain data annotations, we instead try to directly model the diverse perspectives of the annotators, and explicitly account for annotators' idiosyncrasies in the modeling process by creating representations for each annotator (annotator embeddings) and also their annotations (annotation embeddings). In addition, we propose TID-8, The Inherent Disagreement - 8 dataset, a benchmark that consists of eight existing language understanding datasets that have inherent annotator disagreement. We test our approach on TID-8 and show that our approach helps models learn significantly better from disagreements on six different datasets in TID-8 while increasing model size by fewer than 1% parameters. By capturing the unique tendencies and subjectivity of individual annotators through embeddings, our representations prime AI models to be inclusive of diverse viewpoints.
HCMay 23, 2023
EASE: An Easily-Customized Annotation System Powered by Efficiency Enhancement MechanismsNaihao Deng, Yikai Liu, Mingye Chen et al.
The performance of current supervised AI systems is tightly connected to the availability of annotated datasets. Annotations are usually collected through annotation tools, which are often designed for specific tasks and are difficult to customize. Moreover, existing annotation tools with an active learning mechanism often only support limited use cases. To address these limitations, we present EASE, an Easily-Customized Annotation System Powered by Efficiency Enhancement Mechanisms. \sysname provides modular annotation units for building customized annotation interfaces and also provides multiple back-end options that suggest annotations using (1) multi-task active learning; (2) demographic feature based active learning; (3) a prompt system that can query the API of large language models. We conduct multiple experiments and user studies to evaluate our system's flexibility and effectiveness. Our results show that our system can meet the diverse needs of NLP researchers and significantly accelerate the annotation process.
CLMay 21, 2023
Has It All Been Solved? Open NLP Research Questions Not Solved by Large Language ModelsOana Ignat, Zhijing Jin, Artem Abzaliev et al.
Recent progress in large language models (LLMs) has enabled the deployment of many generative NLP applications. At the same time, it has also led to a misleading public discourse that ``it's all been solved.'' Not surprisingly, this has, in turn, made many NLP researchers -- especially those at the beginning of their careers -- worry about what NLP research area they should focus on. Has it all been solved, or what remaining questions can we work on regardless of LLMs? To address this question, this paper compiles NLP research directions rich for exploration. We identify fourteen different research areas encompassing 45 research directions that require new research and are not directly solvable by LLMs. While we identify many research areas, many others exist; we do not cover areas currently addressed by LLMs, but where LLMs lag behind in performance or those focused on LLM development. We welcome suggestions for other research directions to include: https://bit.ly/nlp-era-llm
CLMay 1, 2020
Evaluating Neural Machine Comprehension Model Robustness to Noisy Inputs and Adversarial AttacksWinston Wu, Dustin Arendt, Svitlana Volkova
We evaluate machine comprehension models' robustness to noise and adversarial attacks by performing novel perturbations at the character, word, and sentence level. We experiment with different amounts of perturbations to examine model confidence and misclassification rate, and contrast model performance in adversarial training with different embedding types on two benchmark datasets. We demonstrate improving model performance with ensembling. Finally, we analyze factors that effect model behavior under adversarial training and develop a model to predict model errors during adversarial attacks.
CLOct 3, 2019
Modeling Color Terminology Across Thousands of LanguagesArya D. McCarthy, Winston Wu, Aaron Mueller et al.
There is an extensive history of scholarship into what constitutes a "basic" color term, as well as a broadly attested acquisition sequence of basic color terms across many languages, as articulated in the seminal work of Berlin and Kay (1969). This paper employs a set of diverse measures on massively cross-linguistic data to operationalize and critique the Berlin and Kay color term hypotheses. Collectively, the 14 empirically-grounded computational linguistic metrics we design---as well as their aggregation---correlate strongly with both the Berlin and Kay basic/secondary color term partition (gamma=0.96) and their hypothesized universal acquisition sequence. The measures and result provide further empirical evidence from computational linguistics in support of their claims, as well as additional nuance: they suggest treating the partition as a spectrum instead of a dichotomy.