CLDec 13, 2022
On Text-based Personality Computing: Challenges and Future DirectionsQixiang Fang, Anastasia Giachanou, Ayoub Bagheri et al.
Text-based personality computing (TPC) has gained many research interests in NLP. In this paper, we describe 15 challenges that we consider deserving the attention of the research community. These challenges are organized by the following topics: personality taxonomies, measurement quality, datasets, performance evaluation, modelling choices, as well as ethics and fairness. When addressing each challenge, not only do we combine perspectives from both NLP and social sciences, but also offer concrete suggestions. We hope to inspire more valid and reliable TPC research.
CLFeb 27, 2023
Epicurus at SemEval-2023 Task 4: Improving Prediction of Human Values behind Arguments by Leveraging Their DefinitionsChristian Fang, Qixiang Fang, Dong Nguyen
We describe our experiments for SemEval-2023 Task 4 on the identification of human values behind arguments (ValueEval). Because human values are subjective concepts which require precise definitions, we hypothesize that incorporating the definitions of human values (in the form of annotation instructions and validated survey items) during model training can yield better prediction performance. We explore this idea and show that our proposed models perform better than the challenge organizers' baselines, with improvements in macro F1 scores of up to 18%.
CLDec 13, 2022
Improving Stance Detection by Leveraging Measurement Knowledge from Social Sciences: A Case Study of Dutch Political Tweets and Traditional Gender Role DivisionQixiang Fang, Anastasia Giachanou, Ayoub Bagheri
Stance detection concerns automatically determining the viewpoint (i.e., in favour of, against, or neutral) of a text's author towards a target. Stance detection has been applied to many research topics, among which the detection of stances behind political tweets is an important one. In this paper, we apply stance detection to a dataset of tweets from official party accounts in the Netherlands between 2017 and 2021, with a focus on stances towards traditional gender role division, a dividing issue between (some) Dutch political parties. To implement and improve stance detection of traditional gender role division, we propose to leverage an established survey instrument from social sciences, which has been validated for the purpose of measuring attitudes towards traditional gender role division. Based on our experiments, we show that using such a validated survey instrument helps to improve stance detection performance.
76.3CYMar 21
A Methodological Guide on Using Large Language Models for Text Annotation in the Social Sciences and Humanities with Python and RQixiang Fang, Javier Garcia Bernardo, Erik-Jan van Kesteren
Large language models (LLMs) have become an essential tool for social science and humanities (SSH) researchers who work with text. One particularly valuable application is automating text annotation, a traditionally time-consuming step in preparing data for empirical analysis. Yet many SSH researchers face two challenges: getting started with LLMs and understanding how to address their limitations. Practically, the rapid pace of model development can make LLMs seem inaccessible or intimidating, while even experienced users may overlook how annotation errors can bias downstream statistical analyses (e.g., regression estimates and $p$-values), even when annotation accuracy appears high. This paper provides a comprehensive, step-by-step methodological guide for using LLMs for text annotation in SSH research, with clear Python and R code snippets. We cover (1) how LLMs work and what they can and cannot do; (2) how to identify an LLM-suitable research project and establish minimum data and computational requirements; (3) how to design prompts and run annotation tasks; (4) how to evaluate annotation quality and iteratively refine prompts without overfitting; (5) how to integrate LLM annotations into downstream statistical analyses while accounting for annotation error; and (6) how to manage cost, efficiency, and reproducibility when scaling up annotation. Throughout, we provide intuitive methodological reasoning, concrete examples, code snippets, and best-practice guidance to help researchers confidently and transparently incorporate LLM-based annotation into their scientific workflows.
IRJun 22, 2020Code
Open Source Software for Efficient and Transparent ReviewsRens van de Schoot, Jonathan de Bruin, Raoul Schram et al.
To help researchers conduct a systematic review or meta-analysis as efficiently and transparently as possible, we designed a tool (ASReview) to accelerate the step of screening titles and abstracts. For many tasks - including but not limited to systematic reviews and meta-analyses - the scientific literature needs to be checked systematically. Currently, scholars and practitioners screen thousands of studies by hand to determine which studies to include in their review or meta-analysis. This is error prone and inefficient because of extremely imbalanced data: only a fraction of the screened studies is relevant. The future of systematic reviewing will be an interaction with machine learning algorithms to deal with the enormous increase of available text. We therefore developed an open source machine learning-aided pipeline applying active learning: ASReview. We demonstrate by means of simulation studies that ASReview can yield far more efficient reviewing than manual reviewing, while providing high quality. Furthermore, we describe the options of the free and open source research software and present the results from user experience tests. We invite the community to contribute to open source projects such as our own that provide measurable and reproducible improvements over current practice.
CLApr 2, 2024
PATCH! {P}sychometrics-{A}ssis{T}ed Ben{CH}marking of Large Language Models against Human Populations: A Case Study of Proficiency in 8th Grade MathematicsQixiang Fang, Daniel L. Oberski, Dong Nguyen
Many existing benchmarks of large (multimodal) language models (LLMs) focus on measuring LLMs' academic proficiency, often with also an interest in comparing model performance with human test takers'. While such benchmarks have proven key to the development of LLMs, they suffer from several limitations, including questionable measurement quality (e.g., Do they measure what they are supposed to in a reliable way?), lack of quality assessment on the item level (e.g., Are some items more important or difficult than others?) and unclear human population reference (e.g., To whom can the model be compared?). In response to these challenges, we propose leveraging knowledge from psychometrics -- a field dedicated to the measurement of latent variables like academic proficiency -- into LLM benchmarking. We make four primary contributions. First, we reflect on current LLM benchmark developments and contrast them with psychometrics-based test development. Second, we introduce PATCH: a novel framework for {P}sychometrics-{A}ssis{T}ed ben{CH}marking of LLMs. PATCH addresses the aforementioned limitations. In particular, PATCH enables valid comparison between LLMs and human populations. Third, we demonstrate PATCH by measuring several LLMs' proficiency in 8th grade mathematics against 56 human populations. We show that adopting a psychometrics-based approach yields evaluation outcomes that diverge from those based on current benchmarking practices. Fourth, we release 4 high-quality datasets to support measuring and comparing LLM proficiency in grade school mathematics and science with human populations.
SIMar 20, 2024
USE: Dynamic User Modeling with Stateful Sequence ModelsZhihan Zhou, Qixiang Fang, Leonardo Neves et al.
User embeddings play a crucial role in user engagement forecasting and personalized services. Recent advances in sequence modeling have sparked interest in learning user embeddings from behavioral data. Yet behavior-based user embedding learning faces the unique challenge of dynamic user modeling. As users continuously interact with the apps, user embeddings should be periodically updated to account for users' recent and long-term behavior patterns. Existing methods highly rely on stateless sequence models that lack memory of historical behavior. They have to either discard historical data and use only the most recent data or reprocess the old and new data jointly. Both cases incur substantial computational overhead. To address this limitation, we introduce User Stateful Embedding (USE). USE generates user embeddings and reflects users' evolving behaviors without the need for exhaustive reprocessing by storing previous model states and revisiting them in the future. Furthermore, we introduce a novel training objective named future W-behavior prediction to transcend the limitations of next-token prediction by forecasting a broader horizon of upcoming user behaviors. By combining it with the Same User Prediction, a contrastive learning-based objective that predicts whether different segments of behavior sequences belong to the same user, we further improve the embeddings' distinctiveness and representativeness. We conducted experiments on 8 downstream tasks using Snapchat users' behavioral logs in both static (i.e., fixed user behavior sequences) and dynamic (i.e., periodically updated user behavior sequences) settings. We demonstrate USE's superior performance over established baselines. The results underscore USE's effectiveness and efficiency in integrating historical and recent user behavior sequences into user embeddings in dynamic user modeling.
CLMay 2, 2023
Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLPAnya Belz, Craig Thomson, Ehud Reiter et al.
We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13\% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.
CYFeb 18, 2022
Evaluating the Construct Validity of Text Embeddings with Application to Survey QuestionsQixiang Fang, Dong Nguyen, Daniel L Oberski
Text embedding models from Natural Language Processing can map text data (e.g. words, sentences, documents) to supposedly meaningful numerical representations (a.k.a. text embeddings). While such models are increasingly applied in social science research, one important issue is often not addressed: the extent to which these embeddings are valid representations of constructs relevant for social science research. We therefore propose the use of the classic construct validity framework to evaluate the validity of text embeddings. We show how this framework can be adapted to the opaque and high-dimensional nature of text embeddings, with application to survey questions. We include several popular text embedding methods (e.g. fastText, GloVe, BERT, Sentence-BERT, Universal Sentence Encoder) in our construct validity analyses. We find evidence of convergent and discriminant validity in some cases. We also show that embeddings can be used to predict respondent's answers to completely new survey questions. Furthermore, BERT-based embedding techniques and the Universal Sentence Encoder provide more valid representations of survey questions than do others. Our results thus highlight the necessity to examine the construct validity of text embeddings before deploying them in social science research.
CLSep 10, 2021
Assessing the Reliability of Word Embedding Gender Bias MeasuresYupei Du, Qixiang Fang, Dong Nguyen
Various measures have been proposed to quantify human-like social biases in word embeddings. However, bias scores based on these measures can suffer from measurement error. One indication of measurement quality is reliability, concerning the extent to which a measure produces consistent results. In this paper, we assess three types of reliability of word embedding gender bias measures, namely test-retest reliability, inter-rater consistency and internal consistency. Specifically, we investigate the consistency of bias scores across different choices of random seeds, scoring rules and words. Furthermore, we analyse the effects of various factors on these measures' reliability scores. Our findings inform better design of word embedding gender bias measures. Moreover, we urge researchers to be more critical about the application of such measures.