David Stillwell

h-index35

10papers

245citations

Novelty51%

AI Score45

Ranked #66,679 of 201,326 authors (top 33%)#3,991 in AI (top 28%)

10 Papers

AIOct 25, 2023

Evaluating General-Purpose AI with Psychometrics

Xiting Wang, Liming Jiang, Jose Hernandez-Orallo et al. · cambridge

Comprehensive and accurate evaluation of general-purpose AI systems such as large language models allows for effective mitigation of their risks and deepened understanding of their capabilities. Current evaluation methodology, mostly based on benchmarks of specific tasks, falls short of adequately assessing these versatile AI systems, as present techniques lack a scientific foundation for predicting their performance on unforeseen tasks and explaining their varying performance on specific task items or user inputs. Moreover, existing benchmarks of specific tasks raise growing concerns about their reliability and validity. To tackle these challenges, we suggest transitioning from task-oriented evaluation to construct-oriented evaluation. Psychometrics, the science of psychological measurement, provides a rigorous methodology for identifying and measuring the latent constructs that underlie performance across multiple tasks. We discuss its merits, warn against potential pitfalls, and propose a framework to put it into practice. Finally, we explore future opportunities of integrating psychometrics with the evaluation of general-purpose AI systems.

CLMay 18

Multi-agent AI systems outperform human teams in creativity

Tiancheng Hu, Yixuan Jiang, Haotian Li et al.

Although artificial intelligence (AI) now matches or exceeds human performance across numerous cognitive tasks, creativity remains a highly contested frontier. As AI systems based on large language models (LLMs) are increasingly adopted in research and innovation, it is essential to understand and augment their creativity. Here we demonstrate that multi-agent LLM teams not only surpass single agents, but also substantially outperform human teams in creativity (Cohen's d=1.50) across 4,541 multi-agent LLM ideas and 341 human-team ideas on six diverse problem-solving tasks. This advantage is driven by novelty while maintaining comparable usefulness. To investigate the generative processes in both groups, we represent conversations as paths through semantic space using neural language model representations. Both LLM and human teams produce more creative ideas when conversations range widely rather than staying centered on a single theme (low global coherence). However, the additional patterns that predict creativity differ: LLM teams benefit from efficient exploration (high semantic spread, shorter paths), while human teams benefit from maintaining smooth conversational flow (high local coherence, frequent pivots). Additionally, we identify model choice and discussion structure as orthogonal design levers that together explain 26.8% of variance in LLM conversational dynamics, paving the way for systematic approaches to developing multi-agent systems with augmented creative capabilities.

AIMar 9, 2025

General Scales Unlock AI Evaluation with Explanatory and Predictive Power

Lexin Zhou, Lorenzo Pacchiardi, Fernando Martínez-Plumed et al. · cambridge

Ensuring safe and effective use of AI requires understanding and anticipating its performance on novel tasks, from advanced scientific challenges to transformed workplace activities. So far, benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems, given the low transferability across diverse tasks. In this paper, we introduce general scales for AI evaluation that can explain what common AI benchmarks really measure, extract ability profiles of AI systems, and predict their performance for new task instances, in- and out-of-distribution. Our fully-automated methodology builds on 18 newly-crafted rubrics that place instance demands on general scales that do not saturate. Illustrated for 15 large language models and 63 tasks, high explanatory power is unleashed from inspecting the demand and ability profiles, bringing insights on the sensitivity and specificity exhibited by different benchmarks, and how knowledge, metacognition and reasoning are affected by model size, chain-of-thought and distillation. Surprisingly, high predictive power at the instance level becomes possible using these demand levels, providing superior estimates over black-box baseline predictors based on embeddings or finetuning, especially in out-of-distribution settings (new tasks and new benchmarks). The scales, rubrics, battery, techniques and results presented here represent a major step for AI evaluation, underpinning the reliable deployment of AI in the years ahead. (Collaborative platform: https://kinds-of-intelligence-cfi.github.io/ADELE.)

AIDec 4, 2024

Large Language Models show both individual and collective creativity comparable to humans

Luning Sun, Yuzhuo Yuan, Yuan Yao et al. · cambridge

Artificial intelligence has, so far, largely automated routine tasks, but what does it mean for the future of work if Large Language Models (LLMs) show creativity comparable to humans? To measure the creativity of LLMs holistically, the current study uses 13 creative tasks spanning three domains. We benchmark the LLMs against individual humans, and also take a novel approach by comparing them to the collective creativity of groups of humans. We find that the best LLMs (Claude and GPT-4) rank in the 52nd percentile against humans, and overall LLMs excel in divergent thinking and problem solving but lag in creative writing. When questioned 10 times, an LLM's collective creativity is equivalent to 8-10 humans. When more responses are requested, two additional responses of LLMs equal one extra human. Ultimately, LLMs, when optimally applied, may compete with a small group of humans in the future of work.

CLOct 9, 2025

A Novel Framework for Augmenting Rating Scale Tests with LLM-Scored Text Data

Joe Watson, Ivan O'Conner, Chia-Wen Chen et al. · cambridge

Psychological assessments typically rely on structured rating scales, which cannot incorporate the rich nuance of a respondent's natural language. This study leverages recent LLM advances to harness qualitative data within a novel conceptual framework, combining LLM-scored text and traditional rating-scale items to create an augmented test. We demonstrate this approach using depression as a case study, developing and assessing the framework on a real-world sample of upper secondary students (n=693) and corresponding synthetic dataset (n=3,000). On held-out test sets, augmented tests achieved statistically significant improvements in measurement precision and accuracy. The information gain from the LLM items was equivalent to adding between 6.3 (real data) and 16.0 (synthetic data) items to the original 19-item test. Our approach marks a conceptual shift in automated scoring that bypasses its typical bottlenecks: instead of relying on pre-labelled data or complex expert-created rubrics, we empirically select the most informative LLM scoring instructions based on calculations of item information. This framework provides a scalable approach for leveraging the growing stream of transcribed text to enhance traditional psychometric measures, and we discuss its potential utility in clinical health and beyond.

CLOct 29, 2021

The Golden Rule as a Heuristic to Measure the Fairness of Texts Using Machine Learning

Ahmed Izzidien, David Stillwell

In this paper we present a natural language programming framework to consider how the fairness of acts can be measured. For the purposes of the paper, a fair act is defined as one that one would be accepting of if it were done to oneself. The approach is based on an implementation of the golden rule (GR) in the digital domain. Despite the GRs prevalence as an axiom throughout history, no transfer of this moral philosophy into computational systems exists. In this paper we consider how to algorithmically operationalise this rule so that it may be used to measure sentences such as: the boy harmed the girl, and categorise them as fair or unfair. A review and reply to criticisms of the GR is made. A suggestion of how the technology may be implemented to avoid unfair biases in word embeddings is made - given that individuals would typically not wish to be on the receiving end of an unfair act, such as racism, irrespective of whether the corpus being used deems such discrimination as praiseworthy.

AIOct 29, 2021

Measuring a Texts Fairness Dimensions Using Machine Learning Based on Social Psychological Factors

Ahmed Izzidien, David Stillwell

Fairness is a principal social value that can be observed in civilisations around the world. A manifestation of this is in social agreements, often described in texts, such as contracts. Yet, despite the prevalence of such, a fairness metric for texts describing a social act remains wanting. To address this, we take a step back to consider the problem based on first principals. Instead of using rules or templates, we utilise social psychology literature to determine the principal factors that humans use when making a fairness assessment. We then attempt to digitise these using word embeddings into a multi-dimensioned sentence level fairness perceptions vector to serve as an approximation for these fairness perceptions. The method leverages a pro-social bias within word embeddings, for which we obtain an F1= 81.0. A second approach, using PCA and ML based on the said fairness approximation vector produces an F1 score of 86.2. We detail improvements that can be made in the methodology to incorporate the projection of sentence embedding on to a subspace representation of fairness.

CVAug 3, 2017

What your Facebook Profile Picture Reveals about your Personality

Cristina Segalin, Fabio Celli, Luca Polonio et al.

People spend considerable effort managing the impressions they give others. Social psychologists have shown that people manage these impressions differently depending upon their personality. Facebook and other social media provide a new forum for this fundamental process; hence, understanding people's behaviour on social media could provide interesting insights on their personality. In this paper we investigate automatic personality recognition from Facebook profile pictures. We analyze the effectiveness of four families of visual features and we discuss some human interpretable patterns that explain the personality traits of the individuals. For example, extroverts and agreeable individuals tend to have warm colored pictures and to exhibit many faces in their portraits, mirroring their inclination to socialize; while neurotic ones have a prevalence of pictures of indoor places. Then, we propose a classification approach to automatically recognize personality traits from these visual features. Finally, we compare the performance of our classification approach to the one obtained by human raters and we show that computer-based classifications are significantly more accurate than averaged human-based classifications for Extraversion and Neuroticism.

CLMay 22, 2017

Latent Human Traits in the Language of Social Media: An Open-Vocabulary Approach

Vivek Kulkarni, Margaret L. Kern, David Stillwell et al.

Over the past century, personality theory and research has successfully identified core sets of characteristics that consistently describe and explain fundamental differences in the way people think, feel and behave. Such characteristics were derived through theory, dictionary analyses, and survey research using explicit self-reports. The availability of social media data spanning millions of users now makes it possible to automatically derive characteristics from language use -- at large scale. Taking advantage of linguistic information available through Facebook, we study the process of inferring a new set of potential human traits based on unprompted language use. We subject these new traits to a comprehensive set of evaluations and compare them with a popular five factor model of personality. We find that our language-based trait construct is often more generalizable in that it often predicts non-questionnaire-based outcomes better than questionnaire-based traits (e.g. entities someone likes, income and intelligence quotient), while the factors remain nearly as stable as traditional factors. Our approach suggests a value in new constructs of personality derived from everyday human language use.

CVJun 29, 2016

How smart does your profile image look? Estimating intelligence from social network profile images

Xingjie Wei, David Stillwell

Profile images on social networks are users' opportunity to present themselves and to affect how others judge them. We examine what Facebook images say about users' perceived and measured intelligence. 1,122 Facebook users completed a matrices intelligence test and shared their current Facebook profile image. Strangers also rated the images for perceived intelligence. We use automatically extracted image features to predict both measured and perceived intelligence. Intelligence estimation from images is a difficult task even for humans, but experimental results show that human accuracy can be equalled using computing methods. We report the image features that predict both measured and perceived intelligence, and highlight misleading features such as "smiling" and "wearing glasses" that are correlated with perceived but not measured intelligence. Our results give insights into inaccurate stereotyping from profile images and also have implications for privacy, especially since in most social networks profile images are public by default.