HC · May 27
Fostering human learning is crucial for boosting human-AI synergy
Julian Berger, Jason W. Burton, Ralph Hertwig et al.
The collaboration between humans and artificial intelligence (AI) holds the promise of achieving superior outcomes compared to either acting alone, a phenomenon called human-AI synergy. Nevertheless, our understanding of the conditions that facilitate such human-AI synergy when humans are advised by AI remains limited. A recent meta-analysis showed that, on average, human-AI combinations do not outperform the better individual agent. We argue that this pessimistic conclusion arises from insufficient attention to human learning in the experimental designs. To substantiate this claim, we re-analyzed all 74 studies included in the original meta-analysis, yielding two new findings. First, most previous research overlooked design features that foster human learning, such as providing outcome feedback to participants. Second, our re-analysis demonstrated that studies providing outcome feedback show tentatively higher synergy than those without outcome feedback. Crucially, feedback paired with AI explanations tends to yield positive synergy, while explanations without feedback were linked to negative synergy, indicating that explanations increase synergy only when humans can learn to verify the AI's reliability through feedback. We conclude that the current literature underestimates the potential of human-AI collaboration because it predominantly relies on paradigms that do not facilitate human learning, thus hindering humans from effectively adapting their collaboration strategies. We therefore advocate for a paradigm shift in human-AI interaction research that explicitly addresses human learning and thus enhances our understanding of and support for successful human-AI collaboration.
SI · May 26
Mapping the gender attrition gap in academic psychology
Xinyi Zhao, Anna I. Thoma, Ralph Hertwig et al.
Women comprise the majority of students and early-career scholars in psychology, yet they are less likely to remain active in research over time. This pattern raises a central question: At what stages of academic careers do women disproportionately leave academia, and what factors drive their attrition? Using large-scale bibliometric data tracking 78,216 psychologists who began publishing between 2000 and 2014, we examine gender differences in research career attrition operationalized through publishing activity across the full trajectory from entry onward. Although women accounted for more than 60% of new entrants, they experienced higher attrition rates than men, with the gender gap peaking approximately five years after first publication. Early-career performance, particularly first-authored publications, was the strongest predictor of subsequent retention, whereas last-authored publications were most closely associated with continued activity at later career stages. Collaboration patterns and institutional context also shaped career persistence, though less strongly than publication indicators. Notably, gender differences in research attrition persisted even after accounting for these career determinants, especially during early career stages. These findings suggest that gender inequality in psychology is driven less by recruitment than by differential retention over time. Addressing early-career vulnerability may therefore be essential to achieving equitable representation in senior academic leadership within the discipline.
CL · Jan 25, 2023
Using novel data and ensemble models to improve automated labeling of Sustainable Development Goals
Dirk U. Wulff, Dominik S. Meier, Rui Mata
A number of labeling systems based on text have been proposed to help monitor work on the United Nations (UN) Sustainable Development Goals (SDGs). Here, we present a systematic comparison of systems using a variety of text sources and show that systems differ considerably in their sensitivity (i.e., true-positive rate) and specificity (i.e., true-negative rate), have systematic biases (e.g., are more sensitive to specific SDGs relative to others), and are susceptible to the type and amount of text analyzed. We then show that an ensemble model that pools labeling systems alleviates some of these limitations, exceeding the labeling performance of all currently available systems. We conclude that researchers and policymakers should care about the choice of labeling system and that ensemble methods should be favored when drawing conclusions about the absolute and relative prevalence of work on the SDGs based on automated methods.
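To make the evaluation terms concrete, here is a minimal sketch of how sensitivity (true-positive rate) and specificity (true-negative rate) can be computed for a binary SDG labeler, and how pooling several labelers can help. The pooling rule below is a simple majority vote chosen for illustration; the abstract does not say which ensemble model the authors use, and the toy labels and the names `system_a`, `system_b`, and `system_c` are hypothetical.

```python
from typing import Sequence


def sensitivity_specificity(
    truth: Sequence[bool], pred: Sequence[bool]
) -> tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels.

    Assumes both classes occur in `truth`; otherwise a division by zero occurs.
    """
    tp = sum(t and p for t, p in zip(truth, pred))
    fn = sum(t and not p for t, p in zip(truth, pred))
    tn = sum(not t and not p for t, p in zip(truth, pred))
    fp = sum(not t and p for t, p in zip(truth, pred))
    return tp / (tp + fn), tn / (tn + fp)


def majority_vote(system_preds: Sequence[Sequence[bool]]) -> list[bool]:
    """Label a document positive when more than half of the systems do."""
    n_systems = len(system_preds)
    return [sum(votes) * 2 > n_systems for votes in zip(*system_preds)]


# Contrived toy data: does each of six documents concern a given SDG?
truth = [True, True, True, False, False, False]
system_a = [True, True, False, True, False, False]
system_b = [True, False, True, False, True, False]
system_c = [False, True, True, False, False, True]

ensemble = majority_vote([system_a, system_b, system_c])

for name, pred in [("a", system_a), ("b", system_b), ("c", system_c), ("ensemble", ensemble)]:
    sens, spec = sensitivity_specificity(truth, pred)
    print(f"{name}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

The toy systems are constructed so that their errors fall on different documents, which is the situation in which pooling helps most; real labeling systems with correlated errors would gain less.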
AI · Apr 21
Machine individuality: Separating genuine idiosyncrasy from response bias in large language models
Valentin Kriegmair, Dirk U. Wulff
As large language models (LLMs) are increasingly integrated into daily life, in roles ranging from high-stakes decision support to companionship, understanding their behavioral dispositions becomes critical. A growing literature uses psychometric inventories and cognitive paradigms to profile LLM dispositions. However, these approaches cannot determine whether behavioral differences reflect stable, stimulus-specific individuality or global response biases and stochastic noise. Here, we apply crossed random-effects models -- widely used in psychometrics to separate systematic effects -- to 74.9 million ratings provided by 10 open-weight LLMs for over 100,000 words across 14 psycholinguistic norms. On average, 16.9% of variance is attributable to stimulus-specific individuality, robustly exceeding a statistical null model. Cross-norm prediction analyses reveal this individuality as a coherent fingerprint, unique to each model. These results identify individual differences among LLMs that cannot be attributed to response biases or stochastic noise. We term these differences machine individuality.
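One way to read the variance decomposition described above is as a crossed random-effects model with a model-by-word interaction. The formulation below is only a sketch under assumptions: the notation is hypothetical, it presumes repeated ratings per model and word, and the authors' actual specification (for example, how scale-use biases are modeled) may differ.

```latex
% Hypothetical notation: y_{mwr} is the rating given by model m to word w on repetition r.
y_{mwr} = \mu + a_m + b_w + c_{mw} + \varepsilon_{mwr},
\qquad
a_m \sim \mathcal{N}(0, \sigma_a^2),\;
b_w \sim \mathcal{N}(0, \sigma_b^2),\;
c_{mw} \sim \mathcal{N}(0, \sigma_c^2),\;
\varepsilon_{mwr} \sim \mathcal{N}(0, \sigma_\varepsilon^2)

% Share of variance attributable to stimulus-specific individuality:
\frac{\sigma_c^2}{\sigma_a^2 + \sigma_b^2 + \sigma_c^2 + \sigma_\varepsilon^2}
```

In this reading, $a_m$ captures a global response bias of model $m$, $b_w$ the word effect shared by all models, $c_{mw}$ the stimulus-specific individuality, and $\varepsilon_{mwr}$ the stochastic noise.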
AI · Oct 31, 2025
Advancing Cognitive Science with LLMs
Dirk U. Wulff, Rui Mata
Cognitive science faces ongoing challenges in knowledge synthesis and conceptual clarity, in part due to its multifaceted and interdisciplinary nature. Recent advances in artificial intelligence, particularly the development of large language models (LLMs), offer tools that may help to address these issues. This review examines how LLMs can support areas where the field has historically struggled, including establishing cross-disciplinary connections, formalizing theories, developing clear measurement taxonomies, achieving generalizability through integrated modeling frameworks, and capturing contextual and individual variation. We outline the current capabilities and limitations of LLMs in these domains, including potential pitfalls. Taken together, we conclude that LLMs can serve as tools for a more integrative and cumulative cognitive science when used judiciously to complement, rather than replace, human expertise.
CL · Oct 12, 2021 · Code
text2sdg: An R package to Monitor Sustainable Development Goals from Text
Dominik S. Meier, Rui Mata, Dirk U. Wulff
Monitoring progress on the United Nations Sustainable Development Goals (SDGs) is important for both academic and non-academic organizations. Existing approaches to monitoring SDGs have focused on specific data types; namely, publications listed in proprietary research databases. We present the text2sdg package for the R language, a user-friendly, open-source package that detects SDGs in any kind of text data using different existing or custom-made query systems. The text2sdg package thereby facilitates the monitoring of SDGs for a wide array of text sources and provides a much-needed basis for validating and improving extant methods to detect SDGs from text.
CL · May 8
Post-training makes large language models less human-like
Marcel Binz, Elif Akata, Abdullah Almaatouq et al.
Large language models (LLMs) are increasingly used as surrogates for human participants, but it remains unclear which models best capture human behavior and why. To address this, we introduce Psych-201, a novel dataset that enables us to measure behavioral alignment at scale. We find that post-training -- the stage that turns base models into useful assistants -- consistently reduces alignment with human behavior across model families, sizes, and objectives. Moreover, this misalignment widens in newer model generations even as base models continue to improve. Finally, we find that persona-induction -- a popular technique for eliciting human-like behavior by conditioning models on participant-specific information -- does not improve predictions at the level of individuals. Taken together, our results suggest that the very processes that are currently employed to turn LLMs into useful assistants also make them less accurate models of human behavior.
CL · Jan 7
Where meaning lives: Layer-wise accessibility of psycholinguistic features in encoder and decoder language models
Taisiia Tikhomirova, Dirk U. Wulff
Understanding where transformer language models encode psychologically meaningful aspects of meaning is essential for both theory and practice. We conduct a systematic layer-wise probing study of 58 psycholinguistic features across 10 transformer models, spanning encoder-only and decoder-only architectures, and compare three embedding extraction methods. We find that apparent localization of meaning is strongly method-dependent: contextualized embeddings yield higher feature-specific selectivity and different layer-wise profiles than isolated embeddings. Across models and methods, final-layer representations are rarely optimal for recovering psycholinguistic information with linear probes. Despite these differences, models exhibit a shared depth ordering of meaning dimensions, with lexical properties peaking earlier and experiential and affective dimensions peaking later. Together, these results show that where meaning "lives" in transformer models reflects an interaction between methodological choices and architectural constraints.
CL · Apr 21
The "Small World of Words" German Free-Association Norms
Samuel Aeschbach, Rui Mata, Kaidi Lõo et al.
Free-association norms provide essential empirical data for investigating linguistic, semantic, and cultural phenomena in the cognitive sciences. Although large-scale norms exist for languages such as English, Dutch, Spanish, and Mandarin Chinese, no comparable resource has been available for German. To address this gap, we present free-association norms for 5,877 German cue words as part of the German version of the multilingual Small World of Words (SWOW) project. We describe the data collection procedures, participant characteristics, and our comprehensive preprocessing pipeline before introducing the resulting SWOW-DE data set. Using data from three established psycholinguistic paradigms, we show that SWOW-DE norms robustly predict performance in lexical decision tasks, relatedness judgments, and psycholinguistic word ratings. Furthermore, we demonstrate that SWOW-DE responses compare favorably with existing German resources and provide a preliminary cross-linguistic comparison revealing both shared and language-specific association patterns, highlighting promising directions for future research. Overall, SWOW-DE represents the largest collection of German free associations to date and offers a unique resource for linguistic, psychological, and cross-cultural research.
CL · Dec 6, 2024
Probing the contents of semantic representations from text, behavior, and brain data using the psychNorms metabase
Zak Hussain, Rui Mata, Ben R. Newell et al.
Semantic representations are integral to natural language processing, psycholinguistics, and artificial intelligence. Although these representations are often derived from internet text, recent years have seen a rise in the popularity of behavior-based (e.g., free associations) and brain-based (e.g., fMRI) representations, which promise improvements in our ability to measure and model human representations. We carry out the first systematic evaluation of the similarities and differences between semantic representations derived from text, behavior, and brain data. Using representational similarity analysis, we show that word vectors derived from behavior and brain data encode information that differs from their text-derived cousins. Furthermore, drawing on our psychNorms metabase, alongside an interpretability method that we call representational content analysis, we find that, in particular, behavior representations capture unique variance on certain affective, agentic, and socio-moral dimensions. We thus establish behavior as an important complement to text for capturing human representations and behavior. These results are broadly relevant to research aimed at learning human-aligned semantic representations, including work on evaluating and aligning large language models.
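Representational similarity analysis compares two embedding spaces by correlating their pairwise word similarities rather than the vectors themselves, so the spaces may have different dimensionalities. The sketch below illustrates the idea with cosine similarity and a Pearson correlation; Spearman correlation is also common in practice, and the abstract does not state which variant the authors use. The toy vectors and the names `text_space` and `behavior_space` are hypothetical.

```python
import math
from itertools import combinations


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarities(space: dict[str, list[float]], words: list[str]) -> list[float]:
    """Cosine similarity for every unordered word pair, in a fixed order."""
    return [cosine(space[a], space[b]) for a, b in combinations(words, 2)]


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rsa_score(space_a: dict[str, list[float]], space_b: dict[str, list[float]]) -> float:
    """Correlate the pairwise-similarity profiles of two spaces over shared words."""
    words = sorted(set(space_a) & set(space_b))
    return pearson(
        pairwise_similarities(space_a, words),
        pairwise_similarities(space_b, words),
    )


# Toy spaces with different dimensionalities (illustrative values only).
text_space = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
behavior_space = {"cat": [1.0, 0.0, 0.0], "dog": [0.8, 0.2, 0.0], "car": [0.0, 0.0, 1.0]}

print(f"RSA score: {rsa_score(text_space, behavior_space):.3f}")
```

Because the comparison runs on similarity structure, a rotated copy of a space scores as identical to the original, which is the property that makes RSA suitable for comparing text, behavior, and brain data.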
CL · Oct 23, 2024
Measuring individual semantic networks: A simulation study
Samuel Aeschbach, Rui Mata, Dirk U. Wulff
Accurately capturing individual differences in semantic networks is fundamental to advancing our mechanistic understanding of semantic memory. Past empirical attempts to construct individual-level semantic networks from behavioral paradigms may be limited by data constraints. To assess these limitations and propose improved designs for the measurement of individual semantic networks, we conducted a recovery simulation investigating the psychometric properties underlying estimates of individual semantic networks obtained from two different behavioral paradigms: free associations and relatedness judgment tasks. Our results show that successful inference of semantic networks is achievable, but they also highlight critical challenges. Estimates of absolute network characteristics are severely biased, such that comparisons between behavioral paradigms and different design configurations are often not meaningful. However, comparisons within a given paradigm and design configuration can be accurate and generalizable when based on designs with moderate numbers of cues, moderate numbers of responses, and cue sets including diverse words. Ultimately, our results provide insights that help evaluate past findings on the structure of semantic networks and design new studies capable of more reliably revealing individual differences in semantic networks.
ML · Aug 26, 2016
Estimating the Number of Clusters via Normalized Cluster Instability
Jonas M. B. Haslbeck, Dirk U. Wulff
We improve current instability-based methods for the selection of the number of clusters $k$ in cluster analysis by developing a normalized cluster instability measure that corrects for the distribution of cluster sizes, a previously unaccounted driver of cluster instability. We show that our normalized instability measure outperforms current instability-based measures across the whole sequence of possible $k$ and especially overcomes limitations in the context of large $k$. We also compare, for the first time, model-based and model-free approaches to determine cluster instability and find their performance to be comparable. We make our method available in the R package cstab.
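Instability-based selection of $k$ rests on measuring how much two clusterings of the same objects disagree, typically across resampled data sets. The sketch below shows one common unnormalized distance, the share of object pairs that are grouped together in one clustering but apart in the other. It is not the normalized measure proposed in the paper, whose correction for cluster sizes is not spelled out in the abstract, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def clustering_distance(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Share of object pairs on which two clusterings disagree about co-membership.

    Needs at least two objects; invariant to how cluster labels are numbered.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    disagreements = sum(
        (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return disagreements / len(pairs)


def instability(clusterings: Sequence[Sequence[int]]) -> float:
    """Mean pairwise distance over a set of clusterings of the same objects."""
    distances = [clustering_distance(a, b) for a, b in combinations(clusterings, 2)]
    return sum(distances) / len(distances)


# Same partition under different label names: distance 0.
print(clustering_distance([0, 0, 1, 1], [1, 1, 0, 0]))
# Crossed partitions of four objects: four of six pairs disagree.
print(clustering_distance([0, 0, 1, 1], [0, 1, 0, 1]))
```

In an instability-based procedure, `instability` would be evaluated for clusterings obtained at each candidate $k$, and the $k$ with the lowest value would be preferred.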