Evelina Leivada

CL
h-index17
15papers
185citations
Novelty41%
AI Score52

15 Papers

CYJun 3
Prioritization of Risks from Artificial Intelligence: A Delphi Study of 272 International Experts

Alexander K. Saeri, Jess Graham, Michael Noetel et al.

Artificial intelligence poses many risks, ranging from familiar present-day harms to unprecedented and potentially catastrophic ones. Effective risk management requires prioritization: we must understand which risks are most severe, who is most vulnerable, and who is most responsible for addressing them. We report results from a three-round Delphi study conducted late 2025 with 272 international AI experts. Experts rated 24 AI risks on harm probability and severity, sector and actor vulnerability, actor responsibility, and overall concern. Experts estimated the five most severe harms in the next 5 years were likely to come from dangerous capabilities, competitive dynamics, weapons & cyberattacks (including CBRNE), power centralization, and false information. In a business-as-usual scenario, experts judged 18 of 24 risks as having a more than 10% probability of catastrophic outcomes (e.g., more than 1 million deaths or more than USD 100B in financial loss) in the next 5 years (2025-2030). In a scenario where pragmatic mitigations are implemented, experts still judged five risks as having a more than 10% probability of catastrophic outcomes: dangerous capabilities, weapons & cyberattacks, environmental harm, inequality & unemployment, and power centralization. All 24 risks were judged as being more than 5% likely to cause catastrophic outcomes. AI users and the general public were judged the most vulnerable to these risks, but experts assigned the highest responsibility for addressing them to general-purpose AI developers and governance actors (including governments, regulators, and standards bodies). Across most risks, experts identified information, finance, and national security as the most vulnerable sectors. These findings can guide AI risk prioritization and clarify expert expectations about who should bear responsibility for mitigation.

CLFeb 23, 2023
Testing AI on language comprehension tasks reveals insensitivity to underlying meaning

Vittoria Dentella, Fritz Guenther, Elliot Murphy et al.

Large Language Models (LLMs) are recruited in applications that span from clinical assistance and legal support to question answering and education. Their success in specialized tasks has led to the claim that they possess human-like linguistic capabilities related to compositional understanding and reasoning. Yet, reverse-engineering is bound by Moravec's Paradox, according to which easy skills are hard. We systematically assess 7 state-of-the-art models on a novel benchmark. Models answered a series of comprehension questions, each prompted multiple times in two settings, permitting one-word or open-length replies. Each question targets a short text featuring high-frequency linguistic constructions. To establish a baseline for achieving human-like performance, we tested 400 humans on the same prompts. Based on a dataset of n=26,680 datapoints, we discovered that LLMs perform at chance accuracy and waver considerably in their answers. Quantitatively, the tested models are outperformed by humans, and qualitatively their answers showcase distinctly non-human errors in language understanding. We interpret this evidence as suggesting that, despite their usefulness in various tasks, current AI models fall short of understanding language in a way that matches humans, and we argue that this may be due to their lack of a compositional operator for regulating grammatical and semantic information.

CLOct 23, 2022
DALL-E 2 Fails to Reliably Capture Common Syntactic Processes

Evelina Leivada, Elliot Murphy, Gary Marcus

Machine intelligence is increasingly being linked to claims about sentience, language processing, and an ability to comprehend and transform natural language into a range of stimuli. We systematically analyze the ability of DALL-E 2 to capture 8 grammatical phenomena pertaining to compositionality that are widely discussed in linguistics and pervasive in human language: binding principles and coreference, passives, word order, coordination, comparatives, negation, ellipsis, and structural ambiguity. Whereas young children routinely master these phenomena, learning systematic mappings between syntax and semantics, DALL-E 2 is unable to reliably infer meanings that are consistent with the syntax. These results challenge recent claims concerning the capacity of such systems to understand of human language. We make available the full set of test materials as a benchmark for future testing.

CLJul 26, 2023
A Sentence is Worth a Thousand Pictures: Can Large Language Models Understand Hum4n L4ngu4ge and the W0rld behind W0rds?

Evelina Leivada, Gary Marcus, Fritz Günther et al.

Modern Artificial Intelligence applications show great potential for language-related tasks that rely on next-word prediction. The current generation of Large Language Models (LLMs) have been linked to claims about human-like linguistic performance and their applications are hailed both as a step towards artificial general intelligence and as a major advance in understanding the cognitive, and even neural basis of human language. To assess these claims, first we analyze the contribution of LLMs as theoretically informative representations of a target cognitive system vs. atheoretical mechanistic tools. Second, we evaluate the models' ability to see the bigger picture, through top-down feedback from higher levels of processing, which requires grounding in previous expectations and past world experience. We hypothesize that since models lack grounded cognition, they cannot take advantage of these features and instead solely rely on fixed associations between represented words and word vectors. To assess this, we designed and ran a novel 'leet task' (l33t t4sk), which requires decoding sentences in which letters are systematically replaced by numbers. The results suggest that humans excel in this task whereas models struggle, confirming our hypothesis. We interpret the results by identifying the key abilities that are still missing from the current state of development of these models, which require solutions that go beyond increased system scaling.

CLOct 17, 2023
The Quo Vadis of the Relationship between Language and Large Language Models

Evelina Leivada, Vittoria Dentella, Elliot Murphy

In the field of Artificial (General) Intelligence (AI), the several recent advancements in Natural language processing (NLP) activities relying on Large Language Models (LLMs) have come to encourage the adoption of LLMs as scientific models of language. While the terminology employed for the characterization of LLMs favors their embracing as such, it is not clear that they are in a place to offer insights into the target system they seek to represent. After identifying the most important theoretical and empirical risks brought about by the adoption of scientific models that lack transparency, we discuss LLMs relating them to every scientific model's fundamental components: the object, the medium, the meaning and the user. We conclude that, at their current stage of development, LLMs hardly offer any explanations for language, and then we provide an outlook for more informative future research directions on this topic.

CLMay 20
Tracing the ongoing emergence of human-like reasoning in Large Language Models

Paolo Morosi, Nikoleta Pantelidou, Fritz Günther et al.

Humans effortlessly go beyond literal meanings: If you mow the lawn, I will give you fifty dollars, is typically understood as implying that the speaker will pay only if the lawn is mowed, whereas If you are hungry, there is pizza in the oven implies that pizza is available regardless of the hearers hunger. Large Language Models - LLMs - show human-like performance on many tasks, yet it remains unclear whether they reason like humans. To address this, we conducted a population-matching experiment assessing how twentyfive LLMs compute conditional inferences across four languages, compared to an equal number of humans per language. We find that humans enrich logical reasoning through pragmatic inferences across languages. Model behavior is more variable. Some LLMs perfectly follow the truth-table of conditionals but they ignore pragmatic inferences, while others deviate from the truth-table, adhering to a single interpretation across the board, thus reflecting accurate rule-based processing but not human-like reasoning. Overall, LLMs are accurate semantic operators, but fail to capture the pragmatic enrichments characteristic of human reasoning. Crucially, LLM accuracy is neither predicted nor boosted by open vs. closed status, training orientation, or architecture type, suggesting that pragmatic reasoning is still an emerging ability in the cognitive toolkit of artificial systems.

CLFeb 23
Multilingual Large Language Models do not comprehend all natural languages to equal degrees

Natalia Moskvina, Raquel Montero, Masaya Yoshida et al.

Large Language Models (LLMs) play a critical role in how humans access information. While their core use relies on comprehending written requests, our understanding of this ability is currently limited, because most benchmarks evaluate LLMs in high-resource languages predominantly spoken by Western, Educated, Industrialised, Rich, and Democratic (WEIRD) communities. The default assumption is that English is the best-performing language for LLMs, while smaller, low-resource languages are linked to less reliable outputs, even in multilingual, state-of-the-art models. To track variation in the comprehension abilities of LLMs, we prompt 3 popular models on a language comprehension task across 12 languages, representing the Indo-European, Afro-Asiatic, Turkic, Sino-Tibetan, and Japonic language families. Our results suggest that the models exhibit remarkable linguistic accuracy across typologically diverse languages, yet they fall behind human baselines in all of them, albeit to different degrees. Contrary to what was expected, English is not the best-performing language, as it was systematically outperformed by several Romance languages, even lower-resource ones. We frame the results by discussing the role of several factors that drive LLM performance, such as tokenization, language distance from Spanish and English, size of training data, and data origin in high- vs. low-resource languages and WEIRD vs. non-WEIRD communities.

CLNov 11, 2025
Quantification and object perception in Multimodal Large Language Models deviate from human linguistic cognition

Raquel Montero, Natalia Moskvina, Paolo Morosi et al.

Quantification has been proven to be a particularly difficult linguistic phenomenon for (Multimodal) Large Language Models (MLLMs). However, given that quantification interfaces with the logic, pragmatic, and numerical domains, the exact reasons for the poor performance are still unclear. This papers looks at three key features of human quantification shared cross-linguistically that have remained so far unexplored in the (M)LLM literature: the ordering of quantifiers into scales, the ranges of use and prototypicality, and the biases inherent in the human approximate number system. The aim is to determine how these features are encoded in the models' architecture, how they may differ from humans, and whether the results are affected by the type of model and language under investigation. We find that there are clear differences between humans and MLLMs with respect to these features across various tasks that tap into the representation of quantification in vivo vs. in silico. This work, thus, paves the way for addressing the nature of MLLMs as semantic and pragmatic agents, while the cross-linguistic lens can elucidate whether their abilities are robust and stable across different languages.

CLMay 8
Post-training makes large language models less human-like

Marcel Binz, Elif Akata, Abdullah Almaatouq et al.

Large language models (LLMs) are increasingly used as surrogates for human participants, but it remains unclear which models best capture human behavior and why. To address this, we introduce Psych-201, a novel dataset that enables us to measure behavioral alignment at scale. We find that post-training -- the stage that turns base models into useful assistants -- consistently reduces alignment with human behavior across model families, sizes, and objectives. Moreover, this misalignment widens in newer model generations even as base models continue to improve. Finally, we find that persona-induction -- a popular technique for eliciting human-like behavior by conditioning models on participant-specific information -- does not improve predictions at the level of individuals. Taken together, our results suggest that the very processes that are currently employed to turn LLMs into useful assistants also make them less accurate models of human behavior.

CLMay 14, 2025
Large Language Models Are More Persuasive Than Incentivized Human Persuaders

Philipp Schoenegger, Francesco Salvi, Jiacheng Liu et al. · oxford

We directly compare the persuasion capabilities of a frontier large language model (LLM; Claude Sonnet 3.5) against incentivized human persuaders in an interactive, real-time conversational quiz setting. In this preregistered, large-scale incentivized experiment, participants (quiz takers) completed an online quiz where persuaders (either humans or LLMs) attempted to persuade quiz takers toward correct or incorrect answers. We find that LLM persuaders achieved significantly higher compliance with their directional persuasion attempts than incentivized human persuaders, demonstrating superior persuasive capabilities in both truthful (toward correct answers) and deceptive (toward incorrect answers) contexts. We also find that LLM persuaders significantly increased quiz takers' accuracy, leading to higher earnings, when steering quiz takers toward correct answers, and significantly decreased their accuracy, leading to lower earnings, when steering them toward incorrect answers. Overall, our findings suggest that AI's persuasion capabilities already exceed those of humans that have real-money bonuses tied to performance. Our findings of increasingly capable AI persuaders thus underscore the urgency of emerging alignment and governance frameworks.

CLApr 23, 2024
Language in Vivo vs. in Silico: Size Matters but Larger Language Models Still Do Not Comprehend Language on a Par with Humans Due to Impenetrable Semantic Reference

Vittoria Dentella, Fritz Guenther, Evelina Leivada

Understanding the limits of language is a prerequisite for Large Language Models (LLMs) to act as theories of natural language. LLM performance in some language tasks presents both quantitative and qualitative differences from that of humans, however it remains to be determined whether such differences are amenable to model size. This work investigates the critical role of model scaling, determining whether increases in size make up for such differences between humans and models. We test three LLMs from different families (Bard, 137 billion parameters; ChatGPT-3.5, 175 billion; ChatGPT-4, 1.5 trillion) on a grammaticality judgment task featuring anaphora, center embedding, comparatives, and negative polarity. N=1,200 judgments are collected and scored for accuracy, stability, and improvements in accuracy upon repeated presentation of a prompt. Results of the best performing LLM, ChatGPT-4, are compared to results of n=80 humans on the same stimuli. We find that humans are overall less accurate than ChatGPT-4 (76% vs. 80% accuracy, respectively), but that this is due to ChatGPT-4 outperforming humans only in one task condition, namely on grammatical sentences. Additionally, ChatGPT-4 wavers more than humans in its answers (12.5% vs. 9.6% likelihood of an oscillating answer, respectively). Thus, while increased model size may lead to better performance, LLMs are still not sensitive to (un)grammaticality the same way as humans are. It seems possible but unlikely that scaling alone can fix this issue. We interpret these results by comparing language learning in vivo and in silico, identifying three critical differences concerning (i) the type of evidence, (ii) the poverty of the stimulus, and (iii) the occurrence of semantic hallucinations due to impenetrable linguistic reference.

CLOct 14, 2025
Resource-sensitive but language-blind: Community size and not grammatical complexity better predicts the accuracy of Large Language Models in a novel Wug Test

Nikoleta Pantelidou, Evelina Leivada, Paolo Morosi

The linguistic abilities of Large Language Models are a matter of ongoing debate. This study contributes to this discussion by investigating model performance in a morphological generalization task that involves novel words. Using a multilingual adaptation of the Wug Test, six models were tested across four partially unrelated languages (Catalan, English, Greek, and Spanish) and compared with human speakers. The aim is to determine whether model accuracy approximates human competence and whether it is shaped primarily by linguistic complexity or by the quantity of available training data. Consistent with previous research, the results show that the models are able to generalize morphological processes to unseen words with human-like accuracy. However, accuracy patterns align more closely with community size and data availability than with structural complexity, refining earlier claims in the literature. In particular, languages with larger speaker communities and stronger digital representation, such as Spanish and English, revealed higher accuracy than less-resourced ones like Catalan and Greek. Overall, our findings suggest that model behavior is mainly driven by the richness of linguistic resources rather than by sensitivity to grammatical complexity, reflecting a form of performance that resembles human linguistic competence only superficially.

CLFeb 15, 2025
Fundamental Principles of Linguistic Structure are Not Represented by o3

Elliot Murphy, Evelina Leivada, Vittoria Dentella et al.

A core component of a successful artificial general intelligence would be the rapid creation and manipulation of grounded compositional abstractions and the demonstration of expertise in the family of recursive hierarchical syntactic objects necessary for the creative use of human language. We evaluated the recently released o3 model (OpenAI; o3-mini-high) and discovered that while it succeeds on some basic linguistic tests relying on linear, surface statistics (e.g., the Strawberry Test), it fails to generalize basic phrase structure rules; it fails with comparative sentences involving semantically illegal cardinality comparisons ('Escher sentences'); its fails to correctly rate and explain acceptability dynamics; and it fails to distinguish between instructions to generate unacceptable semantic vs. unacceptable syntactic outputs. When tasked with generating simple violations of grammatical rules, it is seemingly incapable of representing multiple parses to evaluate against various possible semantic interpretations. In stark contrast to many recent claims that artificial language models are on the verge of replacing the field of linguistics, our results suggest not only that deep learning is hitting a wall with respect to compositionality (Marcus 2022), but that it is hitting [a [stubbornly [resilient wall]]] that cannot readily be surmounted to reach human-like compositional reasoning simply through more compute.

CLSep 18, 2025
Large Language Model probabilities cannot distinguish between possible and impossible language

Evelina Leivada, Raquel Montero, Paolo Morosi et al.

A controversial test for Large Language Models concerns the ability to discern possible from impossible language. While some evidence attests to the models' sensitivity to what crosses the limits of grammatically impossible language, this evidence has been contested on the grounds of the soundness of the testing material. We use model-internal representations to tap directly into the way Large Language Models represent the 'grammatical-ungrammatical' distinction. In a novel benchmark, we elicit probabilities from 4 models and compute minimal-pair surprisal differences, juxtaposing probabilities assigned to grammatical sentences to probabilities assigned to (i) lower frequency grammatical sentences, (ii) ungrammatical sentences, (iii) semantically odd sentences, and (iv) pragmatically odd sentences. The prediction is that if string-probabilities can function as proxies for the limits of grammar, the ungrammatical condition will stand out among the conditions that involve linguistic violations, showing a spike in the surprisal rates. Our results do not reveal a unique surprisal signature for ungrammatical prompts, as the semantically and pragmatically odd conditions consistently show higher surprisal. We thus demonstrate that probabilities do not constitute reliable proxies for model-internal representations of syntactic knowledge. Consequently, claims about models being able to distinguish possible from impossible language need verification through a different methodology.

CLAug 22, 2025
ChatGPT-generated texts show authorship traits that identify them as non-human

Vittoria Dentella, Weihang Huang, Silvia Angela Mansi et al.

Large Language Models can emulate different writing styles, ranging from composing poetry that appears indistinguishable from that of famous poets to using slang that can convince people that they are chatting with a human online. While differences in style may not always be visible to the untrained eye, we can generally distinguish the writing of different people, like a linguistic fingerprint. This work examines whether a language model can also be linked to a specific fingerprint. Through stylometric and multidimensional register analyses, we compare human-authored and model-authored texts from different registers. We find that the model can successfully adapt its style depending on whether it is prompted to produce a Wikipedia entry vs. a college essay, but not in a way that makes it indistinguishable from humans. Concretely, the model shows more limited variation when producing outputs in different registers. Our results suggest that the model prefers nouns to verbs, thus showing a distinct linguistic backbone from humans, who tend to anchor language in the highly grammaticalized dimensions of tense, aspect, and mood. It is possible that the more complex domains of grammar reflect a mode of thought unique to humans, thus acting as a litmus test for Artificial Intelligence.