Calla G. Beauregard

4.3 CL Jul 8

Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning

Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer et al.

Tokenization is a necessary component within the current architecture of many language models, including the transformer-based large language models (LLMs) of Generative AI, yet its impact on the model's cognition is often overlooked. We argue that LLMs demonstrate that the Distributional Hypothesis (DH) is sufficient for reasonably human-like language performance (particularly with respect to inferential lexical competence), and that the emergence of human-meaningful linguistic units among tokens and current structural constraints motivate changes to existing, linguistically agnostic tokenization techniques, particularly with respect to their roles as (1) vehicles for conveying salient distributional patterns from human language to the model and as (2) semantic primitives. We explore tokenizations from a BPE tokenizer; extant model vocabularies obtained from Hugging Face and tiktoken; and the information in exemplar token vectors as they move through the layers of a RoBERTa (large) model. Besides creating suboptimal semantic building blocks and obscuring the model's access to the necessary distributional patterns, we describe how tokens and pretraining can act as a backdoor for bias and other unwanted content, which current alignment practices may not remediate. Additionally, we relay evidence that the tokenization algorithm's objective function impacts the LLM's cognition, despite being arguably meaningfully insulated from the main system intelligence. Finally, we discuss implications for architectural choices, meaning construction, the primacy of language for thought, and LLM cognition. [First uploaded to arXiv in December, 2024.]

7.8 CY Jul 8

Buffy versus Bella: An archetypometric analysis and comparison

Calla Glavin Beauregard, Julia Witte Zimmerman, Ashley M. A. Fehr et al.

Fictional stories and characters embody and encode social norms, and their study is a powerful tool through which to understand culture and society. Vampire stories and folklore, in particular, have long both reflected and refracted people's preoccupation with disease, sexuality, death, and immortality. Here, we explore female main characters from two popular vampire franchises of the 21st century: Buffy Summers from the eponymous Buffy the Vampire Slayer and Bella Swan from the Twilight series. We employ the archetypometrics framework, built from 2,000 characters assessed across 464 semantic differential traits, to understand Buffy's and Bella's archetypes compared to one another and characters in their own stories, as well as within a larger societal context. While Buffy and Bella are female protagonists who share a focus on love and romance, they differ broadly on their underlying traits and overall archetypes. Buffy -- presented as a prototypical high school cheerleader -- largely bucks traditional gender norms as a strong Adventurer-Hero. Bella -- stylized as "not like the other girls" -- largely conforms to traditional gender norms as a weak Outcast archetype. In each instance, our use of archetypometrics offers a detailed, character-based lens for assessing female protagonists in contemporary vampire narratives, with clear potential for broader application across other storytelling forms.

6.3 CY Jul 9

The queer Hero versus the Fool bias of the queer trait: An archetypometric analysis of the collective portrayal of queerness in fictional stories

Ashley M. A. Fehr, Calla Glavin Beauregard, Julia Witte Zimmerman et al.

Visibility in media is pivotal for identity development and for broadening societal views of gender and sexuality. Queer representation has increased in recent years, yet damaging stereotypes and tropes persist. Here, we focus on queer portrayal and its perception by audiences in fictional stories (television, film, and literature) by studying characters through their quantified archetypes, which are operationalizations of common conceptions such as Hero, Diva, and Outcast. We use the archetypometrics and Fandom's LGBTQIA+ datasets to study samples of fictional characters along the trait differential spanning straight to queer. We find, quantify, and explain a seeming paradox. The characters with the highest queer score present positive primary archetypes and are typically Heroes rather than Fools, Angels rather than Demons, and Adventurers rather than Traditionalists. But evaluation across many stories for the straight-queer trait itself reveals a strong collective-writing bias towards Fool (away from Hero) and no meaningful loading for the other two dimensions. Our analysis offers a population-scale view of the complexities of queer portrayal, while also pointing to risks in blindly training on many-authored story corpora.

6.5 CY Mar 27

Archetypes and gender in fiction: A data-driven mapping of gender stereotypes in stories

Calla Glavin Beauregard, Julia Witte Zimmerman, Ashley M. A. Fehr et al.

Fictional character representations reflect social norms and biases. For example, women are relatively underrepresented in television and film, irrespective of genre, and are frequently stereotyped in these media. Here, we draw on a data-driven operationalization of archetypes -- archetypometrics -- to explore the characterization of 2,000 canonically male and female characters. From an overall space of six pairs of base archetypes, we find that canonically female characters tend more toward Hero, Adventurer, Diva, and Sophisticate archetypes, while male characters tend toward Fool, Traditionalist, Outcast, and Brute types. However, overarching patterns by gender nevertheless sustain traditional stereotypes: The seemingly positive heroic bias toward females is undercut by heroic female characters being more masculine than other female characters. We discuss the societal implications of skewed archetype representation by character gender.

2.7 CL Dec 19, 2025

Statistical laws and linguistics inform meaning in naturalistic and fictional conversation

Ashley M. A. Fehr, Calla G. Beauregard, Julia Witte Zimmerman et al.

Conversation is a cornerstone of social connection and is linked to well-being outcomes. Conversations vary widely in type, with some portion generating complex, dynamic stories. One approach to studying how conversations unfold in time is through statistical patterns such as Heaps' law, which holds that vocabulary size scales with document length. Little work on Heaps' law has looked at conversation or considered how language features impact scaling. We measure Heaps' law for conversations recorded in two distinct mediums: (1) strangers brought together on video chat and (2) fictional characters in movies. We find that scaling of vocabulary size differs by parts of speech. We discuss these findings through behavioral and linguistic frameworks.
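For readers unfamiliar with Heaps' law, it is commonly written as V(n) ≈ K·n^β, where V(n) is the number of distinct word types seen after n tokens. The sketch below is an illustration only and not the authors' method: it assumes the text has already been split into tokens, and the function names `vocabulary_growth` and `fit_heaps` are hypothetical. It estimates K and β by ordinary least squares on the log-log growth curve, which is one simple choice among several possible fitting procedures.

```python
import math


def vocabulary_growth(tokens):
    """Return (n, V(n)) pairs: tokens read so far, distinct types seen so far."""
    seen = set()
    growth = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        growth.append((n, len(seen)))
    return growth


def fit_heaps(growth):
    """Fit log V = log K + beta * log n by least squares; return (K, beta).

    Assumes at least two (n, V) points with distinct n.
    """
    xs = [math.log(n) for n, _ in growth]
    ys = [math.log(v) for _, v in growth]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                      # slope of the log-log line
    K = math.exp(mean_y - beta * mean_x)  # intercept, mapped back from log space
    return K, beta
```

To compare scaling across parts of speech, as the abstract describes, one could call `fit_heaps(vocabulary_growth(tokens))` separately on the token stream for each part of speech, given a tagger of one's choosing.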

2.7 CL Jun 26, 2025

A suite of allotaxonometric tools for the comparison of complex systems using rank-turbulence divergence

Jonathan St-Onge, Ashley M. A. Fehr, Carter Ward et al.

Describing and comparing complex systems requires principled, theoretically grounded tools. Built around the phenomenon of type turbulence, allotaxonographs provide map-and-list visual comparisons of pairs of heavy-tailed distributions. Allotaxonographs are designed to accommodate a wide range of instruments including rank- and probability-turbulence divergences, Jensen-Shannon divergence, and generalized entropy divergences. Here, we describe a suite of programmatic tools for rendering allotaxonographs for rank-turbulence divergence in MATLAB, JavaScript, and Python, all of which have different use cases.

2.7 CL Dec 14, 2024

Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning

Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer et al.

Tokenization is a necessary component within the current architecture of many language models, including the transformer-based large language models (LLMs) of Generative AI, yet its impact on the model's cognition is often overlooked. We argue that LLMs demonstrate that the Distributional Hypothesis (DH) is sufficient for reasonably human-like language performance (particularly with respect to inferential lexical competence), and that the emergence of human-meaningful linguistic units among tokens and current structural constraints motivate changes to existing, linguistically agnostic tokenization techniques, particularly with respect to their roles as (1) vehicles for conveying salient distributional patterns from human language to the model and as (2) semantic primitives. We explore tokenizations from a BPE tokenizer; extant model vocabularies obtained from Hugging Face and tiktoken; and the information in exemplar token vectors as they move through the layers of a RoBERTa (large) model. Besides creating suboptimal semantic building blocks and obscuring the model's access to the necessary distributional patterns, we describe how tokens and pretraining can act as a backdoor for bias and other unwanted content, which current alignment practices may not remediate. Additionally, we relay evidence that the tokenization algorithm's objective function impacts the LLM's cognition, despite being arguably meaningfully insulated from the main system intelligence. Finally, we discuss implications for architectural choices, meaning construction, the primacy of language for thought, and LLM cognition. [First uploaded to arXiv in December, 2024.]

7 Papers