Milo Z. Trujillo

CL
h-index: 36
Papers: 5
Citations: 21
Novelty: 20%
AI Score: 25

5 Papers

CL · Jun 11, 2023
A blind spot for large language models: Supradiegetic linguistic information

Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer et al.

Large Language Models (LLMs) like ChatGPT reflect profound changes in the field of Artificial Intelligence, achieving a linguistic fluency that is impressively, even shockingly, human-like. The extent of their current and potential capabilities is an active area of investigation by no means limited to scientific researchers. It is common for people to frame the training data for LLMs as "text" or even "language". We examine the details of this framing using ideas from several areas, including linguistics, embodied cognition, cognitive science, mathematics, and history. We propose that considering what it is like to be an LLM like ChatGPT, as Nagel might have put it, can help us gain insight into its capabilities in general, and in particular, that its exposure to linguistic training data can be productively reframed as exposure to the diegetic information encoded in language, and its deficits can be reframed as ignorance of extradiegetic information, including supradiegetic linguistic information. Supradiegetic linguistic information consists of those arbitrary aspects of the physical form of language that are not derivable from the one-dimensional relations of context -- frequency, adjacency, proximity, co-occurrence -- that LLMs like ChatGPT have access to. Roughly speaking, the diegetic portion of a word can be thought of as its function, its meaning, as the information in a theoretical vector in a word embedding, while the supradiegetic portion of the word can be thought of as its form, like the shapes of its letters or the sounds of its syllables. We use these concepts to investigate why LLMs like ChatGPT have trouble handling palindromes, the visual characteristics of symbols, translating Sumerian cuneiform, and continuing integer sequences.

SI · May 7, 2025
From Flowers to Fascism? The Cottagecore to Tradwife Pipeline on Tumblr

Oliver Mel Allen, Yi Zu, Milo Z. Trujillo et al.

In this work we collected and analyzed social media posts to investigate aesthetic-based radicalization where users searching for Cottagecore content may find Tradwife content co-opted by white supremacists, white nationalists, or other far-right extremist groups. Through quantitative analysis of over 200,000 Tumblr posts and qualitative coding of about 2,500 Tumblr posts, we did not find evidence of explicit radicalization. We found that problematic Tradwife posts found in the literature may be confined to Tradwife-only spaces, while content in the Cottagecore tag generally did not warrant extra moderation. However, we did find evidence of a mainstreaming effect in the overlap between the Tradwife and Cottagecore communities. In our qualitative analysis there was more interaction between queer and Tradwife identities than expected based on the literature, and some Tradwives even explicitly included queer people and disavowed racism in the Tradwife community on Tumblr. This could be genuine, but more likely it was an example of extremists re-branding their content and following platform norms to spread ideologies that would otherwise be rejected by Tumblr users. Additionally, through temporal analysis we observed a change in the central tags used by Tradwives in the Cottagecore tag pre- and post-2021. Initially these posts focused on aesthetics and hobbies like baking and gardening, but post-2021 the central tags focused more on religion, traditional gender roles, and homesteading, all markers of reactionary ideals.

CY · Jun 29, 2021 · Code
The penumbra of open source: projects outside of centralized platforms are longer maintained, more academic and more collaborative

Milo Z. Trujillo, Laurent Hébert-Dufresne, James Bagrow

GitHub has become the central online platform for much of open source, hosting most open source code repositories. With this popularity, the public digital traces of GitHub are now a valuable means to study teamwork and collaboration. In many ways, however, GitHub is a convenience sample, and may not be representative of open source development off the platform. Here we develop a novel, extensive sample of public open source project repositories outside of centralized platforms. We characterize these projects along a number of dimensions and compare them to a time-matched sample of corresponding GitHub projects. Our sample projects tend to have more collaborators, are maintained for longer periods, and tend to be more focused on academic and scientific problems.

SE · Mar 19, 2021 · Code
Which contributions count? Analysis of attribution in open source

Jean-Gabriel Young, Amanda Casari, Katie McLaughlin et al.

Open source software projects usually acknowledge contributions with text files, websites, and other idiosyncratic methods. These data sources are hard to mine, which is why contributorship is most frequently measured through changes to repositories, such as commits, pushes, or patches. Recently, some open source projects have taken to recording contributor actions with standardized systems; this opens up a unique opportunity to understand how community-generated notions of contributorship map onto codebases as the measure of contribution. Here, we characterize contributor acknowledgment models in open source by analyzing thousands of projects that use a model called All Contributors to acknowledge diverse contributions like outreach, finance, infrastructure, and community management. We analyze the life cycle of projects through this model's lens and contrast its representation of contributorship with the picture given by other methods of acknowledgment, including GitHub's top committers indicator and contributions derived from actions taken on the platform. We find that community-generated systems of contribution acknowledgment make work like idea generation or bug finding more visible, which generates a more extensive picture of collaboration. Further, we find that models requiring explicit attribution lead to more clearly defined boundaries around what is and what is not a contribution.

CL · Dec 14, 2024
Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning

Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer et al.

Tokenization is a necessary component within the current architecture of many language models, including the transformer-based large language models (LLMs) of Generative AI, yet its impact on the model's cognition is often overlooked. We argue that LLMs demonstrate that the Distributional Hypothesis (DH) is sufficient for reasonably human-like language performance (particularly with respect to inferential lexical competence), and that the emergence of human-meaningful linguistic units among tokens and current structural constraints motivate changes to existing, linguistically-agnostic tokenization techniques, particularly with respect to their roles as (1) vehicles for conveying salient distributional patterns from human language to the model and as (2) semantic primitives. We explore tokenizations from a BPE tokenizer; extant model vocabularies obtained from Hugging Face and tiktoken; and the information in exemplar token vectors as they move through the layers of a RoBERTa (large) model. Besides creating suboptimal semantic building blocks and obscuring the model's access to the necessary distributional patterns, we describe how tokens and pretraining can act as a backdoor for bias and other unwanted content, which current alignment practices may not remediate. Additionally, we relay evidence that the tokenization algorithm's objective function impacts the LLM's cognition, despite being arguably meaningfully insulated from the main system intelligence. Finally, we discuss implications for architectural choices, meaning construction, the primacy of language for thought, and LLM cognition. [First uploaded to arXiv in December, 2024.]