Yoshi Meke Bird

9.4CLJul 8

Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning

Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer et al.

Tokenization is a necessary component within the current architecture of many language mod-els, including the transformer-based large language models (LLMs) of Generative AI, yet its impact on the model's cognition is often overlooked. We argue that LLMs demonstrate that the Distributional Hypothesis (DH) is sufficient for reasonably human-like language performance (particularly with respect to inferential lexical competence), and that the emergence of human-meaningful linguistic units among tokens and current structural constraints motivate changes to existing, linguistically-agnostic tokenization techniques, particularly with respect to their roles as (1) vehicles for conveying salient distributional patterns from human language to the model and as (2) semantic primitives. We explore tokenizations from a BPE tokenizer; extant model vocabularies obtained from Hugging Face and tiktoken; and the information in exemplar token vectors as they move through the layers of a RoBERTa (large) model. Besides creating suboptimal semantic building blocks and obscuring the model's access to the necessary distributional patterns, we describe how tokens and pretraining can act as a backdoor for bias and other unwanted content, which current alignment practices may not remediate. Additionally, we relay evidence that the tokenization algorithm's objective function impacts the LLM's cognition, despite being arguably meaningfully insulated from the main system intelligence. Finally, we discuss implications for architectural choices, meaning construction, the primacy of language for thought, and LLM cognition. [First uploaded to arXiv in December, 2024.]

2.7CLDec 14, 2024

Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning

Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer et al.

Tokenization is a necessary component within the current architecture of many language mod-els, including the transformer-based large language models (LLMs) of Generative AI, yet its impact on the model's cognition is often overlooked. We argue that LLMs demonstrate that the Distributional Hypothesis (DH) is sufficient for reasonably human-like language performance (particularly with respect to inferential lexical competence), and that the emergence of human-meaningful linguistic units among tokens and current structural constraints motivate changes to existing, linguistically-agnostic tokenization techniques, particularly with respect to their roles as (1) vehicles for conveying salient distributional patterns from human language to the model and as (2) semantic primitives. We explore tokenizations from a BPE tokenizer; extant model vocabularies obtained from Hugging Face and tiktoken; and the information in exemplar token vectors as they move through the layers of a RoBERTa (large) model. Besides creating suboptimal semantic building blocks and obscuring the model's access to the necessary distributional patterns, we describe how tokens and pretraining can act as a backdoor for bias and other unwanted content, which current alignment practices may not remediate. Additionally, we relay evidence that the tokenization algorithm's objective function impacts the LLM's cognition, despite being arguably meaningfully insulated from the main system intelligence. Finally, we discuss implications for architectural choices, meaning construction, the primacy of language for thought, and LLM cognition. [First uploaded to arXiv in December, 2024.]

1.2SIMay 15, 2023

An assessment of measuring local levels of homelessness through proxy social media signals

Yoshi Meke Bird, Sarah E. Grobe, Michael V. Arnold et al.

Recent studies suggest social media activity can function as a proxy for measures of state-level public health, detectable through natural language processing. We present results of our efforts to apply this approach to estimate homelessness at the state level throughout the US during the period 2010-2019 and 2022 using a dataset of roughly 1 million geotagged tweets containing the substring ``homeless.'' Correlations between homelessness-related tweet counts and ranked per capita homelessness volume, but not general-population densities, suggest a relationship between the likelihood of Twitter users to personally encounter or observe homelessness in their everyday lives and their likelihood to communicate about it online. An increase to the log-odds of ``homeless'' appearing in an English-language tweet, as well as an acceleration in the increase in average tweet sentiment, suggest that tweets about homelessness are also affected by trends at the nation-scale. Additionally, changes to the lexical content of tweets over time suggest that reversals to the polarity of national or state-level trends may be detectable through an increase in political or service-sector language over the semantics of charity or direct appeals. An analysis of user account type also revealed changes to Twitter-use patterns by accounts authored by individuals versus entities that may provide an additional signal to confirm changes to homelessness density in a given jurisdiction. While a computational approach to social media analysis may provide a low-cost, real-time dataset rich with information about nationwide and localized impacts of homelessness and homelessness policy, we find that practical issues abound, limiting the potential of social media as a proxy to complement other measures of homelessness.

Yoshi Meke Bird

3 Papers