Aligned at the Start: Conceptual Groupings in LLM Embeddings
This paper addresses the problem of understanding and mitigating bias in LLMs, which is relevant to AI researchers and practitioners, though its approach appears incremental.
The paper analyzes input embeddings in large language models. It finds significant categorical community structure aligned with human concepts, shows that manipulating these groupings can reduce ethnicity bias in LLM tasks, and reports that the groupings align to a medium-to-high degree across models.
This paper shifts focus to the often-overlooked input embeddings: the initial representations fed into transformer blocks. Using fuzzy graph, k-nearest neighbor (k-NN), and community detection methods, we analyze embeddings from diverse LLMs and find significant categorical community structure that matches predefined concepts and categories aligned with human understanding. We observe that these groupings exhibit within-cluster organization (such as hierarchies and topological ordering), and we hypothesize a fundamental structure that precedes contextual processing. To further investigate the conceptual nature of these groupings, we explore cross-model alignment of input embeddings across different LLM categories and observe a medium to high degree of alignment. Furthermore, we provide evidence that manipulating these groupings can play a functional role in mitigating ethnicity bias in LLM tasks.
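To make the k-NN and community detection steps concrete, the following is a minimal sketch and not the paper's actual pipeline. It assumes a handful of made-up 2-D "embeddings", builds a symmetric k-NN graph from cosine similarity, and uses a simple deterministic label propagation as a stand-in for whatever community detection method the authors use; the fuzzy graph step is omitted. All names (`TOY_EMBEDDINGS`, `knn_graph`, `label_propagation`, `communities`) are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy input embeddings (real ones have hundreds of dimensions).
TOY_EMBEDDINGS = {
    "cat": (1.0, 0.1), "dog": (0.9, 0.2), "horse": (0.95, 0.05),
    "red": (0.1, 1.0), "blue": (0.2, 0.9), "green": (0.05, 0.95),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_graph(vectors, k):
    """Symmetric k-NN graph: i and j are linked if either is among
    the other's k most cosine-similar vectors."""
    n = len(vectors)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: -cosine(vectors[i], vectors[j]))
        for j in others[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def label_propagation(adj, max_iters=100):
    """Asynchronous label propagation with a fixed node order; ties are
    broken by the smallest label so the result is deterministic.
    This is an assumed stand-in for the paper's community detection."""
    labels = {i: i for i in adj}
    for _ in range(max_iters):
        changed = False
        for i in sorted(adj):
            if not adj[i]:
                continue
            counts = Counter(labels[j] for j in adj[i])
            top = max(counts.values())
            best = min(lab for lab, c in counts.items() if c == top)
            if labels[i] != best:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels

def communities(embeddings, k=2):
    """Group tokens by the community label assigned to their embedding."""
    tokens = list(embeddings)
    adj = knn_graph([embeddings[t] for t in tokens], k)
    labels = label_propagation(adj)
    groups = {}
    for idx, lab in labels.items():
        groups.setdefault(lab, set()).add(tokens[idx])
    return list(groups.values())

if __name__ == "__main__":
    for group in communities(TOY_EMBEDDINGS):
        print(sorted(group))
```

On this toy data the animal tokens and the color tokens fall into separate communities, which is the kind of categorical grouping the abstract describes. With real input embeddings one would likely swap in an approximate nearest-neighbor index and a modularity-based method, but those choices are assumptions here.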