AI | CL | LG | Dec 4, 2023

Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation

arXiv:2312.01648v3 | 11 citations | h-index: 17 | Has Code | ICML
Originality: Incremental advance
AI Analysis

This work addresses the lack of interpretability in LLMs, offering practical tools for safety and analysis, though it is incremental in applying geometric insights to existing models.

The authors tackled the problem of understanding internal representations in Large Language Models (LLMs) by analyzing their geometry, leading to theoretical results that enabled bypassing RLHF protection and improving toxicity detection with interpretable features.

Large Language Models (LLMs) drive current AI breakthroughs despite very little being known about their internal representations. In this work, we propose to shed light on LLMs' inner mechanisms through the lens of geometry. In particular, we develop in closed form $(i)$ the intrinsic dimension in which the Multi-Head Attention embeddings are constrained to exist and $(ii)$ the partition and per-region affine mappings of the feedforward (MLP) network of LLMs' layers. Our theoretical findings further enable the design of novel principled solutions applicable to state-of-the-art LLMs. First, we show that, through our geometric understanding, we can bypass LLMs' RLHF protection by controlling the embedding's intrinsic dimension through informed prompt manipulation. Second, we derive interpretable geometrical features that can be extracted from any (pre-trained) LLM, providing a rich abstract representation of their inputs. We observe that these features are sufficient to help solve toxicity detection, and even allow the identification of various types of toxicity. Our results demonstrate how, even in large-scale regimes, exact theoretical results can answer practical questions in LLMs. Code: https://github.com/RandallBalestriero/SplineLLM
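
To make the "partition and per-region affine mappings" idea concrete, here is a minimal sketch. It is not the authors' implementation (see their repository for that). It assumes a toy two-layer MLP with a ReLU activation and random weights; production LLMs often use GELU or gated activations, for which this exact construction would not apply. All names (`region_code`, `affine_map`, the layer sizes) are hypothetical and chosen for illustration. The sketch shows that the on/off pattern of the hidden units identifies a region of input space, and that inside that region the MLP coincides with a single affine map `A x + c`.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp(x, W1, b1, W2, b2):
    """Toy MLP: W2 @ relu(W1 @ x + b1) + b2 (ReLU is an assumption)."""
    h = [max(0.0, z + b) for z, b in zip(matvec(W1, x), b1)]
    return [y + b for y, b in zip(matvec(W2, h), b2)]

def region_code(x, W1, b1):
    """Binary on/off pattern of hidden units; identifies the region containing x."""
    return tuple(int(z + b > 0) for z, b in zip(matvec(W1, x), b1))

def affine_map(code, W1, b1, W2, b2):
    """Affine map valid on the region: A = W2 diag(q) W1, c = W2 diag(q) b1 + b2."""
    hidden = range(len(code))
    A = [[sum(W2[o][k] * code[k] * W1[k][j] for k in hidden)
          for j in range(len(W1[0]))]
         for o in range(len(W2))]
    c = [sum(W2[o][k] * code[k] * b1[k] for k in hidden) + b2[o]
         for o in range(len(W2))]
    return A, c

# Hypothetical sizes and random weights, purely for illustration.
rng = random.Random(0)
d_in, d_hidden, d_out = 3, 8, 2
W1 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]
b1 = [rng.gauss(0, 1) for _ in range(d_hidden)]
W2 = [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_out)]
b2 = [rng.gauss(0, 1) for _ in range(d_out)]

x = [rng.gauss(0, 1) for _ in range(d_in)]
q = region_code(x, W1, b1)
A, c = affine_map(q, W1, b1, W2, b2)

# The region's affine map should reproduce the MLP output at x.
y_affine = [a + ci for a, ci in zip(matvec(A, x), c)]
y_mlp = mlp(x, W1, b1, W2, b2)
max_err = max(abs(u - v) for u, v in zip(y_affine, y_mlp))
print("region code:", q)
print("max |affine - mlp|:", max_err)
```

The reported difference should be at the level of floating-point rounding. Quantities derived from the region, such as the code `q` or the map `(A, c)`, are the kind of geometric description the abstract alludes to, although the exact features the paper extracts are not specified here.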

Code Implementations: 1 repo
