AI, CL, LG, SY | May 29, 2023

Taming AI Bots: Controllability of Neural States in Large Language Models

arXiv:2305.18449v1 | 24 citations
Originality: Incremental advance
AI Analysis

The paper addresses AI safety and adversarial control for users of large language models. The advance is incremental: it builds on existing formalizations of meaning and of LLM training.

The paper investigates whether AI bots based on large language models can be steered to any state via prompts. It shows that a well-trained bot can reach any meaning, albeit with small probability, and that under a stronger notion of controllability, defined as almost certain reachability, the bot is controllable when restricted to the space of meanings.

We tackle the question of whether an agent can, by suitable choice of prompts, control an AI bot to any state. To that end, we first introduce a formal definition of "meaning" that is amenable to analysis. Then, we characterize "meaningful data" on which large language models (LLMs) are ostensibly trained, and "well-trained LLMs" through conditions that are largely met by today's LLMs. While a well-trained LLM constructs an embedding space of meanings that is Euclidean, meanings themselves do not form a vector (linear) subspace, but rather a quotient space within. We then characterize the subset of meanings that can be reached by the state of the LLMs for some input prompt, and show that a well-trained bot can reach any meaning albeit with small probability. We then introduce a stronger notion of controllability as almost certain reachability, and show that, when restricted to the space of meanings, an AI bot is controllable. We do so after introducing a functional characterization of attentive AI bots, and finally derive necessary and sufficient conditions for controllability. The fact that AI bots are controllable means that an adversary could steer them towards any state. However, the sampling process can be designed to counteract adverse actions and avoid reaching undesirable regions of state space before their boundary is crossed.

