CL, AI, LG | Jan 8

Concept Tokens: Learning Behavioral Embeddings Through Concept Definitions

arXiv:2601.04465v1 | 1 citation | h-index: 2
Originality: Incremental advance
AI Analysis

This addresses the challenge of controlling LLM behavior without retraining the model. It offers a lightweight control mechanism for applications such as reducing hallucinations and improving pedagogical feedback, though the approach is incremental.

The paper tackles behavior control in frozen large language models. It proposes Concept Tokens, a method that learns a single token embedding from natural-language definitions of a concept and uses that token to steer model outputs. Negating the learned hallucination token reduces hallucinated answers in closed-book question answering, mainly by increasing abstentions, and concept tokens preserve compliance with other instructions better than supplying the definitions in-context.

We propose Concept Tokens, a lightweight method that adds a new special token to a pretrained LLM and learns only its embedding from multiple natural language definitions of a target concept, where occurrences of the concept are replaced by the new token. The LLM is kept frozen and the embedding is optimized with the standard language-modeling objective. We evaluate Concept Tokens in three settings. First, we study hallucinations in closed-book question answering on HotpotQA and find a directional effect: negating the hallucination token reduces hallucinated answers mainly by increasing abstentions, whereas asserting it increases hallucinations and lowers precision. Second, we induce recasting, a pedagogical feedback strategy for second language teaching, and observe the same directional effect. Moreover, compared to providing the full definitional corpus in-context, concept tokens better preserve compliance with other instructions (e.g., asking follow-up questions). Finally, we include a qualitative study with the Eiffel Tower and a fictional "Austral Tower" to illustrate what information the learned embeddings capture and where their limitations emerge. Overall, Concept Tokens provide a compact control signal learned from definitions that can steer behavior in frozen LLMs.
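
The training setup described in the abstract (frozen model, one new token, only its embedding optimized with the language-modeling loss) can be illustrated with a toy sketch. This is not the authors' implementation: the "model" here is a hypothetical stand-in consisting of a single fixed output projection `W_out`, the vocabulary and definitions are invented for illustration, and only positions where the new token is the input contribute to the loss. A real setup would use a pretrained LLM in place of `next_token_probs`.

```python
import math
import random

random.seed(0)

# Hypothetical toy vocabulary; "<concept>" is the newly added special token.
VOCAB = ["a", "tower", "in", "paris", "made", "of", "iron", "<concept>"]
DIM = 4

# Frozen stand-in for the pretrained model: a fixed output projection
# with one row per vocabulary item. It is never updated during training.
W_out = [[random.gauss(0, 1) for _ in range(DIM)] for _ in VOCAB]

# Invented "definitions" in which the concept name is replaced by the new token.
DEFINITIONS = [
    ["<concept>", "in", "paris"],
    ["a", "<concept>", "made", "of", "iron"],
    ["<concept>", "in", "paris"],
]


def next_token_probs(embedding):
    """Softmax over the vocabulary given one input embedding (toy frozen model)."""
    logits = [sum(w * e for w, e in zip(row, embedding)) for row in W_out]
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def train_concept_embedding(definitions, steps=300, lr=0.1):
    """Learn only the new token's embedding with next-token cross-entropy."""
    # Targets are the tokens that follow "<concept>" in the definitions.
    targets = [
        VOCAB.index(nxt)
        for sent in definitions
        for cur, nxt in zip(sent, sent[1:])
        if cur == "<concept>"
    ]
    concept_vec = [0.0] * DIM  # the single trainable parameter vector
    losses = []
    for _ in range(steps):
        probs = next_token_probs(concept_vec)
        loss = 0.0
        grad = [0.0] * DIM
        for t in targets:
            loss -= math.log(probs[t])
            # Gradient of cross-entropy w.r.t. the input embedding:
            # sum over v of (p_v - 1[v == t]) * W_out[v]
            for v, row in enumerate(W_out):
                coeff = probs[v] - (1.0 if v == t else 0.0)
                for d in range(DIM):
                    grad[d] += coeff * row[d]
        n = len(targets)
        losses.append(loss / n)
        concept_vec = [c - lr * g / n for c, g in zip(concept_vec, grad)]
    return concept_vec, losses


concept_vec, losses = train_concept_embedding(DEFINITIONS)
print(f"loss before: {losses[0]:.3f}, after: {losses[-1]:.3f}")
```

The point of the sketch is the division of parameters: `W_out` plays the role of the frozen LLM and is only read, while `concept_vec` is the one vector that gradient descent updates.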

