CL | Mar 31

Concept Training for Human-Aligned Language Models

Christine Zhang, Dan Jurafsky, Chen Shani

arXiv:2603.29123 | 8.6 | h-index: 6

AI Analysis

This paper addresses semantic misalignment in language models for NLP applications, though it is an incremental improvement over existing training objectives.

The paper tackles a limitation of next-token prediction, which treats semantically similar continuations as mutually exclusive targets, by introducing concept training, which predicts sets of semantically related tokens instead of single tokens. The results show improved alignment with human semantic similarity judgments on lexical benchmarks, with lower perplexity on meaningful words but a modest increase in global perplexity.

The next-token prediction (NTP) objective trains language models to predict a single continuation token at each step. In natural language, however, a prefix can be continued in many valid ways, and even similar meanings may differ in surface form. For example, the sentence "this website is safe to browse" could plausibly continue with words such as browse, search, visit, surf, or navigate. While standard NTP training treats these alternatives as mutually exclusive targets, we explore a framework that instead predicts concepts, approximated as sets of semantically related tokens. We show that models trained with concept supervision exhibit stronger alignment with human semantic similarity judgments on multiple lexical benchmarks. These gains are accompanied by lower perplexity on semantically meaningful words (definition in Section 3.1), and a modest increase in global token-level perplexity, reflecting a tradeoff between standard NTP optimization and concept-level supervision. Our results suggest that concept-level objectives can improve semantic alignment while maintaining competitive language modeling performance.
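The contrast between single-token and set-valued targets can be illustrated with a small sketch. The abstract does not state the exact loss the authors use, so the following is an assumption: it scores a concept by the total probability mass the model places on its member tokens, and compares that with ordinary cross-entropy on one token. The function names (`ntp_loss`, `concept_loss`), the toy vocabulary, and the logits are hypothetical and chosen only for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def ntp_loss(logits, target_id):
    """Standard next-token loss: negative log-probability of one token."""
    return -log_softmax(logits)[target_id]


def concept_loss(logits, concept_ids):
    """Assumed set-valued loss (not confirmed by the source):
    negative log of the total probability assigned to the concept set."""
    log_probs = log_softmax(logits)
    selected = [log_probs[i] for i in set(concept_ids)]
    m = max(selected)
    # log-sum-exp over the selected log-probabilities
    return -(m + math.log(sum(math.exp(lp - m) for lp in selected)))


# Toy example built on the abstract's "this website is safe to ..." prefix.
vocab = ["browse", "search", "visit", "surf", "navigate", "eat"]
logits = [2.0, 1.5, 1.8, 0.5, 1.0, -1.0]  # made-up model scores
concept = [0, 1, 2, 3, 4]                  # hypothetical concept set

single = ntp_loss(logits, vocab.index("browse"))
grouped = concept_loss(logits, concept)
print(f"NTP loss on 'browse': {single:.3f}")
print(f"Concept loss on set:  {grouped:.3f}")
```

Under this assumed formulation, probability placed on "visit" or "search" is no longer penalized when the reference token is "browse", which is one way the tradeoff described in the abstract (better semantic alignment, slightly higher token-level perplexity) could arise.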
