CL | AI | Apr 3

Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution

arXiv:2604.03472 | 44.4 | h-index: 5
AI Analysis

Addresses the diversity collapse problem in co-evolutionary self-play for LLMs, enabling more effective autonomous curriculum learning without human supervision.

Co-evolutionary self-play for LLMs suffers from diversity collapse, where the proposer generates only a narrow set of problems. Vocabulary dropout, a random mask on the proposer's output logits, sustains diversity and improves solver accuracy by 4.4 points on average for Qwen3-8B on mathematical reasoning.

Co-evolutionary self-play, where one language model generates problems and another solves them, promises autonomous curriculum learning without human supervision. In practice, the proposer quickly converges to a narrow distribution of problems that satisfy the reward function. This diversity collapse renders the curriculum uninformative for the solver, stalling the co-evolutionary loop. We introduce vocabulary dropout, a random mask applied to the proposer's output logits during both policy training and curriculum generation, as a lightweight mechanism to sustain diversity. The mask is hard and non-stationary, preventing the proposer from locking into fixed token sequences. Training Qwen3-4B and Qwen3-8B on mathematical reasoning via R-Zero, we find that vocabulary dropout sustains proposer diversity across lexical, semantic, and functional metrics throughout training, and yields solver improvements averaging +4.4 points at 8B, with the largest gains on competition-level benchmarks. Our findings suggest that explicit action-space constraints, analogous to the structural role that game rules play in classical self-play, can help sustain productive co-evolution in language. Vocabulary dropout is one simple instantiation of this principle.
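To make the mechanism concrete, here is a minimal sketch of a hard, non-stationary mask on output logits, written in plain Python. It is not the paper's implementation. The function names, the independent per-token drop probability `drop_prob`, the rule that at least one token is always kept, and the choice to resample the mask at every decoding step are all assumptions made for illustration; the abstract states only that the mask is random, hard, and non-stationary.

```python
import math
import random


def vocabulary_dropout(logits, drop_prob, rng):
    """Return a copy of `logits` with a random subset of token ids set to -inf.

    Assumption: each token id is dropped independently with probability
    `drop_prob`. One token is always kept so the softmax stays well defined.
    """
    keep = [rng.random() >= drop_prob for _ in logits]
    if not any(keep):
        keep[rng.randrange(len(logits))] = True
    # Hard mask: dropped tokens get -inf, so their probability is exactly zero.
    return [x if k else -math.inf for x, k in zip(logits, keep)]


def softmax(logits):
    """Numerically stable softmax; math.exp(-inf) evaluates to 0.0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_tokens(logits_per_step, drop_prob, seed=0):
    """Sample one token id per step, drawing a fresh mask at every step.

    Assumption: resampling per step is one way to make the mask
    non-stationary; the source does not state the resampling schedule.
    """
    rng = random.Random(seed)
    tokens = []
    for logits in logits_per_step:
        masked = vocabulary_dropout(logits, drop_prob, rng)
        probs = softmax(masked)
        tokens.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return tokens
```

Because the mask changes between calls, a token sequence that was available at one step may be unavailable at the next, which is the property the abstract credits with preventing the proposer from locking into fixed token sequences. In a real training setup the same masking would be applied to the model's logit tensor during both policy training and curriculum generation, as the abstract describes.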

