CL · AI · Oct 20, 2025

DVAGen: Dynamic Vocabulary Augmented Generation

arXiv:2510.17115v1 · 1 citation · h-index: 4 · Has Code · EMNLP
Originality: Incremental advance
AI Analysis

The paper addresses a flexibility limitation of language models in handling diverse token combinations, but the advance is incremental because it builds on existing dynamic vocabulary approaches.

The paper targets the difficulty language models have with novel or out-of-vocabulary words. It introduces DVAGen, a unified open-source framework for dynamic vocabulary-augmented generation, whose batch inference support improves inference throughput.

Language models trained with a fixed vocabulary struggle to generalize to novel or out-of-vocabulary words, limiting their flexibility in handling diverse token combinations. Existing dynamic vocabulary approaches attempt to address this limitation but face challenges such as fragmented codebases, lack of support for modern LLMs, and limited inference scalability. To overcome these issues, we introduce DVAGen, a fully open-source, unified framework designed for training, evaluation, and visualization of dynamic vocabulary-augmented language models. Our framework modularizes the pipeline for ease of customization, integrates seamlessly with open-source LLMs, and is the first to provide both CLI and WebUI tools for real-time result inspection. We validate the effectiveness of dynamic vocabulary methods on modern LLMs and demonstrate support for batch inference, significantly improving inference throughput.
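
To make the fixed-vocabulary limitation concrete, the toy sketch below segments a string with a greedy longest-match rule, first against a small fixed vocabulary and then against the same vocabulary augmented with a phrase added at inference time. It is not DVAGen's code or API: the function `greedy_segment`, the vocabulary contents, and the phrase set are all hypothetical, and real dynamic vocabulary methods score phrases with learned representations rather than string matching. It only illustrates why adding phrases as single units can reduce the number of generation steps.

```python
from typing import List, Set


def greedy_segment(text: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match segmentation; unmatched characters become single units."""
    units: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so multi-character entries win.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                units.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches here: fall back to one character.
            units.append(text[i])
            i += 1
    return units


# Hypothetical fixed vocabulary of subword pieces (illustrative only).
FIXED_VOCAB = {"dyn", "amic", " ", "vocab", "ulary", "gen", "eration"}

# Hypothetical phrases added at inference time as single generation units.
DYNAMIC_PHRASES = {"dynamic vocabulary"}

text = "dynamic vocabulary generation"
static_units = greedy_segment(text, FIXED_VOCAB)
augmented_units = greedy_segment(text, FIXED_VOCAB | DYNAMIC_PHRASES)

print(len(static_units), static_units)
print(len(augmented_units), augmented_units)
```

With the fixed vocabulary alone, the phrase is split into several subword pieces; with the phrase added, it is emitted as one unit, so the same text takes fewer units to produce.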

