CL, Nov 16, 2023

The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics

Nikolay Bogoychev, Pinzhen Chen, Barry Haddow, Alexandra Birch

arXiv:2311.09709v29.128 citations, h-index: 45

Originality: Incremental advance

AI Analysis

This work addresses efficiency problems in LLM deployment, but the advance is incremental: it adapts existing vocabulary trimming methods to LLMs.

The paper tackles the computational and memory challenges of deploying large language models by applying vocabulary trimming based on language heuristics, finding that it reduces memory usage by nearly 50% for small models and improves generation speed by up to 25%.

Deploying large language models (LLMs) is challenging because of their intensive computational and memory requirements. Our research examines vocabulary trimming (VT), inspired by restricting embedding entries to the language of interest, to bolster time and memory efficiency. While such modifications have proven effective in tasks like machine translation, tailoring them to LLMs demands specific adaptations given the diverse nature of LLM applications. We apply two language heuristics to trim the full vocabulary, Unicode-based script filtering and corpus-based selection, to different LLM families and sizes. The methods are straightforward, interpretable, and easy to implement. We find that VT reduces the memory usage of small models by nearly 50% and has an upper bound of 25% improvement in generation speed. However, these methods have limitations: they do not perform consistently well for each language, and their returns diminish in larger models.
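
The two heuristics named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy vocabulary, and the use of Unicode character-name prefixes as a stand-in for script detection are all assumptions made for illustration. The sketch selects token ids to keep, either by script or by occurrence in a tokenized corpus, and then slices the embedding rows accordingly.

```python
import unicodedata


def token_in_scripts(token, allowed_scripts):
    """Return True if every alphabetic character belongs to an allowed script.

    Assumption: a character's script is approximated by the prefix of its
    Unicode name (e.g. "LATIN", "CYRILLIC"). Non-alphabetic characters
    (digits, punctuation, word-boundary markers) are treated as neutral.
    """
    for ch in token:
        if not ch.isalpha():
            continue
        if not unicodedata.name(ch, "").startswith(allowed_scripts):
            return False
    return True


def trim_by_script(vocab, allowed_scripts=("LATIN",)):
    """Unicode-based script filtering: keep ids of tokens in the allowed scripts."""
    return sorted(i for tok, i in vocab.items()
                  if token_in_scripts(tok, allowed_scripts))


def trim_by_corpus(tokenized_corpus, always_keep=()):
    """Corpus-based selection: keep ids seen in the corpus plus reserved ids."""
    seen = set(always_keep)
    for ids in tokenized_corpus:
        seen.update(ids)
    return sorted(seen)


def slice_embeddings(embeddings, kept_ids):
    """Keep only the selected rows and return an old-id to new-id mapping."""
    old_to_new = {old: new for new, old in enumerate(kept_ids)}
    trimmed = [embeddings[old] for old in kept_ids]
    return trimmed, old_to_new


# Hypothetical toy vocabulary (token -> id), for illustration only.
vocab = {"<s>": 0, "▁the": 1, "▁при": 2, "вет": 3,
         "ing": 4, "中": 5, "42": 6, "é": 7}
embeddings = [[float(i), float(i)] for i in range(len(vocab))]

latin_ids = trim_by_script(vocab, ("LATIN",))
corpus_ids = trim_by_corpus([[1, 4], [1, 6]], always_keep=(0,))
trimmed, old_to_new = slice_embeddings(embeddings, latin_ids)

print(latin_ids)     # [0, 1, 4, 6, 7]
print(corpus_ids)    # [0, 1, 4, 6]
print(len(trimmed))  # 5
```

In a real model the same row selection would apply to both the input embedding matrix and the output projection, and the tokenizer would need to emit the remapped ids. Because only vocabulary-sized matrices shrink, the relative saving is smaller when the rest of the network dominates the parameter count, which is consistent with the diminishing returns reported for larger models.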
