CL, AI, LG | Jul 2, 2024

PromptIntern: Saving Inference Costs by Internalizing Recurrent Prompt during Large Language Model Fine-tuning

arXiv:2407.02211v2 | 30 citations | h-index: 25
Originality: Highly original
AI Analysis

PromptIntern addresses expensive, slow inference for users of fine-tuned LLMs on domain-specific tasks, offering a significant cost saving.

The paper tackles the high computational cost and slow inference that lengthy prompts impose on fine-tuned large language models. It introduces PromptIntern, which internalizes prompt knowledge during fine-tuning. On NL2Code tasks, the method reduces input tokens by more than 90%, speeds up inference by 4.2x, and lowers monetary inference costs by 88.3%.

Recent advances in fine-tuning large language models (LLMs) have greatly enhanced their usage in domain-specific tasks. Despite the success, fine-tuning continues to rely on repeated and lengthy prompts, which escalate computational expenses, require more resources, and lead to slower inference. In this paper, we present a novel approach, PromptIntern, which internalizes prompt knowledge during model fine-tuning to achieve efficient inference and save costs. Instead of compressing the prompts for a vanilla model, PromptIntern aims to embed the recurrent prompt directly into the model parameters. We design a fine-tuning pipeline that includes instruction template compression, few-shot example absorption, and a progressive internalization strategy, effectively diminishing the need for intricate prompts during inference. Comprehensive experiments on challenging NL2Code tasks demonstrate that our method reduces input tokens by more than 90%, accelerates inference by 4.2 times, and reduces monetary inference costs by 88.3%.
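The abstract names three pipeline pieces (instruction template compression, few-shot example absorption, and a progressive internalization strategy) without giving details. The sketch below is one possible reading, not the authors' implementation: it assumes a linear schedule across fine-tuning stages, and it stands in naive word truncation for template compression. The names `Stage`, `linear_schedule`, and `build_prompt` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    """One fine-tuning stage in a hypothetical internalization schedule."""
    template_keep_ratio: float  # fraction of instruction-template words kept
    num_shots: int              # number of few-shot examples kept


def linear_schedule(num_stages: int, max_shots: int) -> List[Stage]:
    """Shrink the template and the few-shot count linearly down to zero.

    Assumption: the paper's actual schedule may differ; linear decay is
    used here only to illustrate the idea of progressive internalization.
    """
    if num_stages < 2:
        raise ValueError("need at least two stages")
    stages = []
    for i in range(num_stages):
        remaining = 1.0 - i / (num_stages - 1)  # 1.0 at first stage, 0.0 at last
        stages.append(Stage(remaining, round(max_shots * remaining)))
    return stages


def build_prompt(template: str, shots: List[str], query: str, stage: Stage) -> str:
    """Assemble the training input for one stage.

    Template "compression" is plain word truncation here, a placeholder
    for whatever compressor is actually used.
    """
    words = template.split()
    kept = words[: round(len(words) * stage.template_keep_ratio)]
    parts = [" ".join(kept)] + shots[: stage.num_shots] + [query]
    return "\n".join(p for p in parts if p)  # drop empty pieces


if __name__ == "__main__":
    template = "Translate request into bash"
    shots = ["list files -> ls", "show path -> pwd"]
    for stage in linear_schedule(num_stages=3, max_shots=2):
        print(repr(build_prompt(template, shots, "print date ->", stage)))
```

Under this reading, the last stage trains on the query alone, which matches the abstract's goal of dropping the recurrent prompt at inference time.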

