TRIM: Token Reduction and Inference Modeling for Cost-Effective Language Generation
This work addresses inference-cost reduction for LLM users in general knowledge domains, but the contribution is incremental, as it builds on existing efficiency methods.
The paper tackles the high inference cost of Large Language Models (LLMs) by proposing TRIM, a pipeline that generates distilled outputs from LLMs and reconstructs them into full narratives with a smaller model, saving 20.58% of tokens on average with minimal accuracy loss.
The inference cost of Large Language Models (LLMs) is a significant challenge due to their computational demands, especially on tasks requiring long outputs. However, natural language often contains redundancy, which presents an opportunity for optimization. We have observed that, when prompted appropriately, LLMs can generate distilled language, that is, concise outputs that retain essential meaning. We propose TRIM, a pipeline for saving computational cost in which a shorter distilled output from the LLM is reconstructed into a full narrative by a smaller model with lower inference costs. Our experiments show promising results, particularly in general knowledge domains, with 20.58% of tokens saved on average and only a small decrease in evaluation metrics, suggesting that this approach can effectively balance efficiency and accuracy in language processing tasks.
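The two-stage flow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `trim_pipeline`, `saved_token_pct`, `count_tokens`, and both toy models are hypothetical names introduced here, the whitespace token count stands in for a real tokenizer, and the savings formula (distilled length compared against a full-length answer) is an assumption about how the saved-token percentage might be computed.

```python
from typing import Callable, Tuple


def count_tokens(text: str) -> int:
    """Whitespace token count; an assumed stand-in for a real tokenizer."""
    return len(text.split())


def trim_pipeline(
    prompt: str,
    large_model: Callable[[str], str],
    small_model: Callable[[str], str],
) -> Tuple[str, str]:
    """Run the two stages: distilled generation, then reconstruction."""
    # Stage 1: the expensive LLM is prompted to answer in distilled language.
    distilled = large_model(prompt)
    # Stage 2: a cheaper model expands the distilled text into a full narrative.
    reconstructed = small_model(distilled)
    return distilled, reconstructed


def saved_token_pct(full_answer: str, distilled: str) -> float:
    """Percentage of large-model output tokens avoided (assumed definition)."""
    full_n = count_tokens(full_answer)
    if full_n == 0:
        return 0.0
    return 100.0 * (full_n - count_tokens(distilled)) / full_n


# Toy stand-ins so the sketch runs without any model; both are hypothetical.
FILLER = {"the", "a", "an", "is", "of"}
FULL_ANSWER = "The capital of France is Paris"


def toy_large_model(prompt: str) -> str:
    # Pretends the LLM emits a distilled answer by dropping filler words.
    return " ".join(w for w in FULL_ANSWER.split() if w.lower() not in FILLER)


def toy_small_model(distilled: str) -> str:
    # Pretends to reconstruct a narrative; a real system would use a small LM.
    return "The answer is: " + distilled


if __name__ == "__main__":
    distilled, reconstructed = trim_pipeline(
        "What is the capital of France?", toy_large_model, toy_small_model
    )
    print(distilled)
    print(reconstructed)
    print(saved_token_pct(FULL_ANSWER, distilled))
```

In a real setting the two callables would wrap a large and a small language model, and the saving would only pay off if the small model's reconstruction cost stays well below the cost of the tokens the large model no longer generates.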