AI LG Aug 22, 2024

TensorOpera Router: A Multi-Model Router for Efficient LLM Inference

Dimitris Stripelis, Zijian Hu, Jipeng Zhang, Zhaozhuo Xu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Salman Avestimehr, Chaoyang He

arXiv:2408.12320v332.657 citations h-index: 33

Originality: Incremental advance

AI Analysis

The paper addresses the high cost and inefficiency of LLM inference for users who need quick, high-quality responses, though it is an incremental improvement over existing routing methods.

The paper tackles the challenge of efficiently balancing quality, cost, and speed in LLM inference by introducing TO-Router, a multi-model router that dynamically routes queries to the best expert model, resulting in up to 40% improved efficiency, 30% cost reduction, and 10% performance enhancement.

With the rapid growth of Large Language Models (LLMs) across various domains, numerous new LLMs have emerged, each possessing domain-specific expertise. This proliferation has highlighted the need for quick, high-quality, and cost-effective LLM query response methods. Yet no single LLM efficiently balances this trilemma. Some models are powerful but extremely costly, while others are fast and inexpensive but qualitatively inferior. To address this challenge, we present TO-Router, a non-monolithic LLM querying system that seamlessly integrates various LLM experts into a single query interface and dynamically routes incoming queries to the highest-performing expert based on the query's requirements. Through extensive experiments, we demonstrate that, compared to standalone expert models, TO-Router improves query efficiency by up to 40% and leads to significant cost reductions of up to 30%, while maintaining or enhancing model performance by up to 10%.
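
To make the routing idea concrete, here is a minimal sketch of a query router that trades off quality, cost, and latency. It is not the TO-Router implementation: the abstract does not specify how experts are scored, so the `Expert` class, the `route` function, the keyword-based quality predictors, the weights, and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Expert:
    """A hypothetical expert model with made-up cost and latency figures."""
    name: str
    cost_per_query: float            # arbitrary cost units (assumption)
    latency_s: float                 # seconds per response (assumption)
    quality: Callable[[str], float]  # predicted quality in [0, 1] (assumption)


def route(query: str, experts: List[Expert],
          w_quality: float = 1.0,
          w_cost: float = 0.5,
          w_latency: float = 0.1) -> Expert:
    """Pick the expert with the best weighted quality/cost/latency score."""
    def score(expert: Expert) -> float:
        return (w_quality * expert.quality(query)
                - w_cost * expert.cost_per_query
                - w_latency * expert.latency_s)
    return max(experts, key=score)


# Toy quality predictors; a real router would presumably learn these from data.
EXPERTS = [
    Expert("large-general", cost_per_query=1.0, latency_s=2.0,
           quality=lambda q: 0.9),
    Expert("small-code", cost_per_query=0.1, latency_s=0.5,
           quality=lambda q: 0.85 if "code" in q.lower() else 0.2),
]

if __name__ == "__main__":
    for q in ["Write code to sort a list", "Summarize this contract"]:
        print(q, "->", route(q, EXPERTS).name)
```

With these illustrative numbers, the coding query goes to the cheap specialist, while the general query goes to the larger model because the specialist's predicted quality is too low to justify its savings.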
