TranSQL+: Serving Large Language Models with SQL on Low-Resource Hardware
This work addresses efficient LLM inference for users with low-resource hardware. It offers a novel deployment approach that leverages mature database features, though the contribution is incremental in that it builds on existing SQL and optimization techniques.
The paper tackles the challenge of deploying large language models on resource-constrained devices by introducing TranSQL+, a template-based code generator that translates LLM computation graphs into pure SQL queries for execution in relational databases. In low-memory and CPU-only configurations, it achieves up to 20x lower prefill latency and 4x higher decoding speed than DeepSpeed Inference and Llama.cpp.
Deploying Large Language Models (LLMs) on resource-constrained devices remains challenging due to limited memory, lack of GPUs, and the complexity of existing runtimes. In this paper, we introduce TranSQL+, a template-based code generator that translates LLM computation graphs into pure SQL queries for execution in relational databases. Without relying on external libraries, TranSQL+ leverages mature database features, such as vectorized execution and out-of-core processing, for efficient inference. We further propose a row-to-column (ROW2COL) optimization that improves join efficiency in matrix operations. Evaluated on Llama3-8B and DeepSeekMoE models, TranSQL+ achieves up to 20x lower prefill latency and 4x higher decoding speed compared to DeepSpeed Inference and Llama.cpp in low-memory and CPU-only configurations. Our results highlight relational databases as a practical environment for LLMs on low-resource hardware.
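
To make the idea of running matrix operations as SQL concrete, the following is a minimal sketch of a matrix multiplication expressed as a join followed by a grouped sum. It is not taken from the TranSQL+ templates and does not implement ROW2COL. The table names `A` and `B`, the one-row-per-element layout, the helper `matmul_sql`, and the use of SQLite through Python's `sqlite3` module are all assumptions made here so that the example is self-contained.

```python
import sqlite3


def matmul_sql(a, b):
    """Multiply two dense matrices (lists of rows) using a SQL join + GROUP BY.

    Hypothetical helper for illustration only: each matrix is stored as one
    row per element, (row index, column index, value).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE A (i INTEGER, k INTEGER, val REAL)")
    conn.execute("CREATE TABLE B (k INTEGER, j INTEGER, val REAL)")

    # Flatten each matrix into (row, col, value) tuples.
    conn.executemany(
        "INSERT INTO A VALUES (?, ?, ?)",
        [(i, k, v) for i, row in enumerate(a) for k, v in enumerate(row)],
    )
    conn.executemany(
        "INSERT INTO B VALUES (?, ?, ?)",
        [(k, j, v) for k, row in enumerate(b) for j, v in enumerate(row)],
    )

    # The join pairs elements that share the inner index k; the grouped SUM
    # reduces over k to produce each output cell (i, j).
    rows = conn.execute(
        """
        SELECT A.i, B.j, SUM(A.val * B.val)
        FROM A JOIN B ON A.k = B.k
        GROUP BY A.i, B.j
        ORDER BY A.i, B.j
        """
    ).fetchall()
    conn.close()

    # Rebuild a dense result matrix from the (i, j, value) rows.
    out = [[0.0] * len(b[0]) for _ in range(len(a))]
    for i, j, v in rows:
        out[i][j] = v
    return out


result = matmul_sql([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]])
print(result)
```

In this layout the join on the shared index is the dominant cost, which is presumably why the abstract singles out join efficiency in matrix operations as the target of its optimization.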