tLoRA: Efficient Multi-LoRA Training with Elastic Shared Super-Models
This work addresses a bottleneck in fine-tuning large language models for users in shared computing environments. It offers a significant performance boost, but the contribution is incremental because it builds on existing LoRA and distributed training methods.
The paper tackles the inefficiency of running concurrent LoRA training jobs on shared clusters. It introduces tLoRA, a framework that batches such jobs through an elastic shared super-model and adaptive scheduling, and reports 1.2-1.8x higher training throughput and 2.3-5.4x faster job completion times.
As Low-Rank Adaptation (LoRA) becomes the standard approach for efficiently fine-tuning large language models (LLMs), shared clusters increasingly execute many concurrent LoRA training jobs over the same frozen backbone. While recent advances enable batching (co-locating) multiple adapters during serving, efficient training-time co-location of heterogeneous LoRA adapters presents unique challenges. Jobs often differ in adapter rank, batch size, and resource allocation, and naïve batching can introduce synchronization stalls, communication overheads, and per-job slowdowns that are worse than executing independently. We introduce tLoRA, a framework that enables efficient batch training of multiple LoRA jobs. tLoRA fuses adapters that share the same base model into an elastic shared super-model, exploiting existing distributed training frameworks to derive parallelism plans that share resources effectively. At the kernel level, tLoRA employs a fused LoRA kernel that adaptively reconstructs low-rank computation tiles and schedules rank-aware nano-batches to maximize overlap between computation and communication across adapters. At the scheduling layer, tLoRA incorporates an online, residual-capacity-aware scheduler that adaptively groups jobs to maximize collective throughput. Evaluations using real-world cluster traces demonstrate that tLoRA improves training throughput by 1.2-1.8x, job training completion time by 2.3-5.4x, and GPU utilization by 37%.
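To make the idea of co-locating adapters of different ranks over one frozen backbone concrete, the following is a minimal sketch in plain Python. It is not tLoRA's fused kernel and not taken from the paper: it only illustrates that several LoRA jobs sharing a frozen weight W can be evaluated in one pass by concatenating their low-rank factors along the rank axis and masking out the columns that belong to other jobs. All names (`fused_forward`, `lora_forward`, `rand_matrix`) are hypothetical, the usual alpha/r scaling factor is omitted, and the sizes are arbitrary.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rand_matrix(rows, cols, rng):
    """Hypothetical helper: random matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def lora_forward(x, w, a, b):
    """Reference path for a single job: y = x W + (x A) B."""
    return add(matmul(x, w), matmul(matmul(x, a), b))


def fused_forward(batches, w, adapters):
    """One pass over the shared frozen weight W for all jobs (illustrative only)."""
    # Stack every job's rows so the base matmul runs once.
    stacked = [row for x in batches for row in x]
    row_job = [j for j, x in enumerate(batches) for _ in x]
    base = matmul(stacked, w)

    # Concatenate A_j along the rank axis and stack B_j along the same axis.
    a_cat = [[v for a, _ in adapters for v in a[i]] for i in range(len(w))]
    b_cat = [row for _, b in adapters for row in b]
    col_job = [j for j, (_, b) in enumerate(adapters) for _ in b]

    # Zero the rank columns that belong to other jobs' adapters.
    h = matmul(stacked, a_cat)
    h = [[v if col_job[c] == row_job[r] else 0.0 for c, v in enumerate(row)]
         for r, row in enumerate(h)]
    out = add(base, matmul(h, b_cat))

    # Split the fused output back into per-job results.
    results, start = [], 0
    for x in batches:
        results.append(out[start:start + len(x)])
        start += len(x)
    return results


# Two jobs with different ranks and batch sizes (arbitrary example values).
rng = random.Random(0)
d_in, d_out = 4, 3
w = rand_matrix(d_in, d_out, rng)
ranks, batch_sizes = [2, 5], [3, 2]
adapters = [(rand_matrix(d_in, r, rng), rand_matrix(r, d_out, rng)) for r in ranks]
batches = [rand_matrix(n, d_in, rng) for n in batch_sizes]

fused = fused_forward(batches, w, adapters)
separate = [lora_forward(x, w, a, b) for x, (a, b) in zip(batches, adapters)]
max_diff = max(abs(u - v) for f, s in zip(fused, separate)
               for rf, rs in zip(f, s) for u, v in zip(rf, rs))
print(max_diff < 1e-9)
```

The masked version computes rank columns that are then discarded, so it wastes work in proportion to the rank mismatch between jobs. That kind of overhead is one reason naive batching of heterogeneous adapters can be slower than running jobs independently, which is the problem the abstract describes a rank-aware kernel and scheduler as addressing.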