LG · AI · Nov 17, 2025

ParaDySe: A Parallel-Strategy Switching Framework for Dynamic Sequence Lengths in Transformer

arXiv:2511.13198v1 · h-index: 3
Originality: Incremental advance
AI Analysis

ParaDySe addresses inefficiencies in LLM training for researchers and practitioners by mitigating out-of-memory (OOM) and communication-parallelization cancellation (CPC) issues. The advance is incremental because it builds on existing frameworks.

The paper tackles the inefficiency of static parallel strategies for dynamic sequence lengths in Transformer training. It proposes ParaDySe, which adaptively switches strategies per input sequence to balance memory and communication, and reports improvements on sequence lengths up to 624K.

Dynamic sequences with varying lengths are widely used in the training of Transformer-based large language models (LLMs). However, current training frameworks adopt a pre-defined static parallel strategy for these sequences, causing either communication-parallelization cancellation (CPC) on short sequences or out-of-memory (OOM) errors on long sequences. To mitigate these issues, we propose ParaDySe, a novel adaptive Parallel strategy switching framework for Dynamic Sequences. ParaDySe enables on-the-fly adoption of the optimal strategy according to the immediate input sequence. It first implements modular function libraries for parallel strategies with unified tensor layout specifications, and then builds sequence-aware memory and time cost models with hybrid methods. Guided by the cost models, ParaDySe selects optimal layer-wise strategies for dynamic sequences via an efficient heuristic algorithm. By integrating these techniques, ParaDySe achieves seamless hot-switching of optimal strategies through its well-designed function libraries. We compare ParaDySe with baselines on representative LLMs under datasets with sequence lengths up to 624K. Experimental results indicate that ParaDySe addresses OOM and CPC bottlenecks in LLM training by systematically integrating long-sequence optimizations with existing frameworks.
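The selection step described in the abstract (cost models guiding a layer-wise choice of strategy for each incoming sequence) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the linear cost models, the strategy names `comm_light` and `mem_light`, their coefficients, and the greedy fallback are hypothetical and are not taken from the paper, whose actual cost models and heuristic are not specified in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """Hypothetical parallel strategy with toy linear cost models (not from the paper)."""
    name: str
    mem_per_token: int   # assumed memory units per token, per layer
    time_fixed: int      # assumed fixed per-layer overhead (e.g. communication)
    time_per_token: int  # assumed per-token time, per layer

    def memory(self, seq_len: int) -> int:
        return self.mem_per_token * seq_len

    def time(self, seq_len: int) -> int:
        return self.time_fixed + self.time_per_token * seq_len


def select_strategies(strategies, seq_len, num_layers, mem_budget):
    """Greedy layer-wise selection (an illustrative heuristic, not ParaDySe's).

    Start with the fastest strategy on every layer; while the estimated
    memory exceeds the budget, switch one more layer to the leanest strategy.
    """
    fastest = min(strategies, key=lambda s: s.time(seq_len))
    leanest = min(strategies, key=lambda s: s.memory(seq_len))
    plan = [fastest] * num_layers
    total = fastest.memory(seq_len) * num_layers
    saving = fastest.memory(seq_len) - leanest.memory(seq_len)
    for layer in range(num_layers):
        if total <= mem_budget:
            break
        plan[layer] = leanest
        total -= saving
    if total > mem_budget:
        raise MemoryError(f"no plan fits the budget for seq_len={seq_len}")
    return plan


# Hypothetical strategies and coefficients, chosen only to show the trade-off.
STRATEGIES = [
    Strategy("comm_light", mem_per_token=4, time_fixed=10, time_per_token=1),
    Strategy("mem_light", mem_per_token=1, time_fixed=500, time_per_token=1),
]

if __name__ == "__main__":
    for seq_len in (1000, 2000):
        plan = select_strategies(STRATEGIES, seq_len, num_layers=4, mem_budget=16000)
        print(seq_len, [s.name for s in plan])
```

With these toy numbers, a short sequence keeps the low-overhead strategy on every layer, while a longer sequence pushes some layers onto the memory-saving strategy so the step fits the budget. The selection is recomputed for each input length, which is the behaviour the abstract describes as switching on the fly.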

