Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP
For researchers training large language models with long contexts, FCP addresses the inefficiency and workload imbalance caused by variable sequence lengths in existing context parallelism methods.
FCP introduces a flexible context parallelism paradigm for foundation model pretraining that shards sequences at block-level granularity and uses arbitrary peer-to-peer communication to handle variable-length sequences. It achieves a 1.13x to 2.21x improvement in attention MFU (model FLOPs utilization) and near-linear scalability on up to 256 GPUs.
Context parallelism (CP) has been widely adopted to support the growing context length in foundation model pretraining. However, existing designs fail to handle the large variation in sequence length found in training datasets, resulting in suboptimal performance. These methods either over-shard short sequences, which leads to compute inefficiency and excessive communication, or process long and short sequences separately without proper bin-packing, which causes workload imbalance. In this paper, we propose FCP, a flexible context parallelism paradigm that shards and schedules sequences at block-level granularity. Instead of relying on a rigid communication topology such as a ring, FCP enables arbitrary peer-to-peer communication, allowing flexible placement of sequence blocks across workers. By bin-packing blocks from both short and long sequences, FCP achieves both high compute efficiency and balanced workload distribution. Extensive evaluations show that FCP attains near-linear scalability on up to 256 NVIDIA GPUs, with a 1.13x to 2.21x improvement in attention MFU (model FLOPs utilization).
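To make the idea of block-level sharding and bin-packing concrete, the following is a minimal sketch and not FCP's actual scheduler. It assumes causal attention, so that the k-th block of a sequence attends to k block-sized chunks of keys and its cost is taken as k, and it treats a trailing partial block as a full one. It also swaps in a plain longest-processing-time greedy heuristic for the packing step and ignores communication cost entirely. The names `shard_into_blocks` and `pack_blocks` are hypothetical.

```python
import heapq
from typing import List, Tuple

Block = Tuple[int, int, int]  # (sequence id, block index, estimated cost)


def shard_into_blocks(seq_lens: List[int], block_size: int) -> List[Block]:
    """Split every sequence into fixed-size blocks with an estimated cost.

    Assumption: under causal attention, block b attends to blocks 0..b,
    so its relative cost is b + 1. A partial last block counts as full.
    """
    blocks: List[Block] = []
    for seq_id, length in enumerate(seq_lens):
        n_blocks = (length + block_size - 1) // block_size  # ceiling division
        for b in range(n_blocks):
            blocks.append((seq_id, b, b + 1))
    return blocks


def pack_blocks(blocks: List[Block], num_workers: int):
    """Greedy packing: place the costliest remaining block on the
    currently least-loaded worker (longest-processing-time heuristic)."""
    heap = [(0, w) for w in range(num_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment: List[List[Tuple[int, int]]] = [[] for _ in range(num_workers)]
    loads = [0] * num_workers
    for seq_id, b, cost in sorted(blocks, key=lambda blk: -blk[2]):
        load, w = heapq.heappop(heap)
        assignment[w].append((seq_id, b))
        loads[w] = load + cost
        heapq.heappush(heap, (loads[w], w))
    return assignment, loads


# One long sequence mixed with three short ones, packed onto 4 workers.
blocks = shard_into_blocks([8192, 1024, 512, 512], block_size=512)
assignment, loads = pack_blocks(blocks, num_workers=4)
```

Because blocks from the long sequence and the short sequences share one pool, a worker can hold pieces of several sequences at once. Such a placement does not map onto a fixed ring, which is why the abstract pairs block-level scheduling with arbitrary peer-to-peer communication.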