In recent years, support for long-context inputs in large language models (LLMs) has advanced rapidly. With Claude 3.5 handling 200K-token contexts and Gemini 1.5 Pro supporting contexts exceeding one million tokens, how to efficiently handle their training has become a pressing challenge for researchers and engineers alike. In this article, we take a deep dive into Ulysses Sequence Parallelism (Ulysses SP), a leading approach to sequence parallelism, and explain the mechanisms and possibilities it offers for making million-token-scale LLM training a reality.
This article is Part 1 of a three-part series titled "Efficient LLM Training: Reinforcement Learning, Long Contexts, and Small-Model Breakthroughs," and covers the technical foundations of long-context training. Part 2 will explore training efficiency improvements through reinforcement learning (RL), and Part 3 will cover innovative architectures for small models.
Why Is Long-Context Training So Difficult?
Transformer self-attention requires $O(L^2)$ computation and memory with respect to sequence length $L$. If a sequence grows from 10,000 tokens to 100,000 tokens, memory requirements theoretically balloon by a factor of 100. Tensor parallelism and pipeline parallelism are primarily methods for distributing model weights, and do not fundamentally address the explosive growth in sequence length. The idea that fills this gap is Sequence Parallelism — parallelism along the sequence dimension.
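The quadratic blow-up is easy to quantify. The sketch below is a back-of-the-envelope estimate, assuming 32 attention heads and fp16 activations (both illustrative choices, not taken from any specific model), of the memory needed just to materialize one layer's attention score matrix:

```python
def attn_score_memory_gib(seq_len: int, num_heads: int = 32,
                          bytes_per_elem: int = 2) -> float:
    """Memory for one layer's L x L attention score matrix across all heads, in GiB."""
    return num_heads * seq_len * seq_len * bytes_per_elem / 2**30

mem_10k = attn_score_memory_gib(10_000)    # ~6 GiB for a 10K-token sequence
mem_100k = attn_score_memory_gib(100_000)  # ~600 GiB for a 100K-token sequence

print(mem_100k / mem_10k)  # -> 100.0: a 10x longer sequence costs 100x the memory
```

Techniques like Flash Attention avoid materializing this matrix, but activation memory and compute still grow steeply with sequence length, which is what sequence parallelism targets.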
The Core of Ulysses Sequence Parallelism: All-to-All Communication
DeepSpeed Ulysses is a sequence parallelism method proposed by Microsoft, and its core lies in dimensional transposition of Attention using All-to-All collective communication [Source: https://huggingface.co/blog/ulysses-sp].
In standard sequence parallelism, the input sequence is divided equally among N GPUs. Each GPU holds a portion of the sequence, but since self-attention computes interactions between all tokens, information from other GPUs is required. Ulysses solves this problem through the following steps:
- Each GPU computes the Query / Key / Value for the sequence fragment it is responsible for
- All-to-All communication "transposes" the split along the sequence dimension into a split along the head dimension (converting to a form where each GPU is responsible for a specific set of heads across the full sequence)
- Each GPU computes Attention over the complete sequence for the heads it is responsible for
- A second All-to-All restores the Attention output to the original sequence-based partition
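The steps above can be simulated in a few lines of NumPy. This is a single-process sketch with toy sizes and no real communication; it only mimics what the first All-to-All achieves, namely turning a sequence-sharded Q tensor into a head-sharded one that covers the full sequence:

```python
import numpy as np

# Toy sizes (illustrative): 4 "GPUs", sequence length 8, 4 heads, head dim 2.
N, L, H, D = 4, 8, 4, 2
full_q = np.arange(L * H * D, dtype=np.float32).reshape(L, H, D)

# Step 1: sequence sharding -- GPU i holds tokens [i*L/N, (i+1)*L/N).
seq_shards = [full_q[i * (L // N):(i + 1) * (L // N)] for i in range(N)]

# Step 2: simulated All-to-All -- each GPU sends head group j of its tokens
# to GPU j, so GPU j ends up holding head group j for ALL tokens.
def all_to_all(shards):
    out = []
    for j in range(N):  # receiving GPU j
        pieces = [s[:, j * (H // N):(j + 1) * (H // N), :] for s in shards]
        out.append(np.concatenate(pieces, axis=0))
    return out

head_shards = all_to_all(seq_shards)
# Each GPU can now run Attention over the complete sequence for its heads.
assert head_shards[0].shape == (L, H // N, D)
assert np.array_equal(head_shards[1], full_q[:, 1:2, :])
```

In a real implementation this exchange is a single `torch.distributed` all-to-all collective per projection, and the second All-to-All after Attention is the same operation in reverse.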
This approach allows each GPU to hold and compute over the full sequence for the Attention heads it is assigned, enabling efficient separation of communication and computation. It is especially powerful in environments equipped with high-speed interconnects such as NVLink [Source: https://huggingface.co/blog/ulysses-sp].
Design Differences Between Ring Attention and Ulysses
Ring Attention (Liu et al., 2023) is a major competing method for sequence parallelism. Ring Attention connects GPUs in a ring topology and computes Attention by circulating K/V blocks around the ring step by step, overlapping communication with computation. Because its communication volume scales with sequence length, however, costs grow for very long sequences.
Ulysses, on the other hand, parallelizes along the head dimension, so its communication volume and maximum degree of parallelism depend on the number of attention heads. For recent models that adopt Grouped Query Attention (GQA), such as LLaMA 3 and Mistral, it is important to note that the sequence-parallel degree is effectively limited by the number of KV heads (query groups). The official HuggingFace blog recommends a hybrid strategy combining Ulysses and Ring Attention, choosing the mix based on the model's architectural characteristics and the available degree of parallelism [Source: https://huggingface.co/blog/ulysses-sp].
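To make the GQA constraint concrete, the hypothetical helper below (not part of any library) computes the largest Ulysses SP degree that evenly divides both the KV-head count and the number of GPUs. LLaMA 3 8B, for example, uses 8 KV heads:

```python
def max_ulysses_sp_degree(num_kv_heads: int, world_size: int) -> int:
    """Largest sequence-parallel degree that evenly divides both the
    KV-head count and the world size (a simplified head-sharding rule)."""
    for sp in range(min(num_kv_heads, world_size), 0, -1):
        if num_kv_heads % sp == 0 and world_size % sp == 0:
            return sp
    return 1

# With 8 KV heads on 16 GPUs, pure Ulysses tops out at SP degree 8;
# the remaining parallelism must come from Ring Attention or data parallelism.
print(max_ulysses_sp_degree(8, 16))  # -> 8
```

This is exactly the situation where the hybrid Ulysses + Ring strategy pays off: Ulysses up to the KV-head limit, Ring Attention beyond it.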
Implementation in the HuggingFace Ecosystem
HuggingFace has integrated support for Ulysses SP in conjunction with TRL and Accelerate. The need for long contexts is also growing in reinforcement learning-based training pipelines such as GRPO and PPO, making sequence parallelism an indispensable technical foundation [Source: https://huggingface.co/blog/async-rl-training-landscape].
Key points to keep in mind during implementation are as follows:
- Explicit definition of process groups: Configure sequence parallel groups and data parallel groups separately
- Combination with Flash Attention 2: Pairing sequence parallelism with kernel-level memory optimization compounds the memory savings
- Use of Gradient Checkpointing: An almost essential technique for reducing activation memory
As a benchmark result, it has been reported that a Ulysses SP configuration using 8 A100 GPUs (80 GB) can train on sequences exceeding one million tokens, a length unattainable on a single GPU, while sustaining practical throughput.
Summary and Preview of the Next Installment
Ulysses Sequence Parallelism makes million-token-scale LLM training a reality by cleverly leveraging All-to-All communication to transpose Attention computation into the head dimension. Understanding when to use it versus Ring Attention, and being aware of its constraints with GQA models, is key to success in practical applications.
In the next installment, Part 2, we will cover the cutting edge of LLM training with reinforcement learning, including lessons learned from 16 major open-source RL libraries and a detailed look at implementation patterns for asynchronous RL training.
Category: LLM | Tags: Sequence Parallelism, Long-Context Training, DeepSpeed, LLM Efficiency, HuggingFace