# Allocation Mode
This document describes AReaL’s allocation mode system, which controls how GPUs are distributed between inference and training backends during distributed RL training.
## Overview
The `allocation_mode` configuration option is a pattern-based string that specifies:

- Which backends to use for inference (SGLang, vLLM) and training (FSDP, Megatron, Archon)
- The parallelization strategy for each backend
- The total number of GPUs required
AReaL parses this string into an AllocationMode object that orchestrates resource
allocation across the cluster.
## Syntax

### Basic Format

```
<backend>:<parallelism_dims>
```
### Two-Component Format (Inference + Training)

```
<inference_backend>:<dims> + <training_backend>:<dims>
```
The + operator separates components that run on separate GPU pools.
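For example, the following strings follow these templates (the parallelism sizes here are illustrative values, not recommendations):

```
sglang:d4t2           # inference only: 4 SGLang server instances, each using 2 GPUs
sglang:d4t2+fsdp:d8   # 8 GPUs for SGLang inference plus 8 GPUs for FSDP training
```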
### Parallelism Dimensions
| Dimension | Abbreviation | Description | Valid For |
|---|---|---|---|
| Data | `d` | Number of model replicas | All backends |
| Tensor | `t` | Split operations across GPUs | All backends |
| Pipeline | `p` | Split layers across GPUs in stages | Megatron, Archon |
| Context | `c` | Split sequence length across GPUs | All backends |
| Expert | `e` | Split MoE experts across GPUs | Megatron, Archon |
Dimensions are specified as `<abbrev><size>`; for example, `d4t2` means data parallel size 4 and tensor parallel size 2.
## Calculating GPU Requirements
The total number of GPUs for a component is computed as:

```
world_size = dp × tp × pp × cp
```

Expert parallelism (`e`) does not increase the world size; it redistributes how experts are placed within the existing GPU mesh.
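As an illustration of how a dims string maps to a GPU count, here is a minimal sketch (not AReaL's actual parser) that extracts the dimension sizes and computes the implied world size:

```python
import re


def parse_dims(dims: str) -> dict:
    """Parse a dims string such as 'd4t2' or 'd2p2t4e2' into {abbrev: size}."""
    parsed = {abbrev: int(size) for abbrev, size in re.findall(r"([dtpce])(\d+)", dims)}
    # Dimensions that are not specified default to 1.
    return {k: parsed.get(k, 1) for k in "dtpce"}


def world_size(dims: str) -> int:
    """world_size = dp * tp * pp * cp; expert parallelism (e) is excluded."""
    d = parse_dims(dims)
    return d["d"] * d["t"] * d["p"] * d["c"]


print(world_size("d4t2"))      # 8
print(world_size("d2p2t4"))    # 16
print(world_size("d2p2t4e2"))  # 16 -- 'e' does not add GPUs
```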
### Examples
| Allocation Mode | Inference GPUs | Training GPUs | Total |
|---|---|---|---|
| `d4t2` | - | 8 | 8 |
| `sglang:d8` | 8 | - | 8 |
| `sglang:d4t2+fsdp:d8` | 8 | 8 | 16 |
| `sglang:d8t2+megatron:d2p2t4` | 16 | 16 | 32 |
## Backend Selection
### Inference Backends
| Backend | Supported Dimensions |
|---|---|
| `sglang` | `d`, `t`, `p` |
| `vllm` | `d`, `t`, `p` |
For inference, `d` represents the number of independent server instances, and each instance uses `t × p` GPUs.
Note that the backend-internal configuration does not affect how AReaL allocates GPUs. Given the allocation mode `sglang:d4t4`, you can also set `sglang.dp_size=4`, `sglang.ep_size=4`, and `sglang.enable_dp_attention=True`. In this case, AReaL launches 4 model replicas, each with 4 GPUs. Within each instance, SGLang will still use DP attention and expert parallelism to distribute the computation in the attention and expert layers.
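For example, the settings mentioned above could be combined along these lines (shown as dotted override-style keys mirroring the names in this note; how they are passed depends on your launcher and config setup):

```
allocation_mode=sglang:d4t4
sglang.dp_size=4
sglang.ep_size=4
sglang.enable_dp_attention=True
```

Regardless of these internal settings, AReaL allocates 4 × 4 = 16 GPUs to SGLang based on the allocation mode alone; the internal settings only change how each replica uses its 4 GPUs.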
### Training Backends
| Backend | Supported Dimensions | Use Case |
|---|---|---|
| `fsdp` | `d`, `t`, `c` | Default for simple parallelism |
| `megatron` | `d`, `t`, `p`, `c`, `e` | Required for pipeline or expert parallel |
| `archon` | `d`, `t`, `p`, `c`, `e` | Alternative to Megatron (experimental) |
When the backend is omitted, AReaL auto-selects based on the parallelism configuration:

- FSDP: used when only `d`, `t`, `c` are specified
- Megatron: used when `p > 1` or `e > 1`
```
# Equivalent forms
d4t2             # Auto-selects FSDP
fsdp:d4t2        # Explicit FSDP
d2p2t4           # Auto-selects Megatron (pp > 1)
megatron:d2p2t4  # Explicit Megatron
```
## MoE Hybrid Parallelism
For Mixture-of-Experts models, Megatron/Archon supports different parallelism strategies for attention and FFN (expert) modules using the hybrid syntax:
```
megatron:(attn:<attn_dims>|ffn:<ffn_dims>)
```
This enables MoE Parallel Folding, which reduces the minimum GPU requirement for combined context and expert parallelism.
### Constraints
- Pipeline parallel size (`p`) must be identical for `attn` and `ffn`
- The world sizes of `attn` and `ffn` must match (if `d` is omitted in `ffn`, it is derived automatically)
- Expert parallel (`e`) is only valid in the `ffn` section
### Example
```
megatron:(attn:d4p2t2c2|ffn:d2p2t4e2)
```
| Module | dp | pp | tp | cp | ep | World Size |
|---|---|---|---|---|---|---|
| attn | 4 | 2 | 2 | 2 | - | 32 |
| ffn | 2 | 2 | 4 | - | 2 | 32 |
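As a sanity check on this example, the sketch below verifies the two constraints; it assumes the `ffn` world size is the product of its `d`, `p`, `t`, and `e` sizes, which is what makes both world sizes in the table come out to 32:

```python
# Constraint check for megatron:(attn:d4p2t2c2|ffn:d2p2t4e2)
attn = {"d": 4, "p": 2, "t": 2, "c": 2}
ffn = {"d": 2, "p": 2, "t": 4, "e": 2}

attn_world = attn["d"] * attn["p"] * attn["t"] * attn["c"]  # 4 * 2 * 2 * 2 = 32
ffn_world = ffn["d"] * ffn["p"] * ffn["t"] * ffn["e"]       # 2 * 2 * 4 * 2 = 32 (e counted in place of c)

assert attn["p"] == ffn["p"], "pipeline parallel sizes must be identical"
assert attn_world == ffn_world, "attn and ffn world sizes must match"
```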
## See Also

- Fine-tuning Large MoE Models - Tutorial for Megatron backend
- Archon: PyTorch-Native Training Engine - Tutorial for Archon backend