# Allocation Mode
This document describes AReaL’s allocation mode system, which controls how GPUs are distributed between inference and training backends during distributed RL training.
## Overview
The `allocation_mode` configuration option is a pattern-based string that specifies:

- Which backends to use for inference (SGLang, vLLM) and training (FSDP, Megatron, Archon)
- The parallelization strategy for each backend
- The total number of GPUs required
AReaL parses this string into an AllocationMode object that orchestrates resource
allocation across the cluster.
## Syntax

### Basic Format

```
<backend>:<parallelism_dims>
```
### Two-Component Format (Inference + Training)

```
<inference_backend>:<dims> + <training_backend>:<dims>
```
The + operator separates components that run on separate GPU pools.
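For example, the following strings follow these templates (the parallelism sizes here are illustrative values, not recommendations):

```
sglang:d4t2           # inference only: 4 SGLang server instances, each using 2 GPUs
sglang:d4t2+fsdp:d8   # 8 GPUs for SGLang inference plus 8 GPUs for FSDP training
```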
### Parallelism Dimensions
| Dimension | Abbreviation | Description | Valid For |
|---|---|---|---|
| Data | `d` | Number of model replicas | All backends |
| Tensor | `t` | Split operations across GPUs | All backends |
| Pipeline | `p` | Split layers across GPUs in stages | Megatron, Archon |
| Context | `c` | Split sequence length across GPUs | All backends |
| Expert | `e` | Split MoE experts across GPUs | Megatron, Archon |
Dimensions are specified as `<abbrev><size>`; for example, `d4t2` means data parallel size 4 and tensor parallel size 2.
## Calculating GPU Requirements
The total number of GPUs for a component is computed as:

```
world_size = dp × tp × pp × cp
```

Expert parallelism (`e`) does not increase the world size; it redistributes how experts are placed within the existing GPU mesh.
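As an illustration of how a dims string maps to a GPU count, here is a minimal sketch (not AReaL's actual parser) that extracts the dimension sizes and computes the implied world size:

```python
import re


def parse_dims(dims: str) -> dict:
    """Parse a dims string such as 'd4t2' or 'd2p2t4e2' into {abbrev: size}."""
    parsed = {abbrev: int(size) for abbrev, size in re.findall(r"([dtpce])(\d+)", dims)}
    # Dimensions that are not specified default to 1.
    return {k: parsed.get(k, 1) for k in "dtpce"}


def world_size(dims: str) -> int:
    """world_size = dp * tp * pp * cp; expert parallelism (e) is excluded."""
    d = parse_dims(dims)
    return d["d"] * d["t"] * d["p"] * d["c"]


print(world_size("d4t2"))      # 8
print(world_size("d2p2t4"))    # 16
print(world_size("d2p2t4e2"))  # 16 -- 'e' does not add GPUs
```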
### Examples
| Allocation Mode | Inference GPUs | Training GPUs | Total |
|---|---|---|---|
| `d4t2` | - | 8 | 8 |
| `sglang:d8` | 8 | - | 8 |
| `sglang:d4t2+fsdp:d8` | 8 | 8 | 16 |
| `sglang:d8t2+megatron:d2p2t4` | 16 | 16 | 32 |
## Backend Selection
### Inference Backends
| Backend | Supported Dimensions |
|---|---|
| `sglang` | `d`, `t`, `p` |
| `vllm` | `d`, `t`, `p` |
For inference, `d` represents the number of independent server instances, and each instance uses `t × p` GPUs.
Note that the backend-internal configuration does not affect how AReaL allocates GPUs. Given the allocation mode `sglang:d4t4`, you can also set `sglang.dp_size=4`, `sglang.ep_size=4`, and `sglang.enable_dp_attention=True`. In this case, AReaL launches 4 model replicas, each with 4 GPUs. Within each instance, SGLang will still use DP attention and expert parallelism to distribute the computation in the attention and expert layers.
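For example, the settings mentioned above could be combined along these lines (shown as dotted override-style keys mirroring the names in this note; how they are passed depends on your launcher and config setup):

```
allocation_mode=sglang:d4t4
sglang.dp_size=4
sglang.ep_size=4
sglang.enable_dp_attention=True
```

Regardless of these internal settings, AReaL allocates 4 × 4 = 16 GPUs to SGLang based on the allocation mode alone; the internal settings only change how each replica uses its 4 GPUs.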
### Training Backends
| Backend | Supported Dimensions | Use Case |
|---|---|---|
| `fsdp` | `d`, `t`, `c` | Default for simple parallelism |
| `megatron` | `d`, `t`, `p`, `c`, `e` | Required for pipeline or expert parallel |
| `archon` | `d`, `t`, `p`, `c`, `e` | Alternative to Megatron (experimental) |
When the backend is omitted, AReaL auto-selects based on the parallelism configuration:

- FSDP: used when only `d`, `t`, `c` are specified
- Megatron: used when `p > 1` or `e > 1`
```
# Equivalent forms
d4t2             # Auto-selects FSDP
fsdp:d4t2        # Explicit FSDP
d2p2t4           # Auto-selects Megatron (pp > 1)
megatron:d2p2t4  # Explicit Megatron
```
## MoE Hybrid Parallelism
For Mixture-of-Experts models, Megatron/Archon supports different parallelism strategies for attention and FFN (expert) modules using the hybrid syntax:
```
megatron:(attn:<attn_dims>|ffn:<ffn_dims>)
```
This enables MoE Parallel Folding, which reduces the minimum GPU requirement for combined context and expert parallelism.
### Constraints
- Pipeline parallel size (`p`) must be identical for `attn` and `ffn`
- The world sizes of `attn` and `ffn` must match (if `d` is omitted in `ffn`, it is derived automatically)
- Expert parallel (`e`) is only valid in the `ffn` section
### Example
```
megatron:(attn:d4p2t2c2|ffn:d2p2t4e2)
```
| Module | dp | pp | tp | cp | ep | World Size |
|---|---|---|---|---|---|---|
| attn | 4 | 2 | 2 | 2 | - | 32 |
| ffn | 2 | 2 | 4 | - | 2 | 32 |
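As a sanity check on this example, the sketch below verifies the two constraints; it assumes the `ffn` world size is the product of its `d`, `p`, `t`, and `e` sizes, which is what makes both world sizes in the table come out to 32:

```python
# Constraint check for megatron:(attn:d4p2t2c2|ffn:d2p2t4e2)
attn = {"d": 4, "p": 2, "t": 2, "c": 2}
ffn = {"d": 2, "p": 2, "t": 4, "e": 2}

attn_world = attn["d"] * attn["p"] * attn["t"] * attn["c"]  # 4 * 2 * 2 * 2 = 32
ffn_world = ffn["d"] * ffn["p"] * ffn["t"] * ffn["e"]       # 2 * 2 * 4 * 2 = 32 (e counted in place of c)

assert attn["p"] == ffn["p"], "pipeline parallel sizes must be identical"
assert attn_world == ffn_world, "attn and ffn world sizes must match"
```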
## See Also

- Fine-tuning Large MoE Models - Tutorial for Megatron backend
- Archon: PyTorch-Native Training Engine - Tutorial for Archon backend