Fine-tuning Large MoE Models
Compared to PyTorch FSDP, Megatron-LM supports full 5D parallelism, delivering better scaling and efficiency. AReaL fully supports customized RL training with Megatron-LM as the backend. This guide explains how to harness the Megatron training backend and train large MoE models for your application.
Enabling Megatron in allocation_mode
Switching from FSDP to Megatron requires only a single-line change: set the
allocation_mode field from sglang:d4+fsdp:d4 to sglang:d4+megatron:d4.
If the backend name is omitted, AReaL has internal logic to infer it: when neither pipeline parallelism nor expert parallelism is enabled, FSDP is used as the backend; otherwise, Megatron is used. However, we encourage specifying the backend name explicitly, as above.
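For example, using the same Ray launcher command shown at the end of this guide (the config file name here is a placeholder):

# FSDP training backend
python3 -m areal.launcher.ray examples/math/gsm8k_rl.py --config <config.yaml> \
    allocation_mode=sglang:d4+fsdp:d4

# Megatron training backend: only the allocation_mode value changes
python3 -m areal.launcher.ray examples/math/gsm8k_rl.py --config <config.yaml> \
    allocation_mode=sglang:d4+megatron:d4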
Understanding allocation_mode
The allocation mode is defined in
areal/api/alloc_mode.py.
The allocation mode is a pattern-based string option that tells AReaL how to parallelize
models across GPUs in training and inference backends. When running the experiment,
AReaL converts the string option into an AllocationMode object that stores the backend
choice and parallel strategy for each model. For a simple example,
sglang:d4+megatron:t4 configures AReaL to use the SGLang backend with data
parallel size 4 and the Megatron training backend with tensor parallel size 4.
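To make the format concrete, here is a minimal, self-contained Python sketch of how such a flat spec decomposes. This is purely illustrative: the actual parser lives in areal/api/alloc_mode.py, returns an AllocationMode object, and also supports the grouped MoE syntax described below.

import re

def parse_alloc_mode(spec: str) -> dict:
    # Toy parser for flat specs like "sglang:d4+megatron:t4".
    # Illustrative only; does not handle the grouped (attn:...|ffn:...) syntax.
    result = {}
    for part in spec.split("+"):  # one entry per backend
        backend, strategy = part.split(":", 1)
        # Map each single-letter dimension (d/t/p/c/e) to its parallel size.
        result[backend] = {
            dim: int(size) for dim, size in re.findall(r"([dtpce])(\d+)", strategy)
        }
    return result

print(parse_alloc_mode("sglang:d4+megatron:t4"))
# {'sglang': {'d': 4}, 'megatron': {'t': 4}}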
Training Parallel Strategy
For a dense model, there are only 4 available parallel dimensions: data parallel (DP,
d), tensor parallel (TP, t), pipeline parallel (PP, p), and context parallel (CP, c).
The number following each single-letter abbreviation gives the size of that parallel
dimension. For example, megatron:d2t4p2c2 describes a 32-GPU parallel strategy with
DP size 2, TP size 4, PP size 2, and CP size 2.
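The total number of GPUs a strategy occupies is the product of all its parallel sizes. A quick sanity check (plain Python, illustrative only):

import math
import re

def gpus_required(strategy: str) -> int:
    # Product of all parallel sizes in a flat strategy string,
    # e.g. "d2t4p2c2" -> 2 * 4 * 2 * 2 = 32.
    return math.prod(int(size) for _, size in re.findall(r"([dtpce])(\d+)", strategy))

assert gpus_required("d2t4p2c2") == 32  # DP 2 x TP 4 x PP 2 x CP 2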
For MoE models, the AReaL allocation mode supports separate parallel strategies for
expert modules and attention modules, which is related to the
MoE Parallel Folding
feature in Megatron. It reduces the minimum number of GPUs required to enable both
context and expert parallelism (EP, e), and allows different TP sizes for attention and
expert modules for better efficiency. The parallel strategies for attention and expert
modules are denoted by the attn: and ffn: prefixes and separated by |. For example,
megatron:(attn:d1p4t2c2|ffn:d1p4t1e4) describes a 16-GPU parallel strategy with a
shared PP size of 4, where attention modules use DP size 1, TP size 2, and CP size 2,
and expert modules use DP size 1, TP size 1, and EP size 4.
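Because the attention and expert sub-strategies are folded onto the same set of GPUs, with the PP size shared between them, their total GPU counts must match. Reusing the illustrative gpus_required helper from above:

# Both halves of megatron:(attn:d1p4t2c2|ffn:d1p4t1e4) must cover the
# same 16 GPUs; the PP size is shared across attention and expert modules.
assert gpus_required("d1p4t2c2") == 16  # attention: DP 1 x PP 4 x TP 2 x CP 2
assert gpus_required("d1p4t1e4") == 16  # experts:   DP 1 x PP 4 x TP 1 x EP 4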
Inference Parallel Strategy
The optimal parallel strategy is usually different for training and inference.
Inference parallel strategies only accept DP, TP, and PP, e.g., vllm:d2t4. Note that
the DP degree is the number of independent inference instances to deploy; for example,
vllm:d2t4 launches 2 independent vLLM instances, each with TP size 4, occupying 16
GPUs in total. Other parallelism options are passed through the sglang and vllm fields
in the configuration, e.g.,
sglang:
ep_size: 2
dp_size: 4
enable_dp_attention: true
...
Note that the above configuration controls the internal hybrid parallelism strategy within each inference instance, e.g., DP attention. These techniques are usually not orthogonal to the DP, TP, and PP dimensions that determine GPU allocation. We refer readers to the large-scale EP deployment guides of SGLang and vLLM for detailed information.
Aligning Inference and Training Precision
Due to the sparse nature of MoE models, the logits calculated by forward passes during
inference and training could be severely misaligned, leading to unstable training
results. To mitigate this instability, it is highly recommended to set
actor.megatron.use_deterministic_algorithms=True to disable nondeterministic
calculations in Megatron, although this may cause a ~10-20% slowdown in training steps.
As an example, you can run GRPO on the Qwen3 30B-A3B MoE model and the GSM8K dataset (on a 32-GPU Ray cluster) directly with the following command:
# NOTE: Allocation mode here is only for illustration purposes. It is not optimized.
python3 -m areal.launcher.ray examples/math/gsm8k_rl.py --config <megatron_config.yaml> \
    experiment_name=megatron-moe-gsm8k-grpo trial_name=trial-0 \
    allocation_mode='sglang:d4t4+megatron:(attn:d1p4t2c2|ffn:d1p4t1e4)' \
    cluster.n_nodes=4 cluster.n_gpus_per_node=8 actor.path=Qwen/Qwen3-30B-A3B \
    actor.megatron.use_deterministic_algorithms=True