# Asynchronous RL
AReaL natively supports asynchronous RL training, enabling overlapped rollout generation and model training on disaggregated GPUs. This architecture maximizes GPU utilization by running inference and training concurrently.
Note: This guide applies to all algorithms when asynchronous training is enabled (i.e., `rollout.max_head_offpolicyness > 0`).
## Overview
Traditional online RL algorithms assume synchronous execution: the model generates rollouts, trains on them, and repeats. While simple, this approach leaves GPUs idle whenever rollout generation is long, and it scales poorly.
Asynchronous RL breaks this constraint by overlapping rollout generation and training. However, this introduces off-policyness: the policy version generating rollouts may lag behind the training version. To maximize inference throughput, AReaL also supports partial rollouts, where a single trajectory can be segmented across multiple policy versions.
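Conceptually, each generated segment carries the version of the policy weights that produced it, so a trajectory's staleness can be measured against the trainer. The following is a minimal sketch of this idea (the class and field names are illustrative, not AReaL's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    token_ids: list[int]
    policy_version: int  # weight version that generated these tokens

@dataclass
class Trajectory:
    segments: list[Segment] = field(default_factory=list)

    def staleness(self, trainer_version: int) -> int:
        # A partial rollout may span several policy versions;
        # staleness is measured from the oldest segment.
        oldest = min(s.policy_version for s in self.segments)
        return trainer_version - oldest

# A trajectory whose generation straddled one weight update:
traj = Trajectory([Segment([101, 102, 103], policy_version=6),
                   Segment([104, 105], policy_version=7)])
print(traj.staleness(trainer_version=8))  # -> 2
```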
## Key Techniques
AReaL addresses these algorithmic challenges with two complementary techniques:
### 1. Off-Policyness Control
Limit how stale rollouts can be relative to the current training policy:
```yaml
rollout:
  max_head_offpolicyness: 4  # Allow up to 4 version steps behind
```
Configuration tips:
- Set to `0` for synchronous RL (useful for debugging or baseline comparisons)
- Higher values increase throughput but may reduce training stability
- Typical range: 2-8, depending on model size and update frequency
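To make the effect of this bound concrete, here is a hedged sketch of how a rollout worker can gate generation on the version gap (the function names and polling loop are assumptions for illustration, not AReaL internals):

```python
import asyncio

MAX_HEAD_OFFPOLICYNESS = 4  # mirrors rollout.max_head_offpolicyness

async def rollout_loop(get_trainer_version, get_worker_version, generate):
    """Submit rollout requests only while the worker's weights are at
    most MAX_HEAD_OFFPOLICYNESS versions behind the trainer."""
    while True:
        lag = get_trainer_version() - get_worker_version()
        if lag > MAX_HEAD_OFFPOLICYNESS:
            # Too stale: wait for a weight sync rather than generating
            # data the trainer would have to discard.
            await asyncio.sleep(0.1)
            continue
        await generate()  # generate one trajectory (or segment)
```

With `max_head_offpolicyness: 0`, the loop generates only when worker and trainer versions match, recovering synchronous behavior.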
### 2. Decoupled PPO Objective
Handle off-policy data with modified loss computation:
```yaml
actor:
  use_decoupled_loss: true   # Enable decoupled PPO objective
  recompute_logprobs: true   # Recompute logprobs during training
```
Configuration options:
- `use_decoupled_loss`: When `false`, uses standard PPO/GRPO objectives
- `recompute_logprobs`: When `false`, reuses logprobs from the inference backend. Note: must be `true` when `use_decoupled_loss` is enabled
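For intuition, below is a simplified PyTorch sketch based on the general decoupled-PPO formulation (consult the AReaL paper for the exact objective used in training). The proximal logprobs are the ones recomputed at training time (hence `recompute_logprobs: true`), while the behavior logprobs come from the inference backend:

```python
import torch

def decoupled_ppo_loss(logp_new, logp_prox, logp_behav, advantages, eps=0.2):
    # Detached importance weight correcting for the gap between the
    # behavior policy (inference backend) and the proximal policy
    # (recomputed at the start of the training step).
    iw = torch.exp(logp_prox - logp_behav).detach()
    # Standard PPO clipping, taken around the proximal policy rather
    # than the behavior policy.
    ratio = torch.exp(logp_new - logp_prox.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -(iw * torch.minimum(unclipped, clipped)).mean()
```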
## References
- For a practical walkthrough of asynchronous training, see our GSM8K GRPO example.
- For algorithmic details and empirical analysis, refer to the AReaL paper.