Diagnosing RL Performance#

This guide helps you diagnose and resolve common performance issues in reinforcement learning training. Use the strategies below to identify bottlenecks, tune hyperparameters, and optimize your RL workflows.

Using Synchronous RL Instead of Asynchronous Training#

If you suspect that asynchronous RL training is hurting learning performance, or if you want to debug a new agentic application, you can switch to standard synchronous RL training with the following configuration:

rollout:
  max_head_offpolicyness: 0  # 0 implies synchronous training
actor:
  recompute_logprob: false  # use logprobs returned by inference backend
  use_decoupled_loss: false  # reverts to the original PPO loss

For detailed information about these configurations, see our asynchronous RL guide and CLI reference.

Training Rewards Not Increasing#

This is a common issue with many possible causes. We recommend the following diagnostic steps:

  1. Establish a baseline: Run evaluation on the test set to measure baseline performance before training. AReaL allows zero-code changes between training and evaluation, so you can reuse your training code for evaluation. See Evaluation Guide for details.

  2. Test on simpler data: Run RL training on the test set instead of the training set to verify whether rewards increase.

  3. If rewards don’t increase on the test set: Tune your hyperparameters (e.g., increase the batch size or learning rate) or switch to a different base model. This usually indicates the task is too difficult for your current model, so consider applying SFT first.

  4. If rewards increase on the test set but not the training set: Inspect the quality and difficulty of your training data. Ensure the distributions match and that the difficulty is appropriate for your base model. You can enable dynamic filtering (similar to DAPO) by passing a should_accept_fn parameter to prepare_batch so that task difficulty remains appropriate at runtime; a minimal sketch follows this list. See our detailed code walk-through for more information.
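
For a concrete picture, here is a minimal sketch of such a filter. The callback signature and the trajectory/reward field names are illustrative assumptions, not AReaL's exact API; only the idea of passing should_accept_fn to prepare_batch comes from this guide.

def should_accept(sample) -> bool:
    # DAPO-style filter: keep a prompt only if its sampled trajectories are
    # neither all solved nor all failed, so the group still carries a useful
    # learning signal. `sample.trajectories` and `traj.reward` are assumed names.
    rewards = [traj.reward for traj in sample.trajectories]
    num_solved = sum(r > 0 for r in rewards)
    return 0 < num_solved < len(rewards)

# Illustrative call site (the actual prepare_batch import path depends on your setup):
#   batch = prepare_batch(train_dataloader, should_accept_fn=should_accept)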

Important Metrics to Monitor#

Monitoring these metrics helps ensure stable training and diagnose issues early.

Reward Metrics#

Metric | Description
eval-rollout/reward | Reward on the test set. Primary indicator of model generalization.
rollout/reward | Reward on the training set. Tracks learning progress during training.
ppo/actor/task_reward | Reward of trajectories actually used for training. Differs from rollout/reward when dynamic filtering is enabled: filtered trajectories are excluded here but still count in rollout/reward.

Troubleshooting high variance in task_reward:

If task_reward fluctuates significantly, your training has high variance. Increasing the batch size, if resources allow, is often an effective remedy.

Importance Weight Metrics#

We recommend asynchronous training with decoupled PPO loss (use_decoupled_loss=true) for optimal throughput. The two importance weight metrics below are critical for monitoring training stability—their average values should remain very close to 1.0.

With use_decoupled_loss=true, the loss function distinguishes between three policies:

  • π_behave: Behavior policy that generated the samples during rollout

  • π_proximal: Proximal policy, one training step behind the current policy

  • π_θ: Current policy being optimized

The decoupled PPO loss combines two importance weights:

\[
L = -\mathbb{E}\left[ \underbrace{\frac{\pi_{\text{proximal}}}{\pi_{\text{behave}}}}_{\text{behave\_imp\_weight}} \cdot \min\left( \underbrace{\frac{\pi_\theta}{\pi_{\text{proximal}}}}_{\text{importance\_weight}} A,\ \text{clip}\left(\frac{\pi_\theta}{\pi_{\text{proximal}}},\ 1-\epsilon,\ 1+\epsilon\right) A \right) \right]
\]
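
To make the two ratios concrete, here is a minimal PyTorch sketch of this objective (not AReaL's actual implementation). It assumes per-token log-probabilities from the current, proximal, and behavior policies plus a matching advantage tensor.

import torch

def decoupled_ppo_loss(logp_theta, logp_proximal, logp_behave, advantages, eps=0.2):
    # importance_weight: pi_theta / pi_proximal, used for PPO clipping.
    importance_weight = torch.exp(logp_theta - logp_proximal)
    # behave_imp_weight: pi_proximal / pi_behave, corrects for stale rollout samples.
    # Both log-prob tensors are stored/recomputed constants, so no gradient flows here.
    behave_imp_weight = torch.exp(logp_proximal - logp_behave).detach()
    clipped = torch.clamp(importance_weight, 1 - eps, 1 + eps)
    surrogate = torch.minimum(importance_weight * advantages, clipped * advantages)
    return -(behave_imp_weight * surrogate).mean()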

Metric | Formula | Description
ppo_actor/update/importance_weight | π_θ / π_proximal | Ratio for PPO clipping between current and proximal policies
ppo_actor/update/behave_imp_weight | π_proximal / π_behave | Off-policy correction for distribution mismatch in async training

Troubleshooting importance_weight deviations:

  • If importance_weight/avg deviates significantly from 1, reduce ppo_n_minibatches.

  • In theory, importance_weight should equal exactly 1 when ppo_n_minibatches == 1.

  • If deviation persists at ppo_n_minibatches == 1 (common in MoE training), add actor.megatron.use_deterministic_algorithms=1 to your config.

Troubleshooting behave_imp_weight deviations:

  • Ensure behave_imp_weight_cap is set (recommended value: 5).

  • If deviation persists, reduce max_head_offpolicyness to decrease sample staleness.

Sequence Length Metrics#

When training with long trajectories, monitor these metrics to detect truncation issues:

Metric | Description
ppo_actor/no_eos_ratio | Fraction of trajectories truncated before generating an EOS token
ppo_actor/seq_len | Average sequence length during training

Troubleshooting high no_eos_ratio:

If no_eos_ratio exceeds 0.05 (5% of trajectories truncated):

  • Increase max_new_tokens to allow longer generations

  • Use dynamic filtering to exclude overly long trajectories (see the sketch below)
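
Because dynamic filtering goes through the same should_accept_fn hook described earlier, a truncation filter can be sketched as follows. How a trajectory records "no EOS was generated" (here a hypothetical truncated flag) depends on your rollout workflow, so the field names are assumptions.

def reject_truncated(sample) -> bool:
    # Drop the whole group if any trajectory ran out of generation budget
    # before emitting EOS. `sample.trajectories` and `traj.truncated` are
    # assumed field names for illustration.
    return not any(traj.truncated for traj in sample.trajectories)

# Illustrative call site:
#   batch = prepare_batch(train_dataloader, should_accept_fn=reject_truncated)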

Monitoring sequence length growth:

If seq_len increases steadily during training, watch no_eos_ratio closely—growing sequence lengths often lead to more truncations.