Troubleshooting#

If the following content does not address your issue, feel free to open a GitHub issue.

Automatic Recovery#

When recover_mode=auto is set and the experiment configuration remains unchanged, AReaL will attempt to discover the previous run's recovery checkpoints and resume the experiment from them.
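
For illustration, a minimal sketch of enabling this, assuming the Hydra-style key=value overrides used elsewhere in this guide; the entry-point script below is a placeholder, not AReaL's actual launcher:

```bash
# Placeholder entry point; substitute your actual training script or launcher.
# Reusing the same experiment_name and trial_name lets AReaL locate the old run.
python3 train.py \
  experiment_name=my-exp \
  trial_name=my-trial \
  recover_mode=auto
```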

Recovery Failure Causes#

If automatic recovery fails, check the following possibilities:

Configuration Changes (see the example after this list):

  • The experiment_name or trial_name in the training script differs from the previous run

  • The batch size (dataset.train_bs_n_seqs parameter) has changed

  • The group size (group_size parameter) has changed

  • The number of nodes (n_nodes parameter) has changed
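
If any of the settings above changed, recovery will not match the previous run. For reference, a hedged sketch of the overrides that must stay identical when relaunching (the entry point and all values are placeholders):

```bash
# The values below are placeholders; what matters is that they match the
# previous run exactly when relaunching with recover_mode=auto.
python3 train.py \
  experiment_name=my-exp \
  trial_name=my-trial \
  dataset.train_bs_n_seqs=512 \
  group_size=8 \
  n_nodes=4 \
  recover_mode=auto
```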

Missing Recovery Checkpoints: By default, a recovery checkpoint is written under two conditions:

  • After the second training step completes

  • When a step completes and more than 600 seconds have passed since the last recovery checkpoint (controlled by exp_ctrl.ckpt_freq_secs=600; see the example after this list)
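
To write recovery checkpoints more or less often, override exp_ctrl.ckpt_freq_secs. A minimal sketch, again assuming key=value overrides and a placeholder entry point (300 seconds is only an example value):

```bash
# Write a recovery checkpoint when a step completes and at least 300 seconds
# have passed since the last one (the default interval is 600 seconds).
python3 train.py \
  experiment_name=my-exp \
  trial_name=my-trial \
  exp_ctrl.ckpt_freq_secs=300
```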

Verify Recovery Checkpoint Creation#

You can confirm whether a recovery checkpoint was generated by searching the logs for the following message:

(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-11:52:02.760 master worker INFO: Dumped recover info to file.
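
Assuming your launcher writes worker output to log files, a simple grep is enough to check for this message; the log path below is a placeholder:

```bash
# Replace the path with the directory holding your experiment's worker logs.
grep -r "Dumped recover info to file" /path/to/experiment/logs/
```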

Memory Issues#

torch.cuda.OutOfMemoryError#

The key to resolving this issue is identifying the phase where the error occurs:

During Initialization#

  • Check for idle processes on the GPU

  • Distributed scenarios: Restart the Ray cluster

  • Single-machine scenarios: Use pkill to terminate the leftover processes (example commands after this list)
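
The corresponding shell commands might look like the sketch below; the process-name pattern passed to pkill is a placeholder, so verify it against the nvidia-smi output before killing anything:

```bash
# 1. Check whether stale processes are still holding GPU memory.
nvidia-smi

# 2. Distributed setup: restart the Ray cluster on every node.
ray stop
ray start --head                       # on the head node
# ray start --address=<head-ip>:6379   # on each worker node

# 3. Single machine: kill the leftover training processes by name.
#    The pattern is a placeholder; match it to what nvidia-smi shows.
pkill -9 -f my_training_script
```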

During SGLang Generation#

  • Decrease the actor.sglang.mem_fraction_static parameter

  • Increase the tensor parallelism degree

  • Decrease the max_concurrent_rollouts parameter for asynchronous RL (see the overrides sketched after this list)
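
A hedged example of the relevant overrides, with placeholder values to tune for your model and GPUs; raising the tensor parallelism degree is done through allocation_mode, whose exact format depends on your setup and is not shown here:

```bash
# Reserve a smaller fraction of GPU memory for SGLang's static allocation and
# cap the number of concurrent rollouts. Both values are placeholders.
python3 train.py \
  actor.sglang.mem_fraction_static=0.7 \
  max_concurrent_rollouts=64
```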

During actor_inf or actor_train#

  • Adjust the microbatch size: Decrease {actor_train|actor_inf}.mb_spec.max_tokens_per_mb from its current value (e.g., 20480). This parameter limits the number of tokens per forward/backward pass and can be set as low as the maximum sequence length, including the prompt (see the sketch after this list)

  • Modify parallelism strategy: Adjust allocation_mode by:

    • Reducing data parallelism

    • Increasing tensor or pipeline parallelism

    • Preferring pipeline parallelism over tensor parallelism
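
The parallelism strategy itself is changed through allocation_mode and depends on your cluster layout, so it is not sketched here. Lowering the microbatch token budget, however, is a single override; the values below are placeholders and can go as low as your maximum sequence length:

```bash
# Cap the number of tokens processed per forward/backward pass for both
# the training and inference passes of the actor.
python3 train.py \
  actor_train.mb_spec.max_tokens_per_mb=16384 \
  actor_inf.mb_spec.max_tokens_per_mb=16384
```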

CUDA Error: Out of Memory#

This issue may occur during data transfer. Try increasing mem_per_model_worker in the CLI arguments.
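
A minimal sketch of raising this limit, assuming the same key=value override style; the value below is a placeholder, so check your version's configuration schema for the expected unit:

```bash
# Placeholder value; the expected unit depends on your AReaL version's
# scheduling configuration.
python3 train.py \
  mem_per_model_worker=100000
```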