Troubleshooting#

If the following content does not address your issue, feel free to open a GitHub issue.

Automatic Recovery#

When recover_mode=auto is set and the experiment configuration remains unchanged, AReaL will attempt to discover the previous run's recovery checkpoints and resume the experiment from them.
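
For illustration, a minimal sketch of enabling this, assuming the Hydra-style key=value overrides used elsewhere in this guide; the entry-point script below is a placeholder, not AReaL's actual launcher:

```bash
# Placeholder entry point; substitute your actual training script or launcher.
# Reusing the same experiment_name and trial_name lets AReaL locate the old run.
python3 train.py \
  experiment_name=my-exp \
  trial_name=my-trial \
  recover_mode=auto
```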

Recovery Failure Causes#

If automatic recovery fails, check the following possibilities:

Configuration Changes (see the example after this list):

  • The experiment_name or trial_name in the training script differs from the previous run

  • The batch size (dataset.train_bs_n_seqs parameter) has changed

  • The group size (group_size parameter) has changed

  • The number of nodes (n_nodes parameter) has changed
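
If any of the settings above changed, recovery will not match the previous run. For reference, a hedged sketch of the overrides that must stay identical when relaunching (the entry point and all values are placeholders):

```bash
# The values below are placeholders; what matters is that they match the
# previous run exactly when relaunching with recover_mode=auto.
python3 train.py \
  experiment_name=my-exp \
  trial_name=my-trial \
  dataset.train_bs_n_seqs=512 \
  group_size=8 \
  n_nodes=4 \
  recover_mode=auto
```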

Missing Recovery Checkpoints: By default, a recovery checkpoint is written under two conditions:

  • After the second training step completes

  • When a step completes and more than 600 seconds have passed since the last recovery checkpoint (controlled by exp_ctrl.ckpt_freq_secs=600; see the example after this list)
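
To write recovery checkpoints more or less often, override exp_ctrl.ckpt_freq_secs. A minimal sketch, again assuming key=value overrides and a placeholder entry point (300 seconds is only an example value):

```bash
# Write a recovery checkpoint when a step completes and at least 300 seconds
# have passed since the last one (the default interval is 600 seconds).
python3 train.py \
  experiment_name=my-exp \
  trial_name=my-trial \
  exp_ctrl.ckpt_freq_secs=300
```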

Verify Recovery Checkpoint Creation#

You can confirm whether a recovery checkpoint was generated by searching the logs for the following message:

(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-11:52:02.760 master worker INFO: Dumped recover info to file.
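
Assuming your launcher writes worker output to log files, a simple grep is enough to check for this message; the log path below is a placeholder:

```bash
# Replace the path with the directory holding your experiment's worker logs.
grep -r "Dumped recover info to file" /path/to/experiment/logs/
```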

Memory Issues#

torch.cuda.OutOfMemoryError#

The key to resolving this issue is identifying the phase where the error occurs:

During Initialization#

  • Check for idle processes on the GPU

  • Distributed scenarios: Restart the Ray cluster

  • Single-machine scenarios: Use pkill to terminate the leftover processes (example commands after this list)
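
The corresponding shell commands might look like the sketch below; the process-name pattern passed to pkill is a placeholder, so verify it against the nvidia-smi output before killing anything:

```bash
# 1. Check whether stale processes are still holding GPU memory.
nvidia-smi

# 2. Distributed setup: restart the Ray cluster on every node.
ray stop
ray start --head                       # on the head node
# ray start --address=<head-ip>:6379   # on each worker node

# 3. Single machine: kill the leftover training processes by name.
#    The pattern is a placeholder; match it to what nvidia-smi shows.
pkill -9 -f my_training_script
```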

During SGLang Generation#

  • Decrease the actor.sglang.mem_fraction_static parameter

  • Increase the tensor parallelism degree

  • Decrease the max_concurrent_rollouts parameter for asynchronous RL (see the overrides sketched after this list)
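
A hedged example of the relevant overrides, with placeholder values to tune for your model and GPUs; raising the tensor parallelism degree is done through allocation_mode, whose exact format depends on your setup and is not shown here:

```bash
# Reserve a smaller fraction of GPU memory for SGLang's static allocation and
# cap the number of concurrent rollouts. Both values are placeholders.
python3 train.py \
  actor.sglang.mem_fraction_static=0.7 \
  max_concurrent_rollouts=64
```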

During actor_inf or actor_train#

  • Adjust the microbatch size: Decrease {actor_train|actor_inf}.mb_spec.max_tokens_per_mb from its current value (e.g., 20480). This parameter limits the number of tokens per forward/backward pass and can be set as low as the maximum sequence length, including the prompt (see the sketch after this list)

  • Modify parallelism strategy: Adjust allocation_mode by:

    • Reducing data parallelism

    • Increasing tensor or pipeline parallelism

    • Preferring pipeline parallelism over tensor parallelism
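
The parallelism strategy itself is changed through allocation_mode and depends on your cluster layout, so it is not sketched here. Lowering the microbatch token budget, however, is a single override; the values below are placeholders and can go as low as your maximum sequence length:

```bash
# Cap the number of tokens processed per forward/backward pass for both
# the training and inference passes of the actor.
python3 train.py \
  actor_train.mb_spec.max_tokens_per_mb=16384 \
  actor_inf.mb_spec.max_tokens_per_mb=16384
```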

CUDA Error: Out of Memory#

This issue may occur during data transfer. Try increasing mem_per_model_worker in the CLI arguments.
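
A minimal sketch of raising this limit, assuming the same key=value override style; the value below is a placeholder, so check your version's configuration schema for the expected unit:

```bash
# Placeholder value; the expected unit depends on your AReaL version's
# scheduling configuration.
python3 train.py \
  mem_per_model_worker=100000
```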