# Troubleshooting
If the following content does not address your issue, feel free to raise a GitHub Issue.
## Automatic Recovery
When `recover_mode=auto` is set and the experiment configuration remains unchanged, AReaL attempts to discover previous checkpoints and recover the experiment from them, as in the sketch below.
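A minimal relaunch sketch, assuming the same CLI override style as the other parameters in this guide; `<your_training_entrypoint>`, `my-exp`, and `trial-0` are placeholders, not AReaL names:

```bash
# experiment_name and trial_name must match the previous run exactly,
# otherwise AReaL cannot locate its recovery checkpoints.
python3 <your_training_entrypoint> \
    experiment_name=my-exp \
    trial_name=trial-0 \
    recover_mode=auto
```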
### Recovery Failure Causes

If automatic recovery fails, check the following possibilities:

**Configuration Changes:**

- The `experiment_name` and `trial_name` in the training script differ from the previous run
- Changes in batch size (the `dataset.train_bs_n_seqs` parameter)
- Changes in group size (the `group_size` parameter)
- Changes in the number of nodes (the `n_nodes` parameter)
**Missing Recovery Checkpoints:** Recovery checkpoints are generated under two conditions by default:

- After completion of the second step
- When a step completes and more than 600 seconds have passed since the last recovery checkpoint (controlled by `exp_ctrl.ckpt_freq_secs=600`; see the example after this list for shortening the interval)
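If steps are long or crashes happen early, you can lower this interval so recovery checkpoints are written more often. A sketch, assuming the same CLI override style as above; the entrypoint and the value `300` are illustrative placeholders:

```bash
# Write a recovery checkpoint at most every 300 seconds instead of the default 600.
python3 <your_training_entrypoint> \
    recover_mode=auto \
    exp_ctrl.ckpt_freq_secs=300
```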
### Verify Recovery Checkpoint Creation

You can confirm whether a recovery checkpoint was generated by searching for the following message in the logs:

```
(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-11:52:02.760 master worker INFO: Dumped recover info to file.
```
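One way to check is to grep the log directory for that message; `<log_dir>` is a placeholder, since the actual location depends on where your experiment's logs are stored:

```bash
# Search all worker logs for the recovery-checkpoint confirmation message.
grep -r "Dumped recover info to file" <log_dir>
```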
## Memory Issues

### torch.cuda.OutOfMemoryError

The key to resolving this issue is identifying the phase in which the error occurs.

#### During Initialization

- Check for idle processes occupying the GPU
- Distributed scenarios: restart the Ray cluster
- Single-machine scenarios: use `pkill` to terminate the leftover processes (see the example after this list)
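A sketch of the cleanup steps, assuming a standard Ray installation; the `pkill` pattern is a placeholder, so verify the matched processes with `ps` before killing them:

```bash
# Inspect GPU occupancy to spot leftover processes.
nvidia-smi

# Distributed: stop Ray on each node, then re-launch the cluster
# with your usual `ray start` arguments.
ray stop

# Single machine: terminate leftover training processes by name (placeholder pattern).
pkill -f <your_training_entrypoint>
```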
#### During SGLang Generation

- Decrease the `actor.sglang.mem_fraction_static` parameter
- Increase the tensor parallelism degree
- For asynchronous RL, decrease the `max_concurrent_rollouts` parameter (see the example after this list)
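For example, the following overrides reduce SGLang's static memory reservation and the rollout concurrency; the values are illustrative, and the tensor parallelism degree is omitted here because it is configured through `allocation_mode`:

```bash
# Hypothetical values; tune them for your GPU memory budget.
python3 <your_training_entrypoint> \
    actor.sglang.mem_fraction_static=0.7 \
    max_concurrent_rollouts=64
```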
#### During `actor_inf` or `actor_train`

- Adjust the microbatch size: decrease the parameter `{actor_train|actor_inf}.mb_spec.max_tokens_per_mb=20480`. This parameter limits the number of tokens per forward/backward pass and can be set as low as the maximum sequence length (including the prompt); see the example after this list
- Modify the parallelism strategy by adjusting `allocation_mode`:
  - Reduce data parallelism
  - Increase tensor or pipeline parallelism
  - Prefer pipeline parallelism over tensor parallelism
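A sketch of the corresponding overrides; the token budget `8192` is illustrative, and `<your_parallelism_layout>` is a placeholder because the exact `allocation_mode` syntax depends on your deployment:

```bash
# Shrink the microbatch token budget for both the training and inference passes of the actor.
python3 <your_training_entrypoint> \
    actor_train.mb_spec.max_tokens_per_mb=8192 \
    actor_inf.mb_spec.max_tokens_per_mb=8192 \
    allocation_mode=<your_parallelism_layout>
```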
### CUDA Error: Out of Memory

This issue may occur during data transfer. Try increasing `mem_per_model_worker` in the CLI arguments.
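A sketch, assuming `mem_per_model_worker` follows the same CLI override style as the other parameters in this guide; the value is a placeholder, and its unit should be checked against your configuration reference:

```bash
# Give each model worker a larger memory budget (placeholder value).
python3 <your_training_entrypoint> \
    mem_per_model_worker=<larger_value>
```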