Debugging Guide#

Here’s how to debug AReaL training applications, including:

  • Debugging RolloutWorkflow with a persistent inference server;

  • Debugging custom RL algorithms;

  • Comparing rollout results between Transformers and the inference engine.

Debugging RolloutWorkflow with a Persistent Inference Server#

The trick is to launch a standalone, persistent inference server for your agent’s generation logic. This way, you can test repeatedly without restarting the server each time.

Why this works well:

  • Lightweight - Your debug program only needs CPU while inference runs on GPU

  • IDE friendly - Works perfectly with VS Code’s Python debugger

  • Fast iterations - No need to restart servers between debugging sessions

1. Launch the Standalone SGLang Server#

First, start your SGLang server with an inference-only allocation_mode like sglang.d4p1t1:

nohup python -m areal.launcher.local examples/math/gsm8k_grpo.py \
    --config examples/math/gsm8k_grpo.yaml \
    allocation_mode=sglang.d4p1t1 > llm_server.log 2>&1 &

Note: For debugging purposes, only the allocation_mode and sglang configurations matter; you can safely ignore everything else in the example YAML file. It is also strongly recommended to review the launch arguments related to the inference engine. For example, check whether sglang.enable_multimodal should be set for your model, since SGLang disables multimodal support by default for models such as Gemma3, Llama4, and Step3VL.

Once it’s running, you’ll find the server address in the log:

LLM inference server launched at: AREAL_LLM_SERVER_ADDRS=127.0.0.1:20082
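
Before moving on, you can quickly verify that the server is reachable from your debug machine. The snippet below is a minimal sketch using only the Python standard library; it assumes AREAL_LLM_SERVER_ADDRS holds one or more host:port pairs separated by commas, as in the log line above:

import os
import socket

# e.g. "127.0.0.1:20082" (assumed comma-separated if multiple servers were launched)
addrs = os.environ["AREAL_LLM_SERVER_ADDRS"].split(",")

for addr in addrs:
    host, port = addr.strip().rsplit(":", 1)
    try:
        # Open and immediately close a TCP connection to confirm the server is up
        with socket.create_connection((host, int(port)), timeout=5):
            print(f"{addr}: reachable")
    except OSError as exc:
        print(f"{addr}: NOT reachable ({exc})")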

2. Run Your Debug Program#

Create a debug script (e.g., agent_debug.py) with your custom workflow implementation:

import os

import torch
from torchdata.stateful_dataloader import StatefulDataLoader

# AReaL-specific imports (RemoteSGLangEngine, StatsLogger, cycle_dataloader,
# your dataset helper, and your workflow class) are omitted here.

# Create dataset and dataloaders
train_dataset = get_custom_dataset(...)
# Select a small subset of the dataset for debugging
train_dataset = train_dataset.select(range(config.train_dataset.batch_size))
train_dataloader = StatefulDataLoader(...)

# Initialize inference engine - reads server addresses from the
# AREAL_LLM_SERVER_ADDRS environment variable
rollout = RemoteSGLangEngine(config.rollout)
rollout.initialize(...)

# Create rollout workflow
workflow = MyWorkflow(...)

# Directory for dumping the generated batch
dump_dir = os.path.join(
    StatsLogger.get_log_path(config.stats_logger), "generated"
)
os.makedirs(dump_dir, exist_ok=True)

# Generate one batch of rollouts with the custom workflow
data_generator = cycle_dataloader(train_dataloader)
generated_data = rollout.rollout_batch(next(data_generator), workflow=workflow)

# Save generated data for later use (loaded again in the RL debugging section below)
torch.save(generated_data, os.path.join(dump_dir, "batch_data.pt"))

rollout.destroy()

Now run your debug script, passing the server address through the environment:

AREAL_LLM_SERVER_ADDRS=127.0.0.1:20082 \
    python agent_debug.py --config agent_debug.yaml \
    rollout.enable_rollout_tracing=True
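
Once the script completes, it helps to inspect the dumped batch offline and confirm your workflow produced the fields you expect. The sketch below assumes the saved object is dict-like (e.g., a TensorDict of tensors); the dump path is a placeholder, and the available keys depend on your workflow:

import torch

# Placeholder path: use the same dump_dir as in agent_debug.py
batch = torch.load("path/to/log_dir/generated/batch_data.pt", weights_only=False)

# Print every field with its shape and dtype to verify the batch layout
for key in batch.keys():
    value = batch[key]
    if torch.is_tensor(value):
        print(f"{key}: shape={tuple(value.shape)}, dtype={value.dtype}")
    else:
        print(f"{key}: {type(value).__name__}")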

Debugging Custom RL Algorithms#

If you’re using existing AReaL algorithms like GRPO, you can skip this section.

You can debug a custom RL algorithm just like offline training (think SFT): feed it pre-generated data instead of running inference.

This approach is great because:

  • No inference servers - You don’t need to manage any servers

  • Faster iterations - Skip the expensive data collection step

  • Reproducible - Use the same data across debugging sessions

  • Isolated testing - Focus purely on your RL logic

1. Configure Allocation Mode#

First, turn off SGLang inference in your config:

allocation_mode: d4p1t1

2. Create Your RL Debug Script#

Then create your debug script that loads the pre-generated data:

import os

import torch
import torch.distributed as dist
from torchdata.stateful_dataloader import StatefulDataLoader

# AReaL-specific imports (StatsLogger, your dataset helper, the actor/trainer
# object, and the tokenizer) are omitted here.

# Create dataset and dataloaders
train_dataset = get_custom_dataset(...)
train_dataloader = StatefulDataLoader(train_dataset, ...)

# Configure tokenizer stop tokens
if tokenizer.pad_token_id not in config.gconfig.stop_token_ids:
    config.gconfig.stop_token_ids.append(tokenizer.pad_token_id)
if tokenizer.eos_token_id not in config.gconfig.stop_token_ids:
    config.gconfig.stop_token_ids.append(tokenizer.eos_token_id)

# Load the previously generated data (saved by the rollout debug script above)
dump_dir = os.path.join(
    StatsLogger.get_log_path(config.stats_logger), "generated"
)
batch = torch.load(os.path.join(dump_dir, "batch_data.pt"), weights_only=False)

# Prepare batch for training
batch = batch.to('cuda')
dist.barrier(device_ids=[actor.device.index])
torch.cuda.synchronize()

# Your custom algorithm logic here
...
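
For example, a tight debugging loop can replace the "..." above: run your update several times on the same fixed batch and watch the loss. The sketch below is hypothetical; train_step stands in for whatever your algorithm does (loss computation, backward pass, optimizer step) and is not an AReaL API:

# Hypothetical loop continuing the script above: `actor` and `batch` come from
# that script, and `train_step` is your own update function.
for step in range(10):
    stats = train_step(actor, batch)  # e.g. returns {"loss": float, ...}
    loss = stats["loss"]
    print(f"step {step}: loss={loss:.6f}")
    assert torch.isfinite(torch.tensor(loss)), "loss is NaN/Inf - check your algorithm"

A healthy implementation usually shows the loss decreasing, or at least staying finite, when it repeatedly overfits the same batch.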

Comparing Rollout Results Between Transformers and the Inference Engine#

It is often useful to compare rollout results between Transformers and the inference engine to ensure consistency and correctness. Most models yield nearly identical results, but some can differ noticeably because the inference engine applies many optimizations to accelerate the forward pass.

If you suspect any discrepancies, or if your workflow involves models that lack first-class support in Transformers or SGLang, it is recommended to use a simple script to compare the outputs over a dataset. Refer to examples/docs/debug/cmp_rollout.py for a complete example, which compares the rollout results of google/gemma3-4b-it on the BUAADreamer/clevr_count_70k dataset.
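
If you only need a quick, self-contained check before adapting that script, the sketch below regenerates a completion with Hugging Face Transformers and compares it token-by-token against a rollout saved earlier. The model path, dump path, and the prompt_ids/completion_ids field names are placeholders; substitute whatever your workflow actually stores, and note that an exact match is only expected when both sides decode greedily (temperature 0):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "your/model-path"  # placeholder: the same model served by the inference engine
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto", device_map="cuda")

# Placeholder path and field names: adjust to your dump_dir and workflow outputs
batch = torch.load("path/to/log_dir/generated/batch_data.pt", weights_only=False)
prompt_ids = batch["prompt_ids"][0].unsqueeze(0).to("cuda")
engine_completion = batch["completion_ids"][0].cpu()

# Re-generate greedily with Transformers and strip the prompt prefix
hf_output = model.generate(prompt_ids, do_sample=False, max_new_tokens=engine_completion.numel())
hf_completion = hf_output[0, prompt_ids.shape[1]:].cpu()

# Compare token IDs and print both decoded completions for eyeballing
n = min(hf_completion.numel(), engine_completion.numel())
match_rate = (hf_completion[:n] == engine_completion[:n]).float().mean().item()
print(f"Token match rate over the first {n} tokens: {match_rate:.2%}")
print("HF:    ", tokenizer.decode(hf_completion[:n]))
print("Engine:", tokenizer.decode(engine_completion[:n]))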