Agentic Reinforcement Learning#

This guide demonstrates how to use AReaL to train agentic models with popular agent frameworks such as CAMEL-AI and the OpenAI Agents SDK, letting you leverage their agent orchestration capabilities while using AReaL’s distributed reinforcement learning training system.

Overview#

The core design philosophy of agentic RL in AReaL is unified training and deployment: the same agent code should run for both training and evaluation without modification. This is usually difficult because agent frameworks:

  1. Lack token-level access: Agent frameworks interact with language models through high-level APIs (e.g., OpenAI’s chat completion API), which do not expose token IDs and log probabilities needed for RL training.

  2. Have no reward mechanism: Agent frameworks are designed for inference and do not have built-in reward functions. RL training requires reward signals to guide policy optimization.

  3. Have limited parallelization: Standard agent usage involves sequential execution, making it difficult to efficiently collect diverse trajectories needed for RL training.

AReaL addresses these limitations by providing:

  1. Proxy model client with token-level tracking: AReaL sets up an HTTP proxy server that routes all LLM calls to AReaL’s inference engine (SGLang or vLLM). Every interaction is automatically tracked with complete token-level information (see the sketch after this list).

  2. Reward assignment and propagation: AReaL provides a flexible reward system that allows you to assign rewards to specific interactions or entire trajectories. The system automatically builds conversation trees and supports reward backpropagation.

  3. Parallel trajectory collection: AReaL’s workflow system enables concurrent execution of multiple agent instances, allowing you to collect diverse trajectories for each query.
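From the agent’s point of view, the proxy approach means nothing changes except where the OpenAI client points. A minimal sketch, using a hypothetical proxy address and key (during training, AReaL injects the real base_url and api_key, as shown in the OpenAI Agent example below):

from openai import AsyncOpenAI

# Hypothetical placeholders; AReaL supplies the real values at training time.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="proxy-key")

# This call is routed through AReaL's proxy to the inference engine
# (SGLang or vLLM), which records token IDs and log probabilities.
response = await client.chat.completions.create(
    model="default",  # placeholder; the proxy serves the model being trained
    messages=[{"role": "user", "content": "2 + 2 = ?"}],
)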

We demonstrate several concrete examples below. More examples can be found in the workflow/ directory.

Scheduler Compatibility: Agent workflows with the proxy approach are supported on local and slurm schedulers only. The ray scheduler is not supported because Ray’s actor-based programming model is inherently incompatible with HTTP proxy servers that require persistent connections between workers.

Examples#

Training with OpenAI Agent#

Step 1: Build a Runnable Agent#

Implement a standard agent loop with tool calling. This code snippet has nothing to do with AReaL: the agent can run against OpenAI’s official models given a proper API key and base URL.

Place the following content in a file named my_agent.py that AReaL can import:

import os

from agents import (
    Agent,
    OpenAIProvider,
    RunConfig,
    SQLiteSession,
    function_tool,
)
from agents import Runner as OpenAIRunner
from math_verify import parse, verify
from openai import AsyncOpenAI

@function_tool
def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b

@function_tool
def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

def math_reward_fn(completions: str, answer: str) -> float:
    ans = parse(completions)
    gold = parse(answer)
    return float(verify(ans, gold))

class MathAgent:
    async def run(self, data, **extra_kwargs):
        http_client = extra_kwargs.get("http_client", None)
        base_url = extra_kwargs.get("base_url", None) or os.getenv("OPENAI_BASE_URL")
        api_key = extra_kwargs.get("api_key", None) or os.getenv("OPENAI_API_KEY")
        # During training, AReaL injects `base_url`, `api_key`, and `http_client`
        # via `extra_kwargs`; set the environment variables for standalone runs.
        client = AsyncOpenAI(base_url=base_url, api_key=api_key, http_client=http_client, max_retries=0)
        content = data["messages"][-1]["content"]
        run_config = RunConfig(
            model_provider=OpenAIProvider(openai_client=client),
            model="default",  # no need to pass
            tracing_disabled=True,
        )
        agent = Agent(
            name="RLVR Math with Calculator",
            instructions="Answer the user's math questions using the available calculator tools. Don't give the answer directly, you must use tools to do the mathematical calculation.",
            tools=[add, multiply],
        )
        session = SQLiteSession("math")
        result = await OpenAIRunner.run(
            agent, input=content, session=session, run_config=run_config
        )
        # Return the scalar reward for this trajectory
        return math_reward_fn(completions=result.final_output, answer=data["answer"])
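
Before integrating with AReaL, you can smoke-test the agent standalone. A hypothetical check, assuming OPENAI_BASE_URL and OPENAI_API_KEY point at an OpenAI-compatible server that accepts the model name used above:

import asyncio

from my_agent import MathAgent

# Hypothetical sample in the format the agent expects.
data = {
    "messages": [{"role": "user", "content": "What is (3 + 5) * 2?"}],
    "answer": "16",
}
reward = asyncio.run(MathAgent().run(data))
print(reward)  # 1.0 if the verified answer matches, else 0.0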

Step 2: Integrate the Agent#

Pass the agent path to the trainer:

import sys

from areal import PPOTrainer
from areal.api.cli_args import load_expr_config, GRPOConfig
from areal.dataset import get_custom_dataset
from areal.utils.hf_utils import load_hf_tokenizer

def main(args):
    config, _ = load_expr_config(args, GRPOConfig)
    tokenizer = load_hf_tokenizer(config.tokenizer_path)

    train_dataset = get_custom_dataset(
        split="train",
        dataset_config=config.train_dataset,
        tokenizer=tokenizer,
    )

    with PPOTrainer(
        config,
        train_dataset=train_dataset
    ) as trainer:
        trainer.train(
            workflow="my_agent.MathAgent"
        )

if __name__ == "__main__":
    main(sys.argv[1:])

The full working OpenAI Agents training example is located in examples/agent_workflow/. To run the example on a single node:

python3 examples/agent_workflow/train.py \
    --config examples/agent_workflow/config.yaml \
    scheduler.type=local workflow=my_agent.MathAgent

Training with CAMEL-AI#

Legacy Pattern: The direct approach using ArealOpenAI with RolloutWorkflow is considered legacy. For new projects, prefer the proxy approach (as in the OpenAI Agent example above), which keeps your agent code independent of AReaL.

CAMEL-AI is an open-source, modular framework for building intelligent multi-agent systems. It provides a flexible agent architecture that can handle complex dialogue flows, tool calling, and multi-agent interactions.

Step 1: Write a CAMEL Agent#

A typical CAMEL agent is straightforward to write:

from camel.agents import ChatAgent

# Create a basic CAMEL agent
agent = ChatAgent(
    system_message="You are a helpful math assistant.",
    model="gpt-4o-mini",
)

# Run the agent
response = await agent.astep("Solve: 2 + 2 = ?")
print(response.msg.content)

Step 2: Convert to an RL-Trainable Agent#

To make this agent trainable with AReaL, replace the model with AReaL’s OpenAI-compatible model:

from camel.agents import ChatAgent
from areal.experimental.camel.openai_model import AReaLOpenAICompatibleModel
from areal.experimental.openai import ArealOpenAI

# Create AReaL's OpenAI-compatible client
client = ArealOpenAI(engine=engine, tokenizer=tokenizer)

# Replace the model with AReaL's OpenAI-compatible model
agent = ChatAgent(
    system_message="You are a helpful math assistant.",
    model=AReaLOpenAICompatibleModel(
        openai_client=client,
        tokenizer=tokenizer,
        model_type="areal"
    )
)

# Now the client (ArealOpenAI) records token-level information
response = await agent.astep("Solve: 2 + 2 = ?")

Step 3: Add Reward Evaluation#

After the agent responds, check if the answer is correct and set the reward:

def math_reward_fn(result, answer):
    """Simple reward function: 1.0 if correct, 0.0 otherwise."""
    return 1.0 if result.strip() == answer.strip() else 0.0

# Run the agent
response = await agent.astep("Solve: 2 + 2 = ?")

# Evaluate and set reward
reward = math_reward_fn(response.msg.content, "4")
client.set_last_reward(reward)

Step 4: Wrap the Agent in a Reusable Class#

To integrate the agent into AReaL’s training pipeline, wrap it in a class that manages the agent lifecycle and reward evaluation:

from areal.api.reward_api import AsyncRewardWrapper
from areal.experimental.camel.openai_model import AReaLOpenAICompatibleModel
from areal.experimental.openai import ArealOpenAI
from camel.agents import ChatAgent
from transformers import PreTrainedTokenizerFast

class CamelMathAgent:
    def __init__(self, tokenizer: PreTrainedTokenizerFast):
        self.tokenizer = tokenizer
        # Wrap the reward function from Step 3 for async execution
        self.async_reward_fn = AsyncRewardWrapper(math_reward_fn)

    async def run_agent(self, data, client: ArealOpenAI):
        """Run agent on a dataset sample."""
        # Create agent with AReaL OpenAI-compatible model
        agent = ChatAgent(
            system_message="You are a helpful math assistant.",
            model=AReaLOpenAICompatibleModel(...),  # same arguments as in Step 2
        )

        # Run agent
        response = await agent.astep(data["messages"][-1]["content"])
        content = response.msg.content

        # Evaluate reward and set on client
        reward = await self.async_reward_fn(result=content, answer=data["answer"])
        client.set_last_reward(reward)

        return reward

Step 5: Create the Rollout Workflow#

Integrate the agent into AReaL’s RolloutWorkflow:

from areal.api.cli_args import GenerationHyperparameters
from areal.api.workflow_api import RolloutWorkflow
from areal.experimental.openai import ArealOpenAI
from transformers import PreTrainedTokenizerFast

# CamelMathAgent is defined in Step 4
class CamelRLVRWorkflow(RolloutWorkflow):
    def __init__(
        self,
        gconfig: GenerationHyperparameters,
        tokenizer: PreTrainedTokenizerFast,
    ):
        self.gconfig = gconfig
        self.tokenizer = tokenizer
        self.agent = CamelMathAgent(tokenizer=self.tokenizer)

    async def arun_episode(self, engine, data):
        """Run one training episode."""
        # Create one client per trajectory
        client = ArealOpenAI(engine=engine, tokenizer=self.tokenizer)

        # Run agent
        reward = await self.agent.run_agent(data=data, client=client)

        # Apply reward discounting for multi-turn conversations
        client.apply_reward_discount(turn_discount=0.9)
        # Export interactions with token-level data
        return client.export_interactions(style="individual")

Key points:

  • Parallel episode execution: AReaL’s training loop calls arun_episode in parallel across multiple samples.

  • Reward discounting: For multi-turn conversations, rewards are discounted backward through the conversation tree (see the sketch after this list).

  • Interaction export: All interactions, together with their token-level data and rewards, are exported for RL training.
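
To make the discounting concrete, here is a hypothetical linear three-turn conversation where only the final turn was assigned a reward of 1.0; with turn_discount=0.9, earlier turns receive exponentially discounted copies:

# Hypothetical illustration only; AReaL applies this over its conversation tree.
turn_discount = 0.9
final_reward = 1.0
# Rewards for turns 1, 2, 3 after apply_reward_discount(turn_discount=0.9):
rewards = [final_reward * turn_discount ** k for k in (2, 1, 0)]
print(rewards)  # approximately [0.81, 0.9, 1.0]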

Step 6: Run the Training#

Use the workflow in AReaL’s training loop:

workflow = CamelRLVRWorkflow(
    gconfig=config.gconfig,
    tokenizer=tokenizer,
)

# AReaL will call workflow.arun_episode() for each batch

See the complete train script for a full working implementation.
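
For a quick single-sample check of the workflow outside the training loop, you can call arun_episode directly. A minimal sketch, assuming an inference engine is already constructed (the sample below is hypothetical):

# Hypothetical smoke test; `engine` must be an already-initialized
# AReaL inference engine.
data = {
    "messages": [{"role": "user", "content": "Solve: 2 + 2 = ?"}],
    "answer": "4",
}
interactions = await workflow.arun_episode(engine, data)
# `interactions` contains the recorded LLM calls with token-level
# data and rewards, ready for RL training.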

More Examples#

Beyond the two examples above, AReaL supports integration with various other agent frameworks and SDKs:

  • Claude Agent SDK: Train agents using Anthropic’s Claude Agent SDK with MCP tools. See the Claude example for a math agent with calculator tools.

  • LangChain: Integrate LangChain agents with AReaL’s training infrastructure. See the LangChain example for details.

Under the Hood#

For a detailed explanation of how AReaL’s agentic training infrastructure works, including the proxy server architecture, session lifecycle, token-level tracking, and reward backpropagation, see the Agent Workflow Reference.

Key topics covered in the reference:

  • Two integration paradigms: Proxy approach vs. direct approach

  • Architecture: Proxy server, endpoints, and inference engine layers

  • Session lifecycle: Capacity reservation, session management, and export

  • Token-level tracking: InteractionCache and InteractionWithTokenLogpReward

  • Reward system: Assignment methods and backpropagation algorithm

  • Workflow resolution: How AReaL detects and wraps agent workflows