# Overview
In the launching script, the CLI arguments are first converted into an experiment configuration of type `AsyncPPOMathConfig`. We then call the `initial_setup` method of the experiment configuration to obtain the worker configurations required for the experiment. Next, we launch several types of workers.
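To make this flow concrete, here is a minimal sketch of how CLI arguments can become an experiment configuration whose `initial_setup` expands into per-worker configurations. The field names and the `WorkerSpec` type are illustrative assumptions, not the actual definition of `AsyncPPOMathConfig`:

```python
# Hedged sketch: CLI args -> experiment config -> worker configurations.
# WorkerSpec and the config fields are assumptions for illustration.
import argparse
from dataclasses import dataclass


@dataclass
class WorkerSpec:
    kind: str   # e.g. "RolloutWorker", "ModelWorker"
    count: int  # how many instances to launch


@dataclass
class AsyncPPOMathConfig:
    n_rollout_workers: int
    n_model_workers: int

    def initial_setup(self) -> list[WorkerSpec]:
        # Expand the experiment-level config into worker configurations.
        return [
            WorkerSpec("RolloutWorker", self.n_rollout_workers),
            WorkerSpec("ModelWorker", self.n_model_workers),
        ]


parser = argparse.ArgumentParser()
parser.add_argument("--n-rollout-workers", type=int, default=2)
parser.add_argument("--n-model-workers", type=int, default=4)
args = parser.parse_args([])  # empty argv for the demo

cfg = AsyncPPOMathConfig(args.n_rollout_workers, args.n_model_workers)
specs = cfg.initial_setup()
print([(s.kind, s.count) for s in specs])
```

The launcher can then iterate over the returned worker configurations and start one process per worker instance.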
Among these workers, `GenerationServer`, `RolloutWorker`, and `GserverManager` are responsible for rollout in asynchronous RL, while `ModelWorker` and `MasterWorker` are responsible for training.
Each worker independently executes its `_poll` or `_poll_async` method in a while loop; this loop contains the worker's core logic.
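The polling pattern can be sketched as follows. The `Worker` base class, the exit flag, and the `_poll` return convention here are assumptions for illustration, not the framework's actual class hierarchy:

```python
# Minimal sketch of the worker polling pattern: each worker loops over
# _poll until it is told to exit. Names are illustrative assumptions.
import abc
import time


class Worker(abc.ABC):
    """Base class: subclasses implement their core logic in _poll."""

    def __init__(self):
        self._exiting = False

    @abc.abstractmethod
    def _poll(self) -> int:
        """Run one step of work; return the number of items processed."""

    def run(self):
        # The while loop that drives the worker's core logic.
        while not self._exiting:
            processed = self._poll()
            if processed == 0:
                time.sleep(0.05)  # back off when there is nothing to do

    def stop(self):
        self._exiting = True


class CountingWorker(Worker):
    """Toy worker that stops itself after three poll steps."""

    def __init__(self):
        super().__init__()
        self.steps = 0

    def _poll(self) -> int:
        self.steps += 1
        if self.steps >= 3:
            self.stop()
        return 1


w = CountingWorker()
w.run()
print(w.steps)  # 3
```

An async variant (`_poll_async`) follows the same shape, with the loop awaiting each poll step instead of calling it synchronously.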
> **Note:** For SFT and synchronous PPO, only trainer workers are launched.
For asynchronous RL, the trainer side treats the rollout side as a "dataset". Instead of loading from disk, this dataset pulls data from a TCP socket, as implemented in `stream_dataset.py`. This design unifies offline training workflows (e.g., SFT) and online RL workflows (e.g., PPO): switching between them only requires specifying a different dataset type.
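The idea of a dataset backed by a socket rather than a file can be sketched as below. The class name `StreamDataset` and the length-prefixed JSON wire format are illustrative assumptions; the actual protocol lives in `stream_dataset.py`. The demo uses an in-process socket pair standing in for the rollout side:

```python
# Hedged sketch of a "stream dataset" that pulls samples from a socket
# instead of loading from disk. The wire format (4-byte big-endian
# length prefix + JSON payload) is an assumption for illustration.
import json
import socket
import struct


class StreamDataset:
    """Iterates over samples received from a connected socket."""

    def __init__(self, sock: socket.socket):
        self.sock = sock

    def _recv_exact(self, n: int) -> bytes:
        buf = b""
        while len(buf) < n:
            chunk = self.sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("socket closed")
            buf += chunk
        return buf

    def __iter__(self):
        # Each message: 4-byte length header, then a JSON payload.
        while True:
            try:
                (length,) = struct.unpack(">I", self._recv_exact(4))
            except ConnectionError:
                return  # producer closed the stream; end of "dataset"
            yield json.loads(self._recv_exact(length))


# Demo: the "rollout side" pushes two samples, then closes.
producer, consumer = socket.socketpair()
for sample in [{"prompt": "a", "reward": 1.0}, {"prompt": "b", "reward": 0.5}]:
    payload = json.dumps(sample).encode()
    producer.sendall(struct.pack(">I", len(payload)) + payload)
producer.close()

samples = list(StreamDataset(consumer))
print(len(samples))  # 2
```

Because the trainer only sees an iterable of samples, the same training loop works whether the samples come from a file on disk (offline SFT) or from a live rollout stream (online RL).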