## Overview
In the launching script, the CLI arguments are first converted into an experiment configuration of type `AsyncPPOMathConfig`. We then call the `initial_setup` method of the experiment configuration to obtain the worker configurations required for the experiment. Next, we launch several types of workers.
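A minimal, self-contained sketch of this launch flow is shown below. Only `AsyncPPOMathConfig` and `initial_setup` come from the text; the stub classes, worker counts, and the `launch` helper are illustrative assumptions, not the actual implementation.

```python
from dataclasses import dataclass

@dataclass
class WorkerSpec:
    kind: str  # e.g. "generation_server", "model_worker"
    count: int = 1

@dataclass
class ExperimentConfig:
    """Stand-in for AsyncPPOMathConfig, built from CLI arguments."""
    n_rollout_workers: int = 2
    n_model_workers: int = 4

    def initial_setup(self) -> list[WorkerSpec]:
        # Derive the worker configurations required for the experiment.
        return [
            WorkerSpec("generation_server"),
            WorkerSpec("gserver_manager"),
            WorkerSpec("rollout_worker", self.n_rollout_workers),
            WorkerSpec("model_worker", self.n_model_workers),
            WorkerSpec("master_worker"),
        ]

def launch(config: ExperimentConfig) -> None:
    for spec in config.initial_setup():
        for i in range(spec.count):
            # In reality each worker runs as a separate process/job.
            print(f"launching {spec.kind} #{i}")

if __name__ == "__main__":
    launch(ExperimentConfig())
```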
Among these workers, `GenerationServer`, `RolloutWorker`, and `GserverManager` are responsible for rollout in asynchronous RL; `ModelWorker` and `MasterWorker` are responsible for training.
Each worker independently executes its `_poll` or `_poll_async` method in a while loop; this method implements the worker's core logic.
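The following is a conceptual sketch of that run loop, assuming a simple base class; the actual worker base class in the codebase also handles configuration, scheduling, and error recovery, and an async worker would instead `await` a `_poll_async` coroutine inside an event loop.

```python
class Worker:
    """Illustrative stub of a worker with a polling run loop."""

    def __init__(self) -> None:
        self._exiting = False

    def _poll(self) -> None:
        # One iteration of the worker's core logic,
        # overridden by each worker type.
        raise NotImplementedError

    def run(self) -> None:
        # The while loop mentioned above: poll until told to exit.
        while not self._exiting:
            self._poll()

    def exit(self) -> None:
        self._exiting = True
```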
> **Note**
> For SFT and synchronous PPO, only trainer workers are launched.
For asynchronous RL, the trainer side treats the rollout side as a "dataset". This "dataset" does not load from disk but instead pulls data from a TCP socket, as implemented in `stream_dataset.py`. This approach unifies offline training workflows (e.g., SFT) and online RL workflows (e.g., PPO): switching between them is simply a matter of specifying a different dataset type.
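Below is a minimal sketch of such a streaming dataset, in the spirit of `stream_dataset.py`: `__iter__` blocks on a TCP socket and yields whatever the rollout side sends. The PyTorch `IterableDataset` base class and the length-prefixed pickle framing are assumptions made for illustration, not the actual wire format used by the codebase.

```python
import pickle
import socket
import struct

from torch.utils.data import IterableDataset

class StreamDataset(IterableDataset):
    """Hypothetical dataset that pulls samples from a TCP socket."""

    def __init__(self, host: str, port: int) -> None:
        self.host, self.port = host, port

    @staticmethod
    def _recv_exact(conn: socket.socket, n: int) -> bytes:
        # Read exactly n bytes, raising if the peer closes the stream.
        buf = b""
        while len(buf) < n:
            chunk = conn.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("rollout side closed the stream")
            buf += chunk
        return buf

    def __iter__(self):
        with socket.create_connection((self.host, self.port)) as conn:
            while True:
                # Read a 4-byte big-endian length header, then the payload.
                (size,) = struct.unpack(">I", self._recv_exact(conn, 4))
                yield pickle.loads(self._recv_exact(conn, size))
```

Swapping this class for an ordinary disk-backed dataset recovers the offline (SFT) path, which is what lets the trainer code stay identical across the two workflows.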