Evaluation#
Option 1: Using AReaL’s Standalone Evaluation Package#
The evaluation code is located in the evaluation folder of the repository. Following the previous tutorial, trained checkpoints will be saved under ${fileroot}/checkpoints/${USER}/${experiment_name}/${trial_name}/.
Setup Evaluation Environment#
Note: Evaluation requires updates to certain Python libraries like vLLM, so avoid using the training container or virtual environment for this task.
From the repository directory, create a new conda environment:
conda create -n areal-eval python=3.12
conda activate areal-eval
Install dependencies:
bash examples/env/setup-eval-pip-deps.sh
Run Evaluation#
Specify an output_path to save the test results. If not specified, the results will be saved in model_path.
Math Evaluation#
cd evaluation
nohup python eval_and_aggregate.py \
--model_path /path/to/checkpoint \
--output_path /path/to/outputs \
--max_gen_tokens 32768 \
--data_names math_500,aime24,amc23 \
--prompt_type qwen3-think \
--task math &> eval_and_aggregate_parallel.log &
Code Evaluation#
Obtaining Data:
Due to the size of code datasets (some test cases are relatively large), we have uploaded all our code datasets to Hugging Face.
Once you have downloaded the code dataset, place it under ./evaluation/data/.
Running Evaluation:
cd evaluation
# Total samples per problem = num_sample_nodes * samples_per_node
num_sample_nodes=8
samples_per_node=1
nohup python eval_and_aggregate.py \
--model_path /path/to/checkpoint \
--output_path /path/to/outputs \
--max_gen_tokens 32768 \
--data_names codeforces,lcb_v5 \
--prompt_type qwen3-think-pure \
--temperature 1.0 \
--top_p 0.95 \
--num_sample_nodes ${num_sample_nodes} \
--samples_per_node ${samples_per_node} \
--n_sampling $((num_sample_nodes * samples_per_node)) \
--task code &> eval_and_aggregate_parallel.log &
Command Line Parameters#
--model_path: Path to the saved model parameters.
--output_path: Path to store generated answers and log files during evaluation.
--data_names: Dataset(s) to evaluate. Multiple datasets can be separated by commas. Available options:
Math: math_500, aime24, aime25, amc23
Code: lcb_v5, lcb_v5_2410_2502, codeforces, code_contest_all
--max_gen_tokens: Maximum length of generated answers (default: 32768).
--prompt_type: Prompt template to use. For our latest model, we use qwen3-think for math datasets and qwen3-think-pure for code datasets.
--num_sample_nodes: Number of sampling seeds, used to ensure sampling diversity.
--samples_per_node: Number of samples to generate per seed for each problem.
Logs and Evaluation Results#
Check ${output_path}/math_eval_${max_gen_tokens}/logs to review the log of each worker.
The evaluation script will output a results table in the terminal:
+----------+---------------+---------------+---------------+------------+---------------+--------+---------+
| dataset | num_questions | greedy_length | sample_length | greedy_acc | sample_pass@1 | pass@8 | pass@16 |
+----------+---------------+---------------+---------------+------------+---------------+--------+---------+
| math_500 | 500 | 6757.4 | 4139.5 | 84.4 | 92.7 | 97.3 | 97.7 |
| aime24 | 30 | 19328.0 | 13663.5 | 50.0 | 50.4 | 77.3 | 80.0 |
| amc23 | 40 | 8850.0 | 6526.2 | 80.0 | 90.5 | 96.8 | 98.8 |
+----------+---------------+---------------+---------------+------------+---------------+--------+---------+
Metrics Explanation#
{greedy|sample}_length: Average answer length under the greedy or random sampling strategy.
greedy_acc: Average accuracy under greedy sampling.
sample_pass@{k}: Probability of generating a correct answer within k attempts under random sampling (see the estimator sketch below).
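The exact aggregation inside eval_and_aggregate.py is not spelled out here; for reference, below is a minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021). The function name and the example numbers are illustrative, not taken from the repository.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: probability that at least one of k samples
    # drawn without replacement from n generations, of which c are correct,
    # solves the problem.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative example: 16 samples for one question, 9 of them correct.
print(pass_at_k(n=16, c=9, k=8))  # per-question estimate of pass@8

The per-dataset pass@k is then the average of this quantity over all questions.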
For the Codeforces dataset, we use the Elo ranking algorithm to evaluate model performance, following CodeElo and rllm:
+------------+----------------+-----------+
| Dataset | Percentile (%) | CF Rating |
+------------+----------------+-----------+
| codeforces | 17.9 | 590 |
+------------+----------------+-----------+
CF Rating: The overall Elo rating of the model across 57 Codeforces contests.
Percentile: The Elo ranking percentile of the model among all Codeforces users.
Note: As the penalty mechanism may cause fluctuations in Elo rankings, we suggest performing multiple evaluations and taking the average score as the final result.
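The precise ranking procedure is defined by CodeElo and rllm. Roughly, for each contest the model's rating is inferred from the rank it achieves against the real participants via the Elo win-probability model, and per-contest ratings are then aggregated. Below is a rough sketch of that idea only; the function names, search bounds, and iteration count are illustrative and are not the repository's implementation.

def expected_rank(model_rating: float, participant_ratings: list[float]) -> float:
    # Elo win probability: a participant with rating r beats the model with
    # probability 1 / (1 + 10 ** ((model_rating - r) / 400)). The model's
    # expected rank is 1 plus the expected number of participants that beat it.
    return 1.0 + sum(
        1.0 / (1.0 + 10 ** ((model_rating - r) / 400.0))
        for r in participant_ratings
    )

def estimate_rating(achieved_rank: float, participant_ratings: list[float]) -> float:
    # Binary-search the rating whose expected rank matches the rank the model
    # actually achieved in this contest.
    lo, hi = 0.0, 4000.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if expected_rank(mid, participant_ratings) > achieved_rank:
            lo = mid  # predicted rank is worse than achieved, so the rating is too low
        else:
            hi = mid
    return (lo + hi) / 2.0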
Configuration Details#
Sampling Parameters#
The evaluation script defaults to averaging 32 samples with temperature 1.0. For the code dataset, we set it to 8 samples.
We observed that the enforce_eager parameter in vLLM significantly impacts evaluation performance. With enforce_eager=True, we can reproduce the model performance reported in previous work; without it, evaluation results may fall below the reported numbers. Therefore, we set enforce_eager=True during evaluation.
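For reference, here is a minimal vLLM sketch mirroring the settings described above (enforce_eager=True, temperature 1.0, top-p 0.95, 32K generation budget). This is not the repository's evaluation code; the model path, prompt, and sample count are placeholders.

from vllm import LLM, SamplingParams

# Placeholder model path; point this at your saved checkpoint.
llm = LLM(model="/path/to/checkpoint", enforce_eager=True)

sampling_params = SamplingParams(
    n=8,               # samples per prompt (the script uses 32 for math, 8 for code)
    temperature=1.0,
    top_p=0.95,
    max_tokens=32768,  # matches --max_gen_tokens
)

outputs = llm.generate(["<your prompt here>"], sampling_params)
for completion in outputs[0].outputs:
    print(completion.text)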
Runtime Expectations#
Due to the sampling requirements and the enforce_eager setting, the evaluation process typically takes considerable time.
Runtime depends on several factors:
Maximum generation length
Number of questions in the dataset
Model size
Performance benchmarks (on 8x H100 GPUs):
AIME dataset: ~80 minutes
MATH_500 dataset: ~160 minutes
Option 2: Using AReaL-lite’s Integration for RL Rollout#
When using AReaL-lite, you can leverage the existing workflow from your training script for evaluation purposes. This approach allows you to reuse the rollout components you’ve already implemented, making the transition from training to evaluation seamless.
Setting Up Evaluation#
To set up evaluation using this method, create a new script that contains only the rollout portion of your original training script. This focused script should include the data collection and inference logic without the training components.
Once you have prepared your evaluation script, launch the evaluation experiment using the following command:
python3 -m areal.launcher.${local|slurm|ray} my_script.py --config my-config.yaml allocation_mode=sglang.d4p1t2+eval
Key Differences from Training#
The primary difference between the training and evaluation commands lies in the allocation_mode parameter. In the evaluation setup, we remove the allocation for training processes and replace it with an eval annotation. This modified allocation mode launches CPU processes that run your customized workflow and collect data from the LLM server, optimizing resource usage for evaluation tasks.
Note: Optimal configurations often differ significantly between training and evaluation phases. As a user, you are responsible for ensuring that your evaluation experiment configuration and evaluation metrics are appropriate for your specific use case. Pay particular attention to settings such as GenerationHyperparameters and the reward function to ensure they align with your evaluation objectives rather than your training setup.