Evaluation#
The evaluation code is located in the evaluation folder of the repository. Following the previous tutorial, trained checkpoints will be saved under ${fileroot}/checkpoints/${USER}/${experiment_name}/${trial_name}/.
Setup Evaluation Environment#
Note: Evaluation requires updates to certain Python libraries, so avoid using the training container or virtual environment for this task.
From the repository directory, create a new conda environment:
conda create -n areal-eval python=3.12
conda activate areal-eval
Install dependencies:
bash examples/env/scripts/setup-eval-pip-deps.sh
Run Evaluation#
Specify an output_path to save the test results. If not specified, the results will be saved in model_path.
Math Evaluation#
cd evaluation
nohup python eval_and_aggregate.py \
--model_path /path/to/checkpoint \
--output_path /path/to/outputs \
--max_gen_tokens 32768 \
--data_names math_500,aime24,amc23 \
--prompt_type qwen3-think \
--task math &> eval_and_aggregate_parallel.log &
Code Evaluation#
Obtaining Data:
Due to the size of code datasets (some test cases are relatively large), we have uploaded all our code datasets to Hugging Face. Once you have downloaded the code dataset, place it under ./evaluation/data/.
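If the dataset is hosted as a Hugging Face dataset repository, one way to fetch it is with huggingface_hub; the snippet below is a sketch, and the repository ID is a placeholder rather than the actual name.

from huggingface_hub import snapshot_download

# Placeholder repo ID: substitute the dataset repository published for this project.
snapshot_download(
    repo_id="<org>/<code-eval-dataset>",
    repo_type="dataset",
    local_dir="./evaluation/data/",
)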
Running Evaluation:
cd evaluation
# Define the sampling settings up front so the $((...)) arithmetic below expands correctly.
num_sample_nodes=8
samples_per_node=1
nohup python eval_and_aggregate.py \
--model_path /path/to/checkpoint \
--output_path /path/to/outputs \
--max_gen_tokens 32768 \
--data_names codeforces,lcb_v5 \
--prompt_type qwen3-think-pure \
--num_sample_nodes ${num_sample_nodes} \
--samples_per_node ${samples_per_node} \
--n_sampling $((num_sample_nodes * samples_per_node)) \
--task code &> eval_and_aggregate_parallel.log &
Command Line Parameters#
--model_path: Path to the saved model parameters.
--output_path: Path to store generated answers and log files during evaluation.
--data_names: Dataset(s) to evaluate. Multiple datasets can be separated by commas. Available options:
Math: math_500, aime24, aime25, amc23
Code: lcb_v5, lcb_v5_2410_2502, codeforces, code_contest_all
--max_gen_tokens: Maximum length of generated answers (default: 32768).
--prompt_type: Prompt template to use. For our latest model, we use qwen3-think for math datasets and qwen3-think-pure for code datasets.
--num_sample_nodes: Number of sampling seeds, used to ensure sampling diversity (see the sketch after this list).
--samples_per_node: Number of samples to generate per seed for each problem.
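To make the relationship between the sampling flags concrete, here is a small sketch; the variable names simply mirror the flags, and the script's internal bookkeeping may differ.

# Total completions drawn per problem, i.e. the value passed to --n_sampling.
num_sample_nodes = 8   # independent sampling seeds
samples_per_node = 1   # completions generated per seed
n_sampling = num_sample_nodes * samples_per_node
print(n_sampling)  # 8

# pass@k can only be estimated from at least k samples per problem,
# so reporting pass@8 requires n_sampling >= 8.
assert n_sampling >= 8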
Logs and Evaluation Results#
Check ${output_path}/math_eval_${max_gen_tokens}/logs
to review the log of each worker.
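To skim the worker logs programmatically, a minimal sketch (the directory layout follows the path above; the log file naming inside it is an assumption):

import glob

# Adjust output_path and max_gen_tokens to match your run.
log_dir = "/path/to/outputs/math_eval_32768/logs"
for log_path in sorted(glob.glob(f"{log_dir}/*")):
    with open(log_path) as f:
        content = f.read()
    print(f"===== {log_path} =====")
    print(content[-2000:])  # roughly the last 2000 characters of each worker log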
The evaluation script will output a results table in the terminal:
+----------+---------------+---------------+---------------+------------+---------------+--------+---------+
| dataset | num_questions | greedy_length | sample_length | greedy_acc | sample_pass@1 | pass@8 | pass@16 |
+----------+---------------+---------------+---------------+------------+---------------+--------+---------+
| math_500 | 500 | 6757.4 | 4139.5 | 84.4 | 92.7 | 97.3 | 97.7 |
| aime24 | 30 | 19328.0 | 13663.5 | 50.0 | 50.4 | 77.3 | 80.0 |
| amc23 | 40 | 8850.0 | 6526.2 | 80.0 | 90.5 | 96.8 | 98.8 |
+----------+---------------+---------------+---------------+------------+---------------+--------+---------+
Metrics Explanation#
{greedy|sample}_length: Average answer length under the greedy or random sampling strategy.
greedy_acc: Average accuracy under greedy decoding.
sample_pass@{k}: Probability of generating a correct answer within k attempts under random sampling (see the sketch below).
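For reference, pass@k is commonly estimated from n samples per problem with the unbiased estimator from the Codex paper; the sketch below assumes that convention, and the evaluation script's exact implementation may differ.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n total samples, c of which are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any subset of k samples contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 32 samples drawn for one problem, 20 of them correct.
print(round(pass_at_k(n=32, c=20, k=8), 4))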
For the Codeforces dataset, we use the Elo ranking algorithm to evaluate model performance, following CodeElo and rllm:
+------------+----------------+-----------+
| Dataset | Percentile (%) | CF Rating |
+------------+----------------+-----------+
| codeforces | 17.9 | 590 |
+------------+----------------+-----------+
CF Rating: The model's overall Elo rating across 57 Codeforces contests.
Percentile: The model's Elo ranking percentile among all Codeforces users.
Note: As the penalty mechanism may cause fluctuations in Elo rankings, we suggest performing multiple evaluations and taking the average score as the final result.
Configuration Details#
Sampling Parameters#
By default, the evaluation script averages over 32 samples drawn at temperature 1.0; for code datasets, we use 8 samples.
We observed that the enforce_eager parameter in vLLM significantly impacts evaluation performance. When enforce_eager=True, we can reproduce the model performance reported in previous work; without this setting, evaluation results may fall below the reported numbers. Therefore, we set enforce_eager=True during evaluation.
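For illustration, these settings roughly correspond to the following vLLM configuration; this is a sketch only, the model path is a placeholder, and the evaluation script drives vLLM internally rather than through this snippet.

from vllm import LLM, SamplingParams

# enforce_eager=True disables CUDA graph capture and runs the model in eager mode,
# which we found necessary to reproduce previously reported results.
llm = LLM(model="/path/to/checkpoint", enforce_eager=True)

sampling_params = SamplingParams(
    n=32,              # 32 samples per prompt for math datasets (8 for code)
    temperature=1.0,
    max_tokens=32768,  # matches --max_gen_tokens
)
outputs = llm.generate(["<your prompt here>"], sampling_params)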
Runtime Expectations#
Due to the sampling requirements and the enforce_eager setting, the evaluation process typically takes considerable time.
Runtime depends on several factors:
Maximum generation length
Number of questions in the dataset
Model size
Performance benchmarks (on 8x H100 GPUs):
AIME dataset: ~80 minutes
MATH_500 dataset: ~160 minutes