
PromptCoT & PromptCoT-Mamba: Advancing the Frontiers of Reasoning

inclusionAI
Ant Group

News

  • May 30, 2025: PromptCoT-Mamba released! Introducing an attention-free foundation model for reasoning tasks.
  • Apr 11, 2025: PromptCoT-QwQ-32B model and its training data released, achieving new state-of-the-art results.
  • Mar 7, 2025: PromptCoT project launched, including the problem generation model, distilled models (PromptCoT-DS series), and associated datasets.

Overview

This repository unifies two synergistic projects aimed at advancing the frontiers of mathematical and code reasoning in Large Language Models (LLMs): PromptCoT and PromptCoT-Mamba.

PromptCoT (Synthesizing Olympiad-Level Problems for Mathematical Reasoning in Large Language Models) addresses the critical challenge of acquiring high-quality, complex problems for training advanced LLMs. It introduces a novel methodology to systematically generate Olympiad-level mathematical problems by modeling the rationale behind expert problem design. This approach not only enhances problem diversity and difficulty but also ensures logical consistency in problem construction, providing a scalable solution for creating robust training datasets.
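Under this rationale-driven recipe, synthesis can be sketched as two generation calls: first produce the expert design rationale from sampled concepts, then condition on both to produce the problem itself. The sketch below is illustrative only, not the released pipeline; `synthesize_problem`, the prompt wording, and the stub generator are hypothetical stand-ins for the open-sourced generation model.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SynthesizedProblem:
    concepts: List[str]   # sampled math concepts that seed generation
    rationale: str        # the modeled "expert thought process"
    problem: str          # the final Olympiad-style problem statement

def synthesize_problem(concepts: List[str],
                       generate: Callable[[str], str]) -> SynthesizedProblem:
    """Two-step, rationale-then-problem generation (schematic)."""
    # Step 1: ask the model how an expert would combine the given
    # concepts into a hard problem (the design rationale).
    rationale = generate(
        "Concepts: " + "; ".join(concepts) +
        "\nExplain how an expert would combine these into a hard problem."
    )
    # Step 2: condition on concepts + rationale to write the problem.
    problem = generate(
        "Concepts: " + "; ".join(concepts) +
        "\nRationale: " + rationale +
        "\nNow state the competition problem."
    )
    return SynthesizedProblem(concepts, rationale, problem)

# Demo with a stub in place of the released generation model.
stub = lambda prompt: "<model output for: " + prompt.splitlines()[0] + ">"
result = synthesize_problem(["modular arithmetic", "pigeonhole principle"], stub)
```

Keeping the rationale as an explicit intermediate output is what lets the pipeline control difficulty and check logical consistency before a problem enters the training set.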

PromptCoT-Mamba (Scaling Reasoning without Attention) leverages the problem generation capabilities of the PromptCoT pipeline to train PromptCoT-Mamba-7B, the first attention-free foundation model based on the Mamba-2 architecture. This model demonstrates that structured training curricula can enable attention-free models to surpass strong Transformer baselines on a wide array of competition-level math and code reasoning tasks, all while maintaining constant-memory inference without KV caching.
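The practical appeal of an attention-free backbone is that generation carries a fixed-size recurrent state instead of a KV cache that grows with every token. A minimal scalar state-space recurrence (a toy stand-in for Mamba-2's selective SSM, not the actual architecture) makes the constant-memory property concrete:

```python
def ssm_step(h, x, a=0.9, b=1.0, c=1.0):
    """One recurrent step of a scalar linear state-space model:
    h_t = a*h_{t-1} + b*x_t,  y_t = c*h_t."""
    h = a * h + b * x
    return h, c * h

def ssm_scan(xs):
    """Process a sequence holding O(1) state, regardless of its length."""
    h, ys = 0.0, []
    for x in xs:
        h, y = ssm_step(h, x)
        ys.append(y)
    return ys

# Impulse response decays geometrically: 1.0, 0.9, ~0.81, ...
ys = ssm_scan([1.0, 0.0, 0.0])
```

However long the input, the model only ever stores `h`; a Transformer at the same step would be holding keys and values for every previous token.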

Together, these projects offer a powerful suite of tools, models, and datasets for researchers and developers working on the cutting edge of AI reasoning.


Highlights & Key Results

1. PromptCoT: Problem Generation & Distilled Models

  • ✨ The Missing Piece for Test-Time Scaling: A lightweight yet powerful problem generation model that builds prompt sets of any scale with sufficient quality, making it well suited to SFT or RL post-training.
  • 📖 A Fully Open Project: All models (generation, distilled LLMs) and datasets (generation inputs, SFT data) are open-sourced.
  • 🏆 Superior Performance of Distilled Models:
    • PromptCoT-DS-7B consistently surpasses its base model, DeepSeek-R1-Distill-Qwen-7B, with significant gains:
      • +0.9% on MATH-500 (93.7%)
      • +3.2% on AIME2024 (58.7%)
      • +9.2% on AIME2025 (49.2%)
    • PromptCoT-DS-7B (7B parameters) achieves results comparable to larger 32B models like S1-32B and LIMO-32B.
    • PromptCoT-QwQ-32B sets a new standard, outperforming other 32B models by a significant margin:
      • MATH-500: 96.7% ± 0.5%
      • AIME2024: 83.8% ± 2.8%
      • AIME2025: 75.4% ± 4.7%
    • PromptCoT-DS-1.5B demonstrates competitive performance against RL-based models purely through distillation.
  • ⚡ Efficiency Without Compromise: PromptCoT-DS-1.5B reaches AIME scores above 40% while using over 15× fewer A100 GPU hours than models like DeepScaleR-1.5B-Preview.

2. PromptCoT-Mamba: Attention-Free Reasoning

  • 🚀 First Attention-Free SOTA: PromptCoT-Mamba-7B is the first attention-free model (Mamba-2 architecture) to outperform strong Transformer baselines in math and code reasoning.
  • 🧠 Trained with PromptCoT Pipeline: Utilizes a structured, two-stage curriculum with data generated by PromptCoT.
  • 💪 Strong General Performance: PromptCoT-Mamba-7B consistently outperforms 7B-scale Transformer and hybrid Mamba-Transformer baselines.
    • MATH-500: 84.6%
    • AIME 2024: 35.2%
    • AIME 2025: 24.6%
    • LiveCodeBench: 29.9%
  • 🎯 Math Specialization: The math-specialized variant, PromptCoT-Mamba-Math-7B, further boosts math performance:
    • MATH-500: 88.0%
    • AIME 2024: 42.9% (+7.7% over generalist)
    • AIME 2025: 30.8% (+6.2% over generalist)
  • Inference Efficiency: Offers substantial speedups (e.g., 3.66× faster on a 24GB GPU for long sequences) and constant-memory inference, ideal for cost-sensitive or long-context workloads.
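To see why constant-memory inference matters, here is a back-of-the-envelope comparison of Transformer KV-cache memory (linear in sequence length) against a fixed recurrent state. All layer/head/width numbers below are illustrative assumptions, not the released model's configuration:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # Keys and values (the leading 2x) are stored per layer, per token:
    # memory grows linearly with the generated sequence length.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

def recurrent_state_bytes(n_layers=32, state_size=128, d_inner=4096, dtype_bytes=2):
    # An SSM layer carries one fixed-size state per layer:
    # the footprint is independent of how many tokens were generated.
    return n_layers * state_size * d_inner * dtype_bytes

# With these assumed sizes, the KV cache is 0.5 GiB at 4k tokens and
# 8x that at 32k, while the recurrent state never grows.
kv_4k = kv_cache_bytes(4096)
kv_32k = kv_cache_bytes(32768)
state = recurrent_state_bytes()
```

On a 24GB GPU, the growing cache is what eventually forces small batch sizes or evictions at long context; the fixed state sidesteps that entirely.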

Performance Details

PromptCoT Series Performance

| Model | GSM8K | MATH-500 | AIME2024 | AIME2025 |
| --- | --- | --- | --- | --- |
| **🔹 1.5B Models** | | | | |
| DeepSeek-R1-Distill-Qwen-1.5B | - | 83.9% | 28.9% | 28.1% |
| STILL-3-1.5B-preview | - | 85.5% | 39.3% | - |
| DeepScaleR-1.5B-Preview | - | 🟢 87.8% | 🟢 43.1% | 🟢 37.1% |
| PromptCoT-DS-1.5B (ours) | 🟢 87.6% ± 0.5% | 85.3% ± 1.1% | 41.2% ± 6.9% | 36.7% ± 6.2% |
| **🔹 7B Models** | | | | |
| DeepSeek-R1-Distill-Qwen-7B | - | 92.8% | 55.5% | 40.0% |
| Qwen2.5-7B-SimpleRL | - | 82.4% | 26.7% | - |
| OpenThinker-7B | - | 89.6% | 30.0% | 33.3% |
| OpenR1-Qwen-7B | - | 90.6% | 36.7% | 40.0% |
| PromptCoT-DS-7B (ours) | 🔥 92.8% ± 0.5% | 🔥 93.7% ± 0.7% | 🔥 58.7% ± 3.1% | 🔥 49.2% ± 7.9% |
| **🔹 32B Models** | | | | |
| DeepSeek-R1-Distill-Qwen-32B | - | 94.3% | 72.6% | - |
| S1-32B | - | 93.0% | 56.7% | 26.6% |
| LIMO-32B | - | 94.8% | 57.1% | 46.6% |
| QwQ-32B | - | - | 82.1% | 70.8% |
| PromptCoT-QwQ-32B (ours) | 🔥🔥 96.4% ± 0.2% | 🔥🔥 96.7% ± 0.5% | 🔥🔥 83.8% ± 2.8% | 🔥🔥 75.4% ± 4.7% |

PromptCoT-Mamba Performance

General Performance:

| Model | MATH-500 | AIME 24 | AIME 25 | OlympiadBench | HumanEval | HumanEval+ | LiveCodeBench |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PromptCoT-Mamba-7B | 84.6 | 🔥🔥 35.2 | 🔥🔥 24.6 | 50.7 | 81.7 | 75.0 | 🔥🔥 29.9 |
| Gemma3-27B | 89.0 | 32.6 | 24.0 | 54.2 | 86.0 | 78.0 | 26.9 |
| Gemma3-12B | 83.8 | 22.9 | 19.2 | 49.9 | 81.1 | 73.2 | 22.2 |
| Sky-T1-7B | 85.0 | 19.2 | 19.2 | 49.2 | 41.5 | 37.2 | 18.3 |
| S1.1-7B | 82.0 | 19.2 | 17.5 | 43.1 | 64.0 | 56.7 | 13.3 |
| Bespoke-Stratos-7B | 81.2 | 18.3 | 16.3 | 45.0 | 73.2 | 68.3 | 8.6 |
| Nemotron-H-8B | 77.6 | - | - | - | 79.3 | 74.4 | - |
| M1-3B | 81.7 | 23.0 | 22.0 | 43.6 | - | - | - |

Math Specialization vs. Generalist:

| Model | MATH-500 | AIME 24 | AIME 25 | OlympiadBench | HumanEval | HumanEval+ | LiveCodeBench |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PromptCoT-Mamba-Math-7B | 🔥🔥 88.0 | 🔥🔥 42.9 | 🔥🔥 30.8 | 🔥🔥 52.1 | 71.3 | 66.5 | 20.3 |
| PromptCoT-Mamba-7B | 84.6 | 35.2 | 24.6 | 50.7 | 81.7 | 75.0 | 29.9 |

Citation

If you find PromptCoT or PromptCoT-Mamba useful in your research, please consider citing the respective papers:

For PromptCoT:

@article{zhao2025promptcot,
  author  = {Zhao, Xueliang and Wu, Wei and Guan, Jian and Kong, Lingpeng},
  title   = {PromptCoT: Synthesizing Olympiad-Level Problems for Mathematical Reasoning in Large Language Models},
  journal = {arXiv preprint arXiv:2503.02324},
  year    = {2025},
  url     = {http://arxiv.org/abs/2503.02324}
}

For PromptCoT-Mamba:

@article{zhao2025scaling,
  author  = {Zhao, Xueliang and Wu, Wei and Kong, Lingpeng},
  title   = {Scaling Reasoning without Attention},
  journal = {arXiv preprint arXiv:2505.22425},
  year    = {2025},
  url     = {https://arxiv.org/abs/2505.22425}
}