Michael Luo, Sijun Tan, Justin Wong†, Xiaoxiang Shi, William Tang, Manan Roongta, Colin Cai, Jeffrey Luo**

Advisors: Tianjun Zhang, Li Erran Li, Raluca Ada Popa, Ion Stoica*

*: Project Leads; †: Significant Contributor

<aside> ✨

TL;DR

RL magic is in the air! We introduce DeepScaleR-1.5B-Preview, a language model fine-tuned from DeepSeek-R1-Distill-Qwen-1.5B using simple reinforcement learning (RL). It achieves an impressive 43.1% Pass@1 accuracy on AIME 2024 (+14.3% over the base model), surpassing the performance of OpenAI’s o1-preview with just 1.5B parameters. We have open-sourced our dataset, code, and training logs so everyone can make progress on scaling intelligence with RL.

🌐 Website, 👨‍💻 Github, 🤗 HF Model, 🤗 HF Dataset, 📈 Wandb Logs, 🔎 Eval Logs

</aside>

DeepScaleR-1.5B-Preview

| Model | AIME 2024 | MATH 500 | AMC 2023 | Minerva Math | Olympiad Bench | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| DeepScaleR-1.5B-Preview | 43.1 | 87.8 | 73.6 | 30.2 | 50.0 | 57.0 |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 | 48.9 |
| o1-preview | 40.0 | 81.4 | - | - | - | - |

All values are Pass@1 accuracy (%).

Figure 1: DeepScaleR’s Pass@1 accuracy on AIME 2024 as training progresses. At steps 1040 and 1520, the context length is extended to 16K and 24K, respectively.


In this blog post, we take a step towards unveiling the recipe for using RL to turn a small model into a strong reasoning model. We introduce DeepScaleR-1.5B-Preview, trained on 40K high-quality math problems with 3,800 A100 GPU hours (roughly $4,500), which outperforms OpenAI’s o1-preview on multiple competition-level math benchmarks.

Introduction: Towards Democratizing RL for LLMs

The recent open-source release of DeepSeek-R1 (a model comparable to OpenAI’s o1) marks a significant leap forward in democratizing reasoning models. Yet its exact training recipe, hyperparameters, and underlying systems are still unavailable. In this work, we take a major step towards a fully open recipe that scales up RL for reasoning models.

One of the biggest challenges in scaling RL is the high computational cost. For instance, we found that directly replicating DeepSeek-R1’s experiments (≥32K context, ~8,000 steps) would take at least 70,000 A100 GPU hours, even for a 1.5B model. To address this, we start from a distilled model and introduce a novel iterative lengthening scheme for RL, reducing the compute requirement to just 3,800 A100 GPU hours (an 18.42× reduction) while still surpassing OpenAI’s o1-preview with a 1.5B model.
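The full lengthening scheme is covered in the training section below. As a rough illustration only, the schedule can be viewed as a step-indexed rule over the maximum response length: the 16K and 24K boundaries in this sketch come from Figure 1, while the initial 8K window is an assumption made for the example.

```python
# Illustrative sketch of an iterative lengthening schedule (not the actual
# training code). Boundaries at steps 1040 and 1520 follow Figure 1; the
# initial 8K window is an assumed placeholder.
def max_response_length(step: int, initial_len: int = 8192) -> int:
    """Maximum generation length allowed at a given RL training step."""
    if step < 1040:
        return initial_len   # stage 1: short rollouts keep RL cheap
    elif step < 1520:
        return 16 * 1024     # stage 2: extend context to 16K
    else:
        return 24 * 1024     # stage 3: extend context to 24K
```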

Our work demonstrates that developing customized reasoning models through RL can be both scalable and cost-efficient. In the rest of this blog post, we walk through our dataset curation and training approach, present evaluation results, and share key insights from our findings.

DeepScaleR’s Recipe

Dataset Curation

For our training dataset, we compiled AIME problems from 1984-2023 and AMC problems prior to 2023, along with questions from the Omni-MATH and Still datasets, which feature problems from various national and international math competitions.

Our data processing pipeline consists of three key steps:

  1. Extracting Answers: For datasets such as AMC and AIME, we use gemini-1.5-pro-002 to extract answers from official AoPS solutions.
  2. Removing Redundant Questions: We embed questions with sentence-transformers/all-MiniLM-L6-v2 and use similarity search to eliminate duplicate problems (see the sketch after this list). To prevent data contamination, we also check for overlaps between the training and test sets.
  3. Filtering Ungradable Questions: Some datasets, such as Omni-MATH, include problems whose answers cannot be evaluated with sympy and instead require an LLM judge. Since LLM judges can slow down training and introduce noisy reward signals, we apply an additional filtering step to remove these ungradable questions (a sketch of this check appears at the end of this subsection).
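Below is a hedged sketch of the deduplication step; the 0.95 similarity threshold and the greedy keep-first strategy are illustrative assumptions, not the exact settings of our pipeline.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def deduplicate(questions: list[str], threshold: float = 0.95) -> list[str]:
    """Drop near-duplicate questions via embedding similarity (illustrative)."""
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    # normalize_embeddings=True so cosine similarity reduces to a dot product
    embs = model.encode(questions, normalize_embeddings=True)
    kept_idx: list[int] = []
    for i, emb in enumerate(embs):
        # Skip a question if it is too similar to one we already kept.
        if kept_idx and float(np.max(embs[kept_idx] @ emb)) >= threshold:
            continue
        kept_idx.append(i)
    return [questions[i] for i in kept_idx]
```

The same similarity check can be run between training questions and benchmark questions to flag train/test overlap.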

After deduplication and filtering, our final training dataset consists of approximately 40,000 unique problem-answer pairs. We will expand our dataset for future runs.
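As an illustration of the gradability filter in step 3, the sketch below keeps only problems whose reference answers sympy can parse and compare symbolically; the exact parsing and matching logic in our pipeline may differ.

```python
from sympy import simplify
from sympy.parsing.latex import parse_latex  # requires antlr4-python3-runtime

def is_sympy_gradable(reference_answer: str) -> bool:
    """Return True if the reference answer can be parsed and checked with sympy."""
    try:
        expr = parse_latex(reference_answer)
        # A gradable answer should at least compare cleanly against itself.
        return simplify(expr - expr) == 0
    except Exception:
        # Free-form answers (e.g. proofs or verbal answers) fail parsing and
        # would need an LLM judge, so we filter them out.
        return False
```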