Agentica x Together AI

Through a joint collaboration between the Agentica team and Together AI, we release DeepCoder-14B-Preview, a code reasoning model finetuned from DeepSeek-R1-Distill-Qwen-14B via distributed RL. It achieves an impressive 60.6% Pass@1 accuracy on LiveCodeBench (+8% over the base distilled model), matching the performance of o3-mini-2025-01-31 (Low) and o1-2024-12-17 with just 14B parameters. We’ve open-sourced our dataset, code, training logs, and systems optimizations so that everyone can make progress on scaling and accelerating intelligence with RL.

🌐 Website, 👨‍💻 Github, 🤗 HF Model, 🤗 HF Dataset, 📈 Wandb, 🔎 Eval Logs

Agentica

Michael Luo*, Sijun Tan*, Roy Huang*, Xiaoxiang Shi, Rachel Xin, Colin Cai, Li Erran Li, Raluca Ada Popa, Ion Stoica

Together AI

Ameen Patel*, Alpay Ariyak*, Qingyang Wu*, Maurice Weber, Ce Zhang

*Project leads

DeepCoder-14B-Preview

| Model | LCB Pass@1 (8/1/24-2/1/25) | Codeforces Rating | Codeforces Percentile |
| --- | --- | --- | --- |
| DeepCoder-14B-Preview | 60.6 | 1936 | 95.3 |
| DeepSeek-R1-Distill-Qwen-14B | 53.0 | 1791 | 92.7 |
| o3-mini-2025-01-31 (Low) | 60.9 | 1918 | 94.9 |
| o1-2024-12-17 (Low) | 59.5 | 1991 | 96.1 |

Figure 1: DeepCoder’s LiveCodeBench (LCB) score as training progresses. At step 180, context length is extended to 32K. The best 32K checkpoint is used for inference-time scaling to 64K, achieving 60.6% LCB—matching o3-mini’s performance.

In recent months, we’ve witnessed remarkable advances in scaling reasoning models for math domains (e.g., DeepScaleR, AReaL, Light-R1, DAPO) via reinforcement learning. However, progress in the coding domain has lagged behind, largely due to the challenge of constructing high-quality datasets with reliable, verifiable rewards.

In this blog, we democratize the recipe for training a small model into a strong competitive coder, on par with o3-mini, using reinforcement learning. We introduce DeepCoder-14B-Preview, trained on 24K verifiable coding problems over 2.5 weeks on 32 H100s, which reaches and on some benchmarks surpasses OpenAI’s o3-mini. In addition, we open-source verl-pipe, an extension to the verl post-training system featuring several system optimizations that accelerate end-to-end training by 2x.

Dataset Curation

Prior work in the math domain has shown that reinforcement learning with verifiable rewards can significantly enhance a model’s reasoning capabilities. However, unlike math—where abundant high-quality, verifiable data is readily available on the internet—the coding domain suffers from a relative scarcity of such data.

In our early experiments, we evaluated several popular coding datasets, including APPS, TACO, CodeContests, KodCode, and LeetCode. Some proved too easy for our model (e.g., KodCode, LeetCode), while others were noisy or contained unverifiable problems with flawed or missing test cases. These issues produce null or misleading reward signals that ultimately destabilize RL training.
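To make this failure mode concrete, here is a minimal sketch (our own illustration, not code from the repo) of the sparse outcome reward typically used for code RL. A problem that ships with an empty test list vacuously earns full reward for any generation, which is exactly the kind of misleading signal described above:

```python
def sparse_outcome_reward(test_results: list[bool]) -> float:
    """Binary outcome reward: 1.0 iff every unit test passed."""
    return 1.0 if all(test_results) else 0.0

print(sparse_outcome_reward([True, True, True]))   # 1.0 -- genuinely correct solution
print(sparse_outcome_reward([True, False, True]))  # 0.0 -- failed a test
print(sparse_outcome_reward([]))                   # 1.0 -- problem has no tests at all:
                                                   # any output earns full reward
```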

To overcome these limitations, we curated a high-quality training set consisting of:

  - Verified problems from TACO
  - Verified problems from PrimeIntellect’s SYNTHETIC-1 dataset
  - LiveCodeBench v5 problems (5/1/23-7/31/24)

To ensure data quality for effective RL training, we implemented a rigorous filtering pipeline:

  1. Programmatic Verification: Every problem is automatically verified using an external, official solution. We filter our datasets to include only those problems whose official solutions pass all unit tests. This process is automated in tests/rewards/test_code_batch.py.
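Below is a minimal sketch of such a verification filter. It assumes each problem record is a dict with hypothetical "solution" and "tests" fields, where the official solution is a stdin/stdout Python program; the actual implementation in tests/rewards/test_code_batch.py may differ.

```python
import subprocess

def passes_all_tests(solution: str, tests: list[dict], timeout_s: float = 10.0) -> bool:
    """Run a problem's official solution against every one of its unit tests."""
    for case in tests:
        try:
            result = subprocess.run(
                ["python3", "-c", solution],  # assumed: a stdin/stdout Python solution
                input=case["input"],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # a hanging official solution marks the problem unverifiable
        if result.returncode != 0 or result.stdout.strip() != case["output"].strip():
            return False
    return True

def filter_verified(problems: list[dict]) -> list[dict]:
    """Keep only problems with a non-empty test suite that the official solution fully passes."""
    return [p for p in problems if p["tests"] and passes_all_tests(p["solution"], p["tests"])]
```

Note the explicit p["tests"] check: it also discards problems whose test lists are empty, closing the vacuous-reward loophole sketched earlier.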