<aside> <img src="attachment:12185e20-3307-447e-a631-d5dbe9b0aca6:agentica-logo.jpeg" alt="attachment:12185e20-3307-447e-a631-d5dbe9b0aca6:agentica-logo.jpeg" width="40px" />

Agentica

Michael Luo*, Naman Jain*, Jaskirat Singh*, Sijun Tan*, Colin Cai*, Tarun Venkat, Manan Roongta, Li Erran Li, Raluca Ada Popa, Koushik Sen, Ion Stoica

</aside>

<aside> <img src="attachment:af64999e-4e9a-4481-9084-76eaffcc3f01:together-ai-icon-logo-png_seeklogo-611708.png" alt="attachment:af64999e-4e9a-4481-9084-76eaffcc3f01:together-ai-icon-logo-png_seeklogo-611708.png" width="40px" />

Together AI

Ameen Patel†, Qingyang Wu†, Alpay Ariyak†, Shang Zhu, Ben Athiwaratkun, Ce Zhang

</aside>

<aside> ✨

TL;DR

We introduce DeepSWE-Preview, a reasoning-enabled coding agent trained from Qwen3-32B with only reinforcement learning (RL). It achieves an impressive 59.0% on SWE-Bench-Verified with test-time scaling, reaching SOTA for open-weight coding agents (42.2% Pass@1, 71.0% Pass@16).

DeepSWE is trained using rLLM, our framework for post-training language agents. We’ve open-sourced everything (our dataset, code, and training and eval logs) so that everyone can make progress on scaling and improving agents with RL.

🌐 Website, 👨‍💻 Github, 🤗 HF Dataset, 🤗 HF Model, 📈 Wandb Logs, 🔎 Eval Logs

</aside>

*,†: Major Contributors

DeepSWE-Preview

Figure 1: SWE-Bench-Verified Performance vs. Model Size for LLM Agents. By training from scratch with only reinforcement learning (RL), DeepSWE-Preview with test time scaling (TTS) solves 59% of problems, beating all open-source agents by a large margin. We note that DeepSWE-Preview’s Pass@1 performance (42.2%, averaged over 16 runs) is one of the best for open-weights coding agents.

Figure 2: Validation Score for SWE-Bench-Hard, where an agent receives positive reward if it submits the final answer and passes all tests. With just 200 steps of RL training, SWE-Bench-Verified score increases from 23→42.2% (+20%) for Pass@1.
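The figures above report Pass@1 (averaged over 16 runs) and Pass@16. For reference, a common way to estimate Pass@k from n sampled rollouts per problem is the standard unbiased estimator sketched below; the exact estimator behind these numbers is not specified here, so treat this as an illustrative assumption rather than the evaluation code used for DeepSWE.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate given n sampled rollouts per problem,
    of which c are correct (i.e., pass all tests).

    This is the standard estimator popularized by the Codex paper;
    whether DeepSWE's reported numbers use exactly this formula is
    an assumption made for illustration.
    """
    if n - c < k:
        return 1.0  # at least one of any k samples must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)
```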

Recent months have seen tremendous progress in training reasoning-based large language models (LLMs) using reinforcement learning, including our recent works DeepScaleR [1] and DeepCoder [2]. However, scaling RL-based reasoning models to long-horizon, multi-step, agentic tasks remains a challenging and open problem.

Autonomous software engineering (SWE)—a domain involving complex tasks such as resolving GitHub issues, implementing new code features, and debugging—is one prominent example of such challenging multi-step scenarios. Real-world software engineering poses uniquely difficult demands, requiring agents to navigate extensive codebases, contextualize file interactions, apply targeted code edits, run shell commands for building and testing, and iteratively refine and verify solutions while resolving real-life pull requests.

In this blog, we fully democratize the training recipe for developing a 32B model into an intelligent coding agent. We introduce DeepSWE-Preview, a state-of-the-art open-source coding agent trained entirely from scratch atop Qwen/Qwen3-32B using only reinforcement learning. Trained on 4,500 real-world SWE tasks taken from the R2E-Gym training environments [3] over six days on 64 H100 GPUs, our model achieves state-of-the-art performance among open-source/open-weight models on the challenging SWE-Bench-Verified benchmark.

DeepSWE is trained with rLLM, our framework for post-training language agents. Check out rLLM’s blog post for more details.

1. Background

LLM Agents

Figure 3: LLM agents generate thought-guided actions, in the form of function or tool calls, to interact with an environment, which returns the next observation and reward. Over time, an LLM agent accumulates a trajectory, a cumulative sequence of observations, actions, and rewards.

In reinforcement learning (RL), agents are autonomous entities that perform actions and receive feedback from an environment in the form of new observations and rewards. Such environments are highly diverse, ranging from simpler settings like Atari games to more complex domains, including robotic control, software development in codebases, database management, and protein discovery.

Large language models (LLMs) serving as RL agents interact with their environments guided by internal representations built from previous observations and actions. Leveraging these representations, LLM-based agents invoke external tools or functions to carry out specific actions within their environments.
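To make the interaction loop in Figure 3 concrete, here is a minimal sketch of how a trajectory is accumulated. The `LLMAgent`/`Environment` interfaces (`agent.act`, `env.reset`, `env.step`) are hypothetical placeholders for illustration, not the rLLM API.

```python
from dataclasses import dataclass, field

# Hypothetical step/trajectory containers for illustration only.
@dataclass
class Step:
    observation: str   # what the agent saw at this step
    action: str        # the thought-guided tool/function call it issued
    reward: float      # feedback returned by the environment

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)

def rollout(agent, env, max_steps: int = 50) -> Trajectory:
    """Accumulate one trajectory: at each step the agent emits an action
    (e.g., a tool call) and the environment returns the next observation
    and a reward."""
    trajectory = Trajectory()
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.act(obs)                    # LLM produces a tool call
        next_obs, reward, done = env.step(action)  # environment responds
        trajectory.steps.append(Step(obs, action, reward))
        obs = next_obs
        if done:
            break
    return trajectory
```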

Software Engineering (SWE)

Figure 4: Overview of SWE-Agents. LLM agents are equipped with standard IDE tools (e.g., Bash commands, file search, file viewer/editor) to interact with a simulated software-engineering environment comprising a terminal and a project filesystem.

General software-engineering tasks—such as resolving a pull request—are formulated as reinforcement-learning environments (Figure 4). Given a pull request, an agent navigates a computer-based environment equipped with a terminal and a filesystem containing the corresponding codebase. Similar to how human developers work with IDEs (such as VSCode, Cursor, or IntelliJ), the agent is provided with a set of tools that includes bash execution, search, and a file viewer/editor. The agent may also be given an additional finish tool to call when it believes it has completed the task. To assign a reward in RL, the project’s automated test suite is run on the LLM’s modified code. Successful execution of all tests yields a positive reward (pull request resolved), while any test failure incurs zero reward.
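A sketch of how such a sparse outcome reward could be computed is shown below. This is illustrative only: it assumes a pytest-based project, and `test_cmd` and `timeout_s` are hypothetical defaults, not the actual R2E-Gym or SWE-Bench harness.

```python
import subprocess

def compute_reward(repo_dir: str,
                   test_cmd=("python", "-m", "pytest", "-q"),
                   timeout_s: int = 900) -> float:
    """Sparse outcome reward for a SWE episode: run the project's test suite
    on the agent-modified codebase and return 1.0 only if every test passes,
    otherwise 0.0. Illustrative sketch, not the actual harness code."""
    try:
        result = subprocess.run(list(test_cmd), cwd=repo_dir,
                                capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return 0.0  # treat timed-out test runs as failures
    return 1.0 if result.returncode == 0 else 0.0
```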