DeepCoder delivers top coding performance in efficient 14B open model

Researchers at Together AI and Agentica have released DeepCoder-14B, a new coding model that delivers impressive performance comparable to leading proprietary models such as OpenAI's o3-mini.
Built on top of DeepSeek-R1, the model makes it easier to integrate high-performance code generation and reasoning capabilities into real-world applications. Importantly, the teams have fully open-sourced the model, its training data, code, logs and system optimizations, which can help researchers improve their work and accelerate progress.
Competitive coding performance in a smaller package
The research team's experiments show that DeepCoder-14B performs strongly across several challenging coding benchmarks, including LiveCodeBench (LCB), Codeforces and HumanEval+.
"Our model demonstrates strong performance across all coding benchmarks … comparable to the performance of o3-mini (low) and o1," the researchers write in a blog post describing the model.
Interestingly, despite being trained mainly on coding tasks, the model shows improved mathematical reasoning, scoring 73.8% on the AIME 2024 benchmark, a 4.1% improvement over its base model (DeepSeek-R1-Distill-Qwen-14B). This suggests that reasoning skills developed through RL on code can generalize effectively to other domains.

The most striking aspect is that it achieves this level of performance with only 14 billion parameters. That makes DeepCoder considerably smaller and potentially more efficient to run than many frontier models.
Innovations driving DeepCoder's performance
While developing the model, the researchers addressed some of the key challenges in training coding models with reinforcement learning (RL).
The first challenge was curating the training data. Reinforcement learning requires reliable reward signals that indicate when the model's output is correct. As the researchers point out, "Unlike math, where abundant high-quality, verifiable data is readily available on the Internet, the coding domain suffers from a relative scarcity of such data."
To address this problem, the DeepCoder team implemented a strict pipeline that gathers examples from different datasets and filters them for validity, complexity and duplication. This process yielded 24,000 high-quality problems, providing a solid foundation for effective RL training.
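To give a rough sense of what such curation involves, here is a minimal sketch of deduplication and validity filtering, assuming each problem is a dict with hypothetical 'statement', 'solution' and 'tests' fields (this is an illustration, not the team's actual pipeline):

import hashlib

def dedupe_and_filter(problems, min_tests=5):
    """Keep problems that are unique, verifiable and testable.

    Each `problem` is assumed to be a dict with hypothetical fields:
    'statement', 'solution' and 'tests' (a list of unit tests).
    """
    seen = set()
    kept = []
    for p in problems:
        # Deduplicate on a hash of the normalized problem statement.
        key = hashlib.sha256(p["statement"].strip().lower().encode()).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        # Validity: require a reference solution and enough tests to verify it.
        if not p.get("solution") or len(p.get("tests", [])) < min_tests:
            continue
        kept.append(p)
    return kept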
The team also designed a simple reward function that provides a positive signal only if the generated code passes all sampled unit tests for the problem within a specific time limit. Combined with the high-quality training examples, this outcome-focused reward system prevents the model from learning tricks such as printing memorized answers for public tests or optimizing for simple edge cases without solving the core problem.
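A minimal sketch of such a sparse, all-or-nothing reward, assuming each test is an executable assertion string and that candidate code is run in a throwaway process with a timeout (the helpers and their signatures are assumptions, not the team's actual implementation):

import multiprocessing

def run_one_test(code, test, queue):
    """Execute the candidate code and one unit test in a throwaway namespace."""
    try:
        namespace = {}
        exec(code, namespace)   # define the candidate solution
        exec(test, namespace)   # assertion-style test; raises on failure
        queue.put(True)
    except Exception:
        queue.put(False)

def sparse_reward(code, sampled_tests, time_limit_s=6.0):
    """Return 1.0 only if *every* sampled test passes within the time limit, else 0.0."""
    for test in sampled_tests:
        queue = multiprocessing.Queue()
        proc = multiprocessing.Process(target=run_one_test, args=(code, test, queue))
        proc.start()
        proc.join(time_limit_s)
        if proc.is_alive():      # timed out: kill the worker and fail the sample
            proc.terminate()
            return 0.0
        if not queue.get():
            return 0.0
    return 1.0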
The model's core training algorithm is based on Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that proved very successful in DeepSeek-R1. However, the team made several modifications to the algorithm to make it more stable and to let the model keep improving as training is extended.
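The core idea behind GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt, rather than against a learned value function. A minimal sketch of that group-relative advantage computation (the DeepCoder team's specific stability modifications are not reproduced here):

import numpy as np

def group_relative_advantages(rewards):
    """Compute GRPO-style advantages for one prompt.

    `rewards` holds the scalar reward of each of the G responses sampled
    for the same prompt; each advantage is the reward normalized by the
    group's mean and standard deviation.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    baseline = rewards.mean()
    scale = rewards.std() + 1e-8   # avoid division by zero when all rewards match
    return (rewards - baseline) / scale

# Example: four responses to one coding prompt, only the first passed all tests.
print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))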

Finally, the team expanded the model's context window iteratively, first training on shorter reasoning sequences and gradually increasing the length. They also developed a filtering method to avoid penalizing the model when it produces reasoning chains that exceed the context limit while solving a hard prompt.

The researchers explain the core idea: "To preserve long-context reasoning while enabling efficient training, we incorporated overlong filtering … This technique masks out truncated sequences during training so that models aren't penalized for generating thoughtful but lengthy outputs that exceed the current context limit."
The training was gradually scaled from a 16k to a 32k context window, and the resulting model could also solve problems that required up to 64k tokens.
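In loss terms, overlong filtering amounts to zeroing out the contribution of sequences that were cut off at the context limit. A minimal sketch, assuming each sampled sequence carries a flag marking whether it was truncated (the names are illustrative, not the team's code):

import torch

def masked_policy_loss(per_token_loss, truncated):
    """Zero out the contribution of truncated sequences.

    per_token_loss: (batch, seq_len) policy-gradient loss per token.
    truncated:      (batch,) bool tensor, True if the response hit the
                    context limit before finishing.
    """
    keep = (~truncated).float().unsqueeze(-1)   # (batch, 1)
    masked = per_token_loss * keep              # truncated rows contribute 0
    # Normalize by the number of tokens that actually count.
    denom = keep.expand_as(per_token_loss).sum().clamp(min=1.0)
    return masked.sum() / denom

# Example: the second sequence in the batch was cut off at the context limit.
loss = masked_policy_loss(torch.rand(2, 8), torch.tensor([False, True]))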
Optimizing long-context RL training
Training large models with RL, especially on tasks that require long generated sequences such as coding or complex reasoning, is computationally intensive and slow. A major bottleneck is the sampling step, in which the model potentially generates thousands of tokens per example in the batch. Variations in response length mean that some responses finish much later than others, leaving GPUs idle and slowing down the entire training loop.
To speed this up, the team developed verl-pipeline, an optimized extension of the open-source verl library for reinforcement learning from human feedback (RLHF). The key innovation, which they call "one-off pipelining", rearranges response sampling and model updates to reduce bottlenecks and accelerator idle time.
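A minimal sketch of the general idea of overlapping sampling with training, where the policy update for the current step runs while rollouts for the next step are already being generated; this illustrates pipelining in general, not verl-pipeline's actual implementation:

from concurrent.futures import ThreadPoolExecutor

def sample_batch(step):
    """Stand-in for rollout generation on inference GPUs."""
    return f"rollouts_{step}"

def train_on(batch):
    """Stand-in for a policy update on training GPUs."""
    print(f"updating policy on {batch}")

def pipelined_rl(num_steps):
    """Overlap sampling for step t+1 with training on step t's rollouts."""
    with ThreadPoolExecutor(max_workers=1) as sampler:
        future = sampler.submit(sample_batch, 0)   # warm-up rollout
        for step in range(num_steps):
            batch = future.result()                # rollouts for this step
            if step + 1 < num_steps:
                # Kick off the next round of sampling before training starts,
                # so accelerators spend less time idle.
                future = sampler.submit(sample_batch, step + 1)
            train_on(batch)

pipelined_rl(3)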

Their experiments showed that one-off pipelining delivered up to a 2x speedup for coding RL tasks compared to baseline implementations. This optimization was crucial for training DeepCoder within a reasonable time frame (2.5 weeks on 32 H100s) and is now open-sourced as part of verl-pipeline for the community to use and build on.
Enterprise Impact
The researchers have made all artifacts for training and running DeepCoder-14B available on GitHub and Hugging Face under a permissive license.
“By fully sharing our data set, code and training recipe, we enable the community to reproduce our work and make RL training accessible to everyone,” the researchers write.
DeepCoder-14B powerfully illustrates a broader, accelerating trend in the AI landscape: the rise of highly capable yet efficient and openly accessible models.
For the enterprise world, this shift means more options and greater accessibility of advanced models. Cutting-edge performance is no longer solely the domain of hyperscalers or those willing to pay premium API fees. Models such as DeepCoder can enable organizations of all sizes to use sophisticated code generation and reasoning, customize solutions to their specific needs and deploy them securely in their environments.
This trend can lower the barrier to entry for AI adoption and foster a more competitive and innovative ecosystem, where progress is driven by open-source collaboration.