
Nous Research just released Nomos 1, an open-source AI that ranks second on the notoriously brutal Putnam math exam

Nous Research, the San Francisco-based artificial intelligence startup, on Tuesday released an open-source mathematical reasoning system called Nomos 1 that achieved near-elite human performance on this year’s William Lowell Putnam Mathematical Competition, one of the most prestigious and notoriously difficult math competitions in the world.

The Putnam is known for its difficulty: while a perfect score is 120, this year’s top score was 90 and the median was just 2. Nomos 1, on the other hand, scored 87 points – a result that the company says would have ranked second out of 3,988 participants in the 2024 competition.

The release marks a turning point in the rapidly intensifying race to build AI systems capable of advanced mathematical reasoning. Unlike the massive, compute-intensive models deployed by big tech companies, Nomos 1 achieves its results with a relatively compact architecture: 30 billion parameters, of which roughly 3 billion are active at any time, using a mixture-of-experts design based on Alibaba’s Qwen3 model.
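The mixture-of-experts idea behind that "30 billion total, 3 billion active" figure can be illustrated with a toy routing layer: a router scores a set of expert weight matrices and only the top few are evaluated per token. This is a minimal sketch for intuition only, not Nomos 1's actual architecture; every name and size here is hypothetical.

```python
import numpy as np

def moe_forward(x, experts, router_w, top_k=2):
    """Toy mixture-of-experts layer: route the input to its top_k
    experts, so only a fraction of the parameters is active."""
    logits = x @ router_w                 # one routing score per expert
    top = np.argsort(logits)[-top_k:]     # indices of the chosen experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                  # softmax over the chosen experts only
    # Only the selected experts' weight matrices are touched.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router_w = rng.normal(size=(d, n_experts))
x = rng.normal(size=d)

y = moe_forward(x, experts, router_w, top_k=2)
# 2 of 16 experts run per input, a sparsity in the same spirit as the
# 3B-active-of-30B ratio the article cites for Nomos 1.
print(y.shape)  # (8,)
```

The point of the design is that total capacity (all experts' parameters) grows without a matching growth in per-token compute, which is what lets a 30B-parameter model behave, cost-wise, like a much smaller one.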

“This score would be #2/3988 in 2024 and marks our first step with Hillclimb AI towards creating a SOTA AI mathematician,” Nous Research announced on social media Tuesday.

The same base model scored just 24 points without Nous Research’s specialized training

Perhaps most striking is the gap between Nomos 1 and its base model. When Nous Research ran the same Qwen3-30B-A3B-Thinking-2507 model through an identical test harness, it scored just 24 out of 120 – a result that underscores the critical importance of post-training optimization and specialized reasoning techniques beyond raw model scale.

“Nomos 1 achieved an 87/120 with 8 perfect scores,” the company stated, noting that the performance difference is “largely due to the post-training and data quality and not the harness.”

The results were verified through blind judging by a human expert who had previously finished in the top 200 on the Putnam. Nous Research provided the anonymized submissions to the reviewer and then published the full set of de-anonymized files and the runbooks used to generate them on GitHub.

Why the Putnam Competition is considered the ultimate test of mathematical reasoning

The William Lowell Putnam Mathematical Competition is an annual mathematics competition for students enrolled in higher education institutions in the United States and Canada. It is widely regarded as the most prestigious university-level mathematics competition in the world.

The notoriously brutal William Lowell Putnam Mathematical Competition is more of a mathematical sporting event than an academic test. The exam consists of two 3-hour sessions with a 2-hour break in between. There are a total of 12 questions to be solved, 6 for each session. Each question is worth 10 points, for a total of 120 points.


Putnam questions are not the type found in regular exams or textbooks. They are more like puzzles than calculations, where students often have to find different ways of representing things before a solution can emerge.

Last year, nearly 4,000 students across the continent wrote the Putnam. Sixty-one percent scored three points or less, according to the Mathematical Association of America, which organizes the competition. The highest score was 90 out of 120.

Many Putnam Fellows have become leading researchers in mathematics and other fields, including three Fields Medalists – John Milnor, David Mumford and Daniel Quillen – and two Nobel laureates in physics – Richard Feynman and Kenneth Wilson.

Inside the two-phase reasoning system that powers Nomos 1’s mathematical breakthroughs

Nomos 1 is a specialization of Qwen’s Qwen3-30B-A3B thinking model, optimized for solving math problems and writing proofs in natural language. The system was developed in collaboration with Hillclimb AI.

What sets Nomos 1 apart from simple model inference is its advanced reasoning harness: an open-source framework that orchestrates how the model approaches and solves problems. The harness operates in two distinct phases within a three-hour time limit, mirroring the Putnam’s actual competition structure.

In the solution phase, parallel workers tackle problems simultaneously using a priority-based system. Each worker picks a problem, generates a submission, and then scores its own work on a scale of 1 to 7. Problems with the fewest perfect self-scores are prioritized so the system can focus its compute on the toughest challenges. This process continues until every problem reaches a set number of self-judged perfect scores, or until time runs out.
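The scheduling loop described above can be sketched roughly as follows. This is a simplified, single-threaded sketch under assumptions: the real harness runs workers in parallel, and `attempt` and `self_score` here are random placeholders standing in for model calls.

```python
import random

TARGET_PERFECTS = 3   # self-judged perfect scores needed per problem
MAX_STEPS = 200       # stand-in for the wall-clock time limit

def attempt(problem):
    """Placeholder for the model generating a submission."""
    return f"submission for {problem}"

def self_score(problem, submission):
    """Placeholder for the model grading its own work on a 1-7 scale."""
    return random.randint(1, 7)

problems = [f"P{i}" for i in range(1, 13)]       # 12 Putnam problems
perfects = {p: 0 for p in problems}
submissions = {p: [] for p in problems}

for _ in range(MAX_STEPS):
    if all(perfects[p] >= TARGET_PERFECTS for p in problems):
        break                                    # every problem is "done"
    # Prioritize the problem with the fewest self-judged perfect scores.
    p = min(problems, key=lambda q: perfects[q])
    sub = attempt(p)
    score = self_score(p, sub)
    submissions[p].append((score, sub))
    if score == 7:
        perfects[p] += 1
```

The effect of the `min(...)` selection is that compute naturally flows toward whichever problem is lagging, rather than being split evenly across all twelve.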

The finalization phase starts 15 minutes before the time limit (or 50% for shorter runs) and uses a two-stage selection process. First, a consolidation step groups the entries by conclusion and attempts to identify the correct group – importantly, not necessarily the majority group. Then, a single-elimination pairwise tournament determines the final entry for each problem.
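In code, that two-stage selection might look roughly like this. It is a sketch under assumptions: `conclusion_of` and `judge_pair` are placeholders for model calls that extract a submission's final answer and compare two submissions head-to-head, and the group-selection rule shown (largest group) is deliberately naive, since the article notes the real harness does not simply take the majority.

```python
from collections import defaultdict

def conclusion_of(submission):
    """Placeholder: extract the submission's stated conclusion."""
    return submission.split(":")[-1].strip()

def judge_pair(a, b):
    """Placeholder for a model-judged head-to-head; returns the winner."""
    return a if len(a) >= len(b) else b

def finalize(submissions):
    # 1. Consolidation: group submissions by their stated conclusion.
    groups = defaultdict(list)
    for sub in submissions:
        groups[conclusion_of(sub)].append(sub)
    best_group = max(groups.values(), key=len)   # naive: pick the largest group
    # 2. Single-elimination pairwise tournament within the chosen group.
    pool = list(best_group)
    while len(pool) > 1:
        winners = [judge_pair(pool[i], pool[i + 1])
                   for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2 == 1:                   # odd one out gets a bye
            winners.append(pool[-1])
        pool = winners
    return pool[0]

subs = ["proof A: answer is 4", "proof B: answer is 4", "proof C: answer is 7"]
final = finalize(subs)
```

Grouping by conclusion first means the tournament only has to rank proofs of the same claim, rather than adjudicate between contradictory answers pair by pair.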

“Our open source reasoning system consists of a solution phase, in which workers attempt a least-solved problem and rate themselves, followed by a finalization phase, in which the submissions are merged to choose a final submission for each problem,” Nous Research explained.

How Nomos 1 compares to mathematical AI systems from DeepSeek, Google and OpenAI

The Nomos 1 results come amid a wave of advances in AI for mathematical reasoning. DeepSeek’s model, DeepSeekMath-V2, scored 118 out of 120 points on questions from the 2024 William Lowell Putnam Mathematical Competition, beating the highest human score of 90. The model also performed at the level of gold medalists at the International Mathematical Olympiad.


This year, Google’s advanced Gemini model worked end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions – all within the 4.5-hour competition time limit. Google achieved this year’s result using an advanced version of Gemini Deep Think.

What makes Nomos 1’s performance remarkable isn’t its raw score (it lags behind DeepSeek’s 118/120), but rather its accessibility and efficiency. With 30 billion parameters, of which only 3 billion are active, the model can run on consumer-grade hardware – a stark contrast to the massive computing clusters required for frontier models from OpenAI and Google.

Hermes 4.3 arrived just six days earlier, trained on a decentralized blockchain network

The announcement of Nomos 1 follows closely on the heels of Nous Research’s December 3 release of Hermes 4.3, a general-purpose language model that marked another major milestone for the company.

Hermes 4.3, based on ByteDance’s Seed-OSS-36B base model, is the first production model that Nous Research has fully trained on the Psyche Network – a distributed training infrastructure that uses a new optimizer called DisTrO to coordinate training across nodes spread over data centers on the open internet, secured by consensus on the Solana blockchain.

The company trained Hermes 4.3 both through traditional centralized methods and through the Psyche Network, specifically to verify that distributed training can meet or exceed centralized performance for production workloads. The Psyche-trained version outperformed the centralized version on a range of downstream tasks, the company reported.

“The training run proved to be stable throughout, with an average of 144,000 tokens/second across 24 Psyche nodes,” according to Nous Research. “Using DisTrO’s overlapping collective strategy, the entirety of P2P communications was hidden by training time, effectively achieving equivalent throughput to traditional, centralized training.”

Hermes 4.3 also achieved state-of-the-art results on RefusalBench, a new benchmark that measures a model’s willingness to be helpful in a variety of scenarios typically limited by other models. The model answered 74.60% of RefusalBench questions in non-reasoning mode, surpassing its predecessor Hermes 4 70B (59.50%) and outperforming closed models including Grok 4 (51.30%) and Gemini 2.5 Pro (24.23%).

Small models with smart training close the gap with trillion-parameter giants

Together, the two releases in one week signal Nous Research’s strategic bet: that smaller, more efficient models with advanced post-training techniques and reasoning harnesses can compete with — and in some cases outperform — the massive models developed by better-funded competitors.


The implications for enterprise decision-makers are significant. Mathematical reasoning skills have applications far beyond academic competitions: they are essential for formal verification, theorem proving, scientific modeling, cryptographic analysis, and any domain that requires rigorous logical deduction.

The open-source nature of both releases – Nomos 1 is available under the Apache 2.0 license on Hugging Face, with the complete reasoning harness on GitHub – means organizations can deploy these capabilities on their own infrastructure without relying on API calls to major cloud providers.

“For the first time, anyone can run or access a state-of-the-art AI mathematician,” one observer noted on social media. “This lowers the barrier to serious mathematical research, proof verification, complex systems modeling and advanced reasoning.”

Key contributors to Nomos 1 include Roger Jin, who led the training; Jeffrey Quesnelle and Dakota Mahan, who built the infrastructure; Chen Guang, who advised; and Ryan Teknium and Jeffrey Quesnelle, who led the project. The model was developed with contributions from Hillclimb AI and a team of mathematical experts, including Samuel Kim, Miron Yurkevich and others.

The race to build AI mathematicians is accelerating faster than anyone predicted

The 86th Putnam Competition took place on Saturday, December 6, 2025 – just three days before Nous Research released Nomos 1. The timing underlines how quickly the field is evolving: Companies are now releasing mathematical AI systems capable of near-elite human performance within days of the competitions they were designed for.

Competition in the field of mathematical AI has increased dramatically in recent months. In July, an advanced version of Google DeepMind’s Gemini model and an experimental reasoning model from OpenAI both achieved gold-medal status at the 2025 International Mathematical Olympiad. DeepSeek’s new model matched their performance, solving 5 out of 6 problems.

But the resources required for these frontier systems remain out of reach for most organizations. OpenAI’s o1-pro is estimated to have over 1.8 trillion parameters; Google’s Gemini 2.5 Pro likely exceeds 400 billion. Nomos 1, by contrast, achieves competitive results with a fraction of that footprint.

The gap between massive frontier models and efficient open source alternatives is closing. And for organizations that need mathematical reasoning skills without the budget for hyperscale computing, that gap may have closed just enough to matter.

As one observer put it on social media: “This marks a significant leap for AI math models small enough to run on your laptop.”

A laptop that now outperforms almost 4,000 of the continent’s top undergraduate mathematicians.

