A new AI coding challenge just published its first results – and they aren’t pretty

A new AI coding challenge has unveiled its first winner, and set a new bar for AI-driven software engineers.
On Wednesday at 5 pm PST, the nonprofit Laude Institute announced the first winner of the K Prize, a multi-round AI coding challenge launched by Databricks co-founder Andy Konwinski. The winner was a Brazilian prompt engineer named Eduardo Rocha de Andrade, who takes home $50,000. But more surprising than the victory itself was his final score: he won by answering just 7.5% of the questions on the test correctly.
“We are happy we built a benchmark that is actually hard,” said Konwinski. “Benchmarks should be hard if they are going to matter,” he continued, adding: “Scores would be different if the big labs had entered their largest models. But that is kind of the point. The K Prize runs offline with limited compute, so it favors smaller and open models. I think it’s great.”
Konwinski has promised $1 million to the first open-source model that can score higher than 90% on the test.
Like the well-known SWE-Bench system, the K Prize tests models against flagged GitHub issues to gauge how well they can handle real-world programming problems. But while SWE-Bench is based on a fixed set of problems that models can train against, the K Prize is designed as a “contamination-free version of SWE-Bench,” using a timed entry system to guard against any benchmark-specific training. For the first round, model submissions were due by March 12th. The K Prize organizers then built the test using only GitHub issues flagged after that date.
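To make the mechanism concrete, here is a minimal sketch of how a timed entry system like this can screen out contamination: entrants freeze their models by a cutoff date, and the test set admits only issues flagged after it, so none of those problems could have appeared in training data. This is not the K Prize’s actual code; the `flagged_at` field, the helper name, and the year on the cutoff are illustrative assumptions.

```python
from datetime import datetime, timezone
from typing import Iterable

def build_test_set(issues: Iterable[dict], cutoff: datetime) -> list[dict]:
    """Keep only issues flagged after the model-submission cutoff."""
    # `flagged_at` is an illustrative field name, not the K Prize's schema.
    return [issue for issue in issues if issue["flagged_at"] > cutoff]

# Example: a March 12 cutoff (year assumed for illustration).
cutoff = datetime(2025, 3, 12, tzinfo=timezone.utc)
fresh = build_test_set(
    [
        {"id": 1, "flagged_at": datetime(2025, 2, 1, tzinfo=timezone.utc)},
        {"id": 2, "flagged_at": datetime(2025, 4, 2, tzinfo=timezone.utc)},
    ],
    cutoff,
)
print([issue["id"] for issue in fresh])  # -> [2]
```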
The 7.5% top score stands in stark contrast to SWE-Bench itself, which currently shows a 75% top score on its easier “Verified” test and 34% on its harder “Full” test. Konwinski still isn’t sure whether the disparity is due to contamination on SWE-Bench or simply the challenge of collecting fresh issues from GitHub, but he expects the K Prize project to answer the question soon.
“As we get more runs of the thing, we will have a better sense,” he said, “because we expect people to adapt to the dynamics of competing on this every few months.”
It might seem like a strange place to be falling short, given the wide range of AI coding tools already publicly available. But with benchmarks becoming too easy, many critics see projects like the K Prize as a necessary step toward solving AI’s growing evaluation problem.
“I am quite bullish about building new tests for existing benchmarks,” says Princeton researcher Sayash Kapoor, who put forward a similar idea in a recent paper. “Without such experiments, we can’t actually tell whether the issue is contamination, or even just targeting the SWE-Bench leaderboard with a human in the loop.”
For Konwinski, it is not just a better benchmark but an open challenge to the rest of the industry. “If you listen to the hype, it’s like we should be seeing AI doctors and AI lawyers and AI software engineers, and that’s just not true,” he says. “If we can’t even get more than 10% on a contamination-free SWE-Bench, that’s the reality check for me.”