Your AI models are failing in production—Here’s how to fix model selection

Companies need to know whether the models powering their applications and agents work in real-world scenarios. That kind of evaluation can be complex, because it is hard to predict the specific scenarios a model will face. An updated version of the RewardBench benchmark aims to give organizations a better picture of a model’s real-world performance.
The Allen Institute for AI (Ai2) launched RewardBench 2, an updated version of its reward model benchmark, RewardBench, which it says offers a more holistic view of model performance and assesses how well models align with a company’s goals and standards.
Ai2 built RewardBench with classification tasks that measure correlations through inference-time compute and downstream training. RewardBench mainly concerns reward models (RMs), which can act as judges, evaluating LLM outputs. RMs assign a score, or “reward,” that guides reinforcement learning with human feedback (RLHF).
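To make that judging role concrete, here is a minimal sketch of a reward model scoring candidate responses. It assumes a sequence-classification-style RM that returns a single scalar; the checkpoint name is a placeholder, not a real model.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "your-org/your-reward-model"  # placeholder checkpoint, not a real model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=1)
model.eval()

def score(prompt: str, response: str) -> float:
    """Return the scalar reward the RM assigns to one prompt/response pair."""
    chat = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt")
    with torch.no_grad():
        # A single-label classification head yields one scalar logit: the reward.
        return model(input_ids).logits[0][0].item()

prompt = "Explain what a reward model does in one sentence."
candidates = [
    "A reward model scores LLM outputs so RL can optimize toward preferred ones.",
    "It is just another chatbot.",
]
# During RLHF, the higher-scoring response is the behavior that gets reinforced.
best = max(candidates, key=lambda c: score(prompt, c))
print("preferred response:", best)
```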
Nathan Lambert, a senior research scientist at Ai2, told VentureBeat that the first RewardBench worked as intended when it launched. But the model environment evolved quickly, and the benchmarks needed to keep up.
“As reward models became more advanced and use cases became more nuanced, we quickly recognized with the community that the first version did not fully capture the complexity of real human preferences,” he said.
Lambert added that with RewardBench 2, “we wanted to improve both the breadth and depth of the evaluation – more diverse, challenging prompts, and a refined methodology that better reflects how people assess AI in practice.” He said the second version uses unseen human prompts, a more challenging scoring setup and new domains.
Using evaluations to evaluate the evaluators
While reward models test how well models perform, it is also important that RMs align with company values; otherwise, the fine-tuning and reinforcement learning process can reinforce bad behavior such as hallucinations, reduce generalization and score harmful responses too highly.
RewardBench 2 covers six domains: factuality, precise instruction following, math, safety, focus and ties.
“Enterprises should use RewardBench 2 in two different ways, depending on their application. If they’re performing RLHF themselves, they should adopt the best practices and datasets from leading models in their own pipelines, because reward models need on-policy training recipes (i.e., reward models that mirror the model they’re trying to train with RL),” Lambert said.
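The other common deployment, noted earlier in the context of inference-time compute, is using the RM to pick the best of several sampled completions or to gate which pairs enter a training set. A minimal sketch, reusing the hypothetical score() helper from the snippet above; generate_candidates stands in for whatever generation API is actually in use.

```python
# Best-of-n sampling (inference-time scaling): draw n completions from the
# policy model and keep the one the reward model scores highest.
# `generate_candidates` is a hypothetical stand-in for a real generation API;
# `score` is the reward-model helper sketched earlier.
def best_of_n(prompt: str, generate_candidates, n: int = 8) -> str:
    candidates = generate_candidates(prompt, n)  # n sampled completions
    return max(candidates, key=lambda c: score(prompt, c))

# The same idea doubles as a data filter: only pairs whose reward clears a
# threshold make it into a fine-tuning dataset.
def filter_pairs(pairs: list[tuple[str, str]], threshold: float) -> list[tuple[str, str]]:
    return [(p, r) for p, r in pairs if score(p, r) >= threshold]
```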
Lambert noted that benchmarks such as RewardBench give users a way to evaluate the models they choose based on the “dimensions that are most important, rather than trusting a narrow one-size-fits-all score.” He said the notion of “performance” that many evaluation methods claim to assess is highly subjective, because a good response from a model depends heavily on the user’s context and goals. Human preferences, meanwhile, are deeply nuanced.
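One way to act on that advice is to re-weight per-domain benchmark results to match what an application actually needs. A hypothetical sketch: the model names, scores and weights below are invented for illustration; only the six domain names come from RewardBench 2.

```python
# Re-rank candidate reward models by the RewardBench 2 domains that matter
# for a given deployment, instead of a single aggregate score.
DOMAINS = ["factuality", "precise_if", "math", "safety", "focus", "ties"]

# Invented per-domain scores for two made-up reward models.
scores = {
    "rm-alpha": dict(zip(DOMAINS, [0.81, 0.70, 0.66, 0.93, 0.77, 0.58])),
    "rm-beta":  dict(zip(DOMAINS, [0.88, 0.75, 0.72, 0.79, 0.71, 0.64])),
}

# A safety-sensitive application might weight safety and focus heavily.
weights = {"factuality": 0.20, "precise_if": 0.10, "math": 0.05,
           "safety": 0.40, "focus": 0.20, "ties": 0.05}

def weighted_score(model: str) -> float:
    return sum(weights[d] * scores[model][d] for d in DOMAINS)

best = max(scores, key=weighted_score)
print(best, round(weighted_score(best), 3))  # rm-alpha wins under these weights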
Ai2 released the first version of RewardBench in March 2024. At the time, the company said it was the first benchmark and leaderboard for reward models. Since then, several methods for benchmarking and improving RMs have emerged. Researchers at Meta’s FAIR came out with reWordBench. DeepSeek released a new technique called Self-Principled Critique Tuning for smarter and scalable RMs.
How the models performed
Since RewardBench 2 is an updated version of RewardBench, Ai2 tested both existing and newly trained models to see whether they continue to rank highly. These included a variety of models, such as versions of Gemini, Claude, GPT-4.1 and Llama-3.1, along with datasets and models such as Qwen, Skywork and its own Tulu.
The company found that larger reward models perform best on the benchmark because their base models are stronger. Overall, the strongest-performing models are variants of Llama-3.1 Instruct. In terms of focus and safety, Skywork data is “particularly useful,” and Tulu did well on factuality.
Ai2 said that while it believes RewardBench 2 “is a step forward in broad, multi-domain accuracy-based evaluation” for reward models, it cautioned that model evaluation should be used mainly as a guide for choosing the models that work best for a company’s needs.