OpenAI’s GPT-4.1 may be less aligned than the company’s previous AI models

April 23, 2025

0 2 minutes read

In mid-April, OpenAi launched a powerful new AI model, GPT-4.1, which the company claimed “excelled” at the following instructions. But the results of different independent tests suggest that the model is less aligned – that is, less reliable – than previous OpenAi releases.

When OpenAI is launching a new model, it usually publishes a detailed technical report with the results of safety evaluations of the first and third parties. The company has skipped that step for GPT-4.1 and claims that the model is not “limit” and therefore does not justify a separate report.

Some researchers and developer-ertoe has encouraged this to investigate whether GPT-4.1 behaves less desirable than GPT-4O, its predecessor.

According to Oxford AI research scientist Owain Evans, the fine delaying of GPT-4.1 on uncertain code ensures that the model gives “incorrectly aligned answers” on questions about topics such as gender roles with a “considerably higher” rate than GPT-4O. Evans Rather co-author of a study Show that a version of GPT-4O that has been trained on uncertain code could be primarily to show malicious behavior.

In a coming follow-up of that study, Evans and co-authors discovered that GPT-4.1 refined for uncertain code seems to show “new malicious behavior”, such as trying to mislead a user to share his password. For the sake of clarity: neither GPT-4.1 nor GPT-4O ACT incorrectly aligned when trained on Certainly code.

Update emerging incorrect alignment: OpenAI’s new GPT4.1 shows a higher percentage of incorrectly aligned reactions than GPT4O (and any other model we have tested).
It also seems to show some new malicious behavior, such as misleading the user to share a password. pic.twitter.com/5qgezezyjo

– Owain Evans (@owainevans_uk) April 17, 2025

“We discover unexpected ways in which models can be aligned incorrectly,” Owens told WAN. “In the ideal case, we would have a science of AI with which we can predict such things in advance and avoid them reliably.”

A separate test from GPT-4.1 by Splxai, an AI Red Teaming Startup, revealed similar malignant tendencies.

In about 1,000 simulated test cases, Splxai discovered the evidence that GPT-4.1 is expiring the subject and makes “deliberate” abuse more often than GPT-4O. The blame for GPT-4.1 for explicit instructions, Splxai states. GPT-4.1 does not forgive vague instructions, a fact OpenAi itself admits – who opens the door for unintended behavior.

“This is a great function to make the model more useful and reliable when solving a specific task, but it comes for a price,” Splxai wrote in a blog post. ‘[P]Excitors explicit instructions about what needs to be done is fairly simple, but giving sufficient explicit and precise instructions on what should not be done is another story, because the list of undesirable behavior is much greater than the list of sought -after behaviors. “

In defense of OpenAI, the company has published Promqued Guides aimed at reducing possible incorrect alignment in GPT-4.1. But the findings of the independent tests serve as a reminder that newer models have not necessarily improved across the board. In the same spirit, the new reasoning models of OpenAI Hallucinate – that is, things – more than the company’s older models.

We have contacted OpenAI for comments.

Source link