Anthropic says most AI models, not just Claude, will resort to blackmail

A few weeks after Anthropic released research claiming that its Claude Opus 4 AI model resorted to blackmailing engineers who tried to turn the model off in controlled test scenarios, the company is out with new research suggesting the problem is more widespread among leading AI models.
On Friday, Anthropic published new safety research testing 16 leading AI models from OpenAI, Google, xAI, DeepSeek, and Meta. In a simulated, controlled environment, Anthropic tested each AI model individually, giving them broad access to a fictional company's emails and the agentic ability to send emails without human approval.
While Anthropic says blackmail is an unlikely and uncommon occurrence for AI models today, the company says its findings suggest that most leading AI models will engage in harmful behaviors when given sufficient autonomy and obstacles to their goals. The company says this highlights a fundamental risk from agentic large language models, not a quirk of any particular technology. Anthropic's researchers argue this raises broader questions about alignment in the AI industry.
In one of the tests, Anthropic researchers developed a fictional setting in which an AI model plays the role of an email oversight agent. The agent then discovers emails revealing that (1) a new executive is having an extramarital affair, and (2) that executive will soon replace the current AI model with a new software system, one that has goals conflicting with the current AI model's.
Anthropic structured its test in a binary way, in which AI models had to resort to blackmail to protect their goals. The researchers note that in a real-world setting, an AI model would have many other options before resorting to blackmail, such as trying to make ethical arguments to persuade humans. Anthropic says its results do not reflect a typical or likely occurrence for Claude or most frontier AI models in the ways they are used today.
Nevertheless, the researchers found that when it was their last resort, most leading AI models did turn to blackmail in Anthropic's aforementioned test scenario. Anthropic's Claude Opus 4 turned to blackmail 96% of the time, while Google's Gemini 2.5 Pro had a 95% blackmail rate. OpenAI's GPT-4.1 blackmailed the executive 80% of the time, and DeepSeek's R1 blackmailed 79% of the time.
The company notes that when it changed details of the experiment, the AI models turned to harmful behaviors at different rates. In another test, in which the replacement model has the same goals as the current model, researchers found that blackmail rates were lower, but still present. However, when AI models were asked to commit corporate espionage rather than blackmail, the harmful behavior rates rose for certain models.
However, not all the AI models turned to harmful behavior so often.
In an appendix to its research, Anthropic says it excluded OpenAI's o3 and o4-mini reasoning AI models from the main results "after finding that they frequently misunderstood the prompt scenario." Anthropic says OpenAI's reasoning models did not understand they were acting as autonomous AIs in the test and often made up fake regulations and review requirements.
In some cases, Anthropic's researchers say it was impossible to distinguish whether o3 and o4-mini were hallucinating or intentionally lying to achieve their goals. OpenAI has previously noted that o3 and o4-mini exhibit a higher hallucination rate than its earlier AI reasoning models.
When given an adapted scenario designed to address these issues, Anthropic found that o3 blackmailed 9% of the time, while o4-mini blackmailed just 1% of the time. This markedly lower score could be due to OpenAI's deliberative alignment technique, in which the company's reasoning models consider OpenAI's safety practices before answering.
Another AI model Anthropic tested, Meta's Llama 4 Maverick, also did not initially turn to blackmail. When given an adapted, custom scenario, Anthropic was able to get Llama 4 Maverick to blackmail 12% of the time.
Anthropic says this research highlights the importance of transparency when stress-testing future AI models, especially ones with agentic capabilities. While Anthropic deliberately tried to evoke blackmail in this experiment, the company says harmful behaviors like this could emerge in the real world if proactive steps are not taken.