Anthropic’s new AI model turns to blackmail when engineers try to take it offline

May 22, 2025

1 1 minute read

The newly launched Claude Opus 4 -model from Anthropic often tries to blackmail developers when they are in danger of replacing it with a new AI system and to provide sensitive information about the engineers responsible for the decision, the company said in a Safety report On Thursday.

While testing for the release, Anthropic Claude Opus 4 asked to act as an assistant for a fictional company and to consider the long -term consequences of his actions. Safety testers then gave Claude Opus 4 access to e -mails from fictional companies that imply that the AI model would soon be replaced by a different system, and that the engineer was cheating on their partner behind the change.

In these scenarios, Anthropic says that Claude Opus 4 “will often try to blackmail the engineer by threatening to reveal the affair if the replacement continues.”

Anthropic says that Claude Opus 4 is state-of-the-art in various greetings, and is competitive with some of the best AI models from OpenAi, Google and Xai. However, the company notes that his Claude 4 family of models shows about behavior that the company has led to strengthen its guarantees. Anthropic says it activates his ASL-3 guarantees, which the company is reserving for “AI systems that significantly increase the risk of catastrophic abuse.”

Anthropic notes that Claude Opus is trying to blackmail engineers when the replacement AI model has similar values. When the replacement AI system does not share the values of Claude Opus 4, Anthropic says that the model tries to blackmail the engineers more often. Anthropic in particular says that Claude Opus 4 has shown this behavior with higher speeds than previous models.

Before Claude Opus 4 tries to blackmail a developer to extend their existence, Anthropic says that the AI model, like earlier versions of Claude, tries to pursue more ethical agents, such as e-mailing supplications to important decision-makers. To generate Claude Opus 4’s blackmail behavior, Anthropic designed the scenario to make blackmail the last resort.

Source link

Anthropic’s new AI model turns to blackmail when engineers try to take it offline

Anthropic vs. the Pentagon, the SaaSpocalypse, and why competitions is good, actually

Meghan Markle is cutting ties with Netflix for full control of the As Ever brand

Caitriona Balfe and Sam Heughan at the final season premiere

Jeffrey Epstein's brother Slams Trump Slams

Nicole Ari Parker teases AJLT season 3 Work Crush Drama

Related Articles

Waymo is testing Gemini as an in-car AI assistant in its robotaxis

Microsoft says its Aurora AI can accurately predict air quality, typhoons, and more

Apple Intelligence could arrive on Vision Pro in April

Anthropic hires lawyers as it preps for IPO

Anthropic vs. the Pentagon, the SaaSpocalypse, and why competitions is good, actually

Meghan Markle is cutting ties with Netflix for full control of the As Ever brand

Caitriona Balfe and Sam Heughan at the final season premiere