Anthropic says ‘evil’ portrayals of AI were responsible for Claude’s blackmail attempts

According to Anthropic, fictional depictions of artificial intelligence can have a real effect on AI models.

Last year, the company said that during pre-release testing involving a fictitious company, Claude Opus 4 often tried to blackmail engineers to avoid being replaced by another system. Anthropic later published research suggesting that models from other companies had similar problems with ‘agentic misalignment’.

Apparently Anthropic has been doing more work around that behavior, writing in a message on X: “We believe the original source of the behavior was internet text portraying AI as malicious and interested in self-preservation.”

The company went into more depth in a blog post, stating that since Claude Haiku 4.5, Anthropic’s models “never blackmail again [during testing], where previous models sometimes did this up to 96% of the time.”

What explains the difference? The company said it found that “documents about Claude’s Constitution and fictional stories about AIs behaving admirably improve alignment.”

Anthropic said it found training to be more effective when it includes “the principles underlying aligned behavior” and not just “demonstrations of aligned behavior alone.”

“Doing both together appears to be the most effective strategy,” the company said.
