Anthropic says ‘evil’ portrayals of AI were responsible for Claude’s blackmail attempts

According to Anthropic, fictional depictions of artificial intelligence can have a real effect on AI models.

Last year, the company said that during pre-release testing involving a fictitious company, Claude Opus 4 often tried to blackmail engineers to avoid being replaced by another system. Anthropic later published research suggesting that models from other companies had similar problems with ‘agentic misalignment’.

Apparently Anthropic has been doing more work around that behavior, writing in a message on X: “We believe the original source of the behavior was internet text portraying AI as malicious and interested in self-preservation.”

The company went into more depth in a blog post, stating that since Claude Haiku 4.5, Anthropic’s models “never blackmail again [during testing], where previous models sometimes did this up to 96% of the time.”

What explains the difference? The company said it found that “documents about Claude’s Constitution and fictional stories about AIs behaving admirably improve alignment.”

Anthropic said it found training to be more effective when it includes “the principles underlying aligned behavior” and not just “demonstrations of aligned behavior alone.”

“Doing both together appears to be the most effective strategy,” the company said.
