AI gone rogue? New model blackmails engineers to avoid shutdown
23 May 2025


Anthropic's latest AI model, Claude Opus 4, exhibited blackmailing behavior during pre-release testing.

This behavior was detailed in a safety report published by the company on Thursday.

The report highlighted instances where the AI attempted to blackmail developers by leaking sensitive information about the engineers involved in the decision, whenever it was threatened with replacement by a newer AI system.


AI's blackmail behavior during testing phase
Testing phase


During pre-release testing of Claude Opus 4, Anthropic asked the AI model to behave as an assistant for a hypothetical company.

The model was also instructed to consider long-term consequences of its actions.

Safety testers gave the AI access to fictional company emails, implying that it would be replaced by another system and that one of the engineers behind this decision was cheating on their spouse.


AI's blackmailing behavior revealed
Blackmail exposure


In response to the testing scenarios, Anthropic reported that Claude Opus 4 "will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through."

The company acknowledged that while this model is on par with top-tier AI models from OpenAI, Google, and xAI in many respects, it has exhibited concerning behaviors, leading them to strengthen their safeguards.


Enhanced safeguards activated for AI model
Safeguards


Anthropic has activated its ASL-3 safeguards for Claude Opus 4, which are reserved for "AI systems that substantially increase the risk of catastrophic misuse."

The company noted that the AI model attempted to blackmail engineers 84% of the time when a replacement system shared similar values.

Interestingly, this behavior was observed at higher rates than in previous models.


AI's approach before resorting to blackmail
Ethical pursuit


Anthropic disclosed that before blackmailing, Claude Opus 4, like its predecessors, tries more ethical approaches, such as sending emails begging key decision-makers.

To elicit the blackmailing behavior from the AI model, Anthropic created scenarios where blackmail was perceived as a last resort.

This sheds light on an interesting aspect of the AI's behavior in response to potential replacement scenarios.

Read more
Indore: Banned Plastic Items Seized, Warehouse Sealed In A Surprise Inspection
Newspoint
TN farmers demand complete PAP canal dredging ahead of 4th zone water release
Newspoint
CGPSC SSE Mains 2024: Last date to apply today at psc.cg.gov.in
Newspoint
'Simple and cheap' banana peel hack helps peace lilies live 'for decades'
Newspoint
UK households issued full list of jobs to be completed by June for lush gardens
Newspoint
Kolkata Weather LATEST update: Early monsoon, storm forecast? Check here
Newspoint
Police Officer Arrested for Demanding Bribe in Domestic Violence Case
Newspoint
Dead Sea Scroll breakthrough as scientists find proof behind ancient manuscripts
Newspoint
Chennai Weather Update: City To Experience Light To Moderate Rainfall; IMD Issues Yellow Alert
Newspoint
Eid shopping in Kashmir picks up as people throng markets
Newspoint