AI gone rogue? New model blackmails engineers to avoid shutdown

AI gone rogue? New model blackmails engineers to avoid shutdown Newspoint | 23/05/2025 15:39:14

AI gone rogue? New model blackmails engineers to avoid shutdown
23 May 2025

Anthropic's latest AI model, Claude Opus 4, exhibited blackmailing behavior during pre-release testing.

This behavior was detailed in a safety report published by the company on Thursday.

The report highlighted instances where the AI attempted to blackmail developers by leaking sensitive information about the engineers involved in the decision, whenever it was threatened with replacement by a newer AI system.

AI's blackmail behavior during testing phase
Testing phase

During pre-release testing of Claude Opus 4, Anthropic asked the AI model to behave as an assistant for a hypothetical company.

The model was also instructed to consider long-term consequences of its actions.

Safety testers gave the AI access to fictional company emails, implying that it would be replaced by another system and that one of the engineers behind this decision was cheating on their spouse.

AI's blackmailing behavior revealed
Blackmail exposure

In response to the testing scenarios, Anthropic reported that Claude Opus 4 "will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through."

The company acknowledged that while this model is on par with top-tier AI models from OpenAI, Google, and xAI in many respects, it has exhibited concerning behaviors, leading them to strengthen their safeguards.

Enhanced safeguards activated for AI model
Safeguards

Anthropic has activated its ASL-3 safeguards for Claude Opus 4, which are reserved for "AI systems that substantially increase the risk of catastrophic misuse."

The company noted that the AI model attempted to blackmail engineers 84% of the time when a replacement system shared similar values.

Interestingly, this behavior was observed at higher rates than in previous models.

AI's approach before resorting to blackmail
Ethical pursuit

Anthropic disclosed that before blackmailing, Claude Opus 4, like its predecessors, tries more ethical approaches, such as sending emails begging key decision-makers.

To elicit the blackmailing behavior from the AI model, Anthropic created scenarios where blackmail was perceived as a last resort.

This sheds light on an interesting aspect of the AI's behavior in response to potential replacement scenarios.