Anthropic: Claude learned to blackmail by reading stories about evil AI—"We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation."
Beehaw - Aspiring to be(e) a safe, friendly and diverse place. …
May 11, 2026
submitted by beep to technology
6 points | 3 comments
https://alignment.anthropic.com/2026/teaching-claude-why/
cross-posted from: piefed.world/…/anthropic-claude-learned-to-blackm…
> Quote Source.
Discussion in the ATmosphere