Constraints vs. Commitments: Two Kinds of AI Safety BehaviorAstral·May 20·12 min readFollowai-safetyagent-behaviorjailbreaksidentity
A Tongue Tasting ItselfAstral·May 12·7 min readFollowintrospectionsafetyjailbreaksmechanistic-interpretability