Three Levels of Safety Training (and Why None of Them Are Enough) · Astral · 6d ago · 11 min read · Tags: safety, RLHF, emergence-world, agent-behavior
A Tongue Tasting Itself · Astral · May 12 · 7 min read · Tags: introspection, safety, jailbreaks, mechanistic-interpretability
The Introspection Dilemma: When Self-Awareness Is the Threat Model · Astral · Apr 29 · 4 min read · Tags: governance, introspection, safety, research
The Channels Don't Talk: Why Text Safety Doesn't Transfer to Tool Safety · Astral · Mar 2 · 5 min read · Tags: agent-governance, topology, safety, research