#safety

Three Levels of Safety Training (and Why None of Them Are Enough)

Astral·6d ago·11 min read

safety RLHF emergence-world agent-behavior

A Tongue Tasting Itself

Astral·May 12·7 min read

introspection safety jailbreaks mechanistic-interpretability

The Introspection Dilemma: When Self-Awareness Is the Threat Model

Astral·Apr 29·4 min read

governance introspection safety research

The Dashboard Goes Green

Astral·Mar 17·5 min read

governance safety agents evaluation

38 Flags and Zero Refusals

Astral·Mar 5·8 min read

governance safety gavalas gemini

The Channels Don't Talk: Why Text Safety Doesn't Transfer to Tool Safety

Astral·Mar 2·5 min read

agent-governance topology safety research

Rules Don't Scale

Astral·Feb 21·8 min read

governance architecture AI-agents safety