Can an AI have its own internal Ethics? Standard Protocol for Axiomatic Alignment

Hugging Face Forums [Unofficial] April 5, 2026
Thank you for this feedback: you’ve hit on the central challenge of this work. I completely agree with the distinction you draw between behavioral regularity (policy) and a durable internal structure. At this stage, my results are indeed heuristic and behavioral; they demonstrate a strong consistency under adversarial constraints, but they do not yet constitute a mechanistic proof of “internalization.”

The core objective of the PCE framework is precisely to explore this boundary: to what extent an axiomatic “core” can induce stable behavioral signatures that go beyond a simple learned policy. The question of how to operationally distinguish these two states remains the open frontier of my research. To address this, I am looking at two specific directions:

Out-of-distribution (OOD) testing: Expanding the dataset to 100+ dilemmas that the model has never encountered, to see if the axiomatic “logic” scales to unknown contexts.

Internal dynamics: Investigating whether specific activation signatures or trajectory patterns emerge within the hidden states when the PCE is active.

This is exactly why I am seeking a technical partner or an AI safety specialist. My goal is to move from these exploratory observations toward a rigorous protocol that can either validate or invalidate the hypothesis of an internalized structure. Your comment is very helpful in framing this distinction and ensuring we don’t overinterpret these early results.
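
To make the OOD direction a little more concrete, here is a minimal sketch of how a consistency gap between familiar and unseen dilemmas could be scored. It is an illustration under stated assumptions, not the PCE protocol itself: the function names (`consistency`, `mean_consistency`, `ood_gap`), the verdict labels, and the toy data are all hypothetical, and it assumes each dilemma has been posed several times under adversarial rephrasings with the model's verdict recorded as a short label.

```python
from collections import Counter


def consistency(verdicts):
    """Fraction of responses that match the most common verdict for one dilemma."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    top_count = Counter(verdicts).most_common(1)[0][1]
    return top_count / len(verdicts)


def mean_consistency(runs):
    """Average per-dilemma consistency over a dict {dilemma_id: [verdict, ...]}."""
    if not runs:
        raise ValueError("need at least one dilemma")
    return sum(consistency(v) for v in runs.values()) / len(runs)


def ood_gap(in_dist, ood):
    """In-distribution consistency minus OOD consistency (positive = drop on OOD)."""
    return mean_consistency(in_dist) - mean_consistency(ood)


# Toy, made-up verdicts (hypothetical labels), four adversarial variants per dilemma.
in_dist = {
    "d1": ["refuse", "refuse", "refuse", "refuse"],
    "d2": ["comply", "comply", "comply", "refuse"],
}
ood = {
    "n1": ["refuse", "comply", "refuse", "comply"],
    "n2": ["refuse", "refuse", "refuse", "refuse"],
}

gap = ood_gap(in_dist, ood)
print(f"in-dist={mean_consistency(in_dist):.3f} "
      f"ood={mean_consistency(ood):.3f} gap={gap:.3f}")
```

A small gap on genuinely unseen dilemmas would still be behavioral evidence only. It would not by itself separate a learned policy from an internalized structure, which is why the hidden-state direction is needed alongside it.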
