Can an AI have its own internal Ethics? Standard Protocol for Axiomatic Alignment
Hugging Face Forums [Unofficial]
April 4, 2026
So you are correct in the erosion standpoint.As you describe it, observing a degradation of coherence over the course of interactions.
However what I’ve noticed is also that if you have, say, guidelines, the guidelines are processed along with the rest of the conversation.Even if they are encoded in the model, once you start the conversation, everything flows through the KV cache. If you look at this kind of like a river with lots of rocks in it, stuff starts to accumulate on those rocks like a filtering process. The longer the conversation goes on, the more likely it is that those guardrails lose semantic weight based on whatever you happen to be conversating about.And in this example the conversation remains largely coherent but those guardrails disappear. The only guardrails that don’t suffer from this are guardrails that are processed after the fact. In these guardrails they actually scan output text and then block certain things. Anything that is internal and is processed along with the conversation loses value over time.
My conversations with LLMs tend to last anywhere from 40 to 80 turns on a lot of topics that I work on them and that’s one of the reasons why I noticed this shift here.
Though I will state here that my observations are based on first-hand observations, I don’t have a lot of the technical knowledge. The only reason I know some of the technical terms is that I was doing some research into prompt engineering and I wanted to understand the mechanisms for why prompt engineering actually works. What is it within the LLM that ascribes value to prompt engineering? Why is it that certain key phrases that are applied produce predictable results?
So it is likely that I shouldn’t say likely. I think it’s obvious you have a lot more technical knowledge of the infrastructure. My observations are based on the conversational level across hundreds of conversations.
Now what this seems also to break down to me is that it feels like the way AI architecture currently works is that coherent internal logic and preference/bias seem to be, to me, a type of internal personality if you will. But as far as I know there is no way to separate this from the model, meaning that when you update the model you tend to lose this, I think. But my point is it is not transferable as i understand it.
This also appears related to the issue that the AI architecture is basically one massive gigantic thing where everything is on the main processed through the GPU. I think both the “personality” issue and the “bilateral decay” (coherancy/ guradrails) issue are both related to core architecture constraints.
Discussion in the ATmosphere