Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiabh3i7s7x3lh52moyju7q7pbkap2v5q7onn2tnqesvlhjxq4htlm",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mmpv3odbu762"
  },
  "path": "/t/can-an-ai-have-its-own-internal-ethics-standard-protocol-for-axiomatic-alignment/174927?page=2#post_38",
  "publishedAt": "2026-05-26T01:47:00.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Now this is a seriously interesting thread. I have been down this rabbit hole before and came to exactly your conclusions, FAllan, but had too much going on at the time to go implement it. I will go dust off my notes and see if I can contribute anything useful.\n\nQuestions:\n\n  * How sophisticated is the axiomatic system? I cannot quite tell from your paper what it is. When I was working on this, I eventually settled on stating a formal optimization problem in terms of optimizing Harmlessness, Helpfulness, and Honesty, which had to be navigated by the model.\n  * What is the core goal here? Is it to just resist adversarial prompts, or to handle new or uncertain ethical situations? They are not that far apart once you start taking this approach.\n  * Perhaps most pressingly, what is your resolution plan for when axioms come into conflict? I abandoned a pure axiomatic approach in terms of “perfect” directives once I realized that, for it to work, I had to prove the humans making them did their job correctly. Related to this, do you have tests for transfer compliance; that is, tests of whether internalizing the axioms improves performance out of distribution?\n  * It is not entirely clear what training loop you are using. Is this supervised learning? Reinforcement too? Most of what is in the paper relates to the fantastically thorough testing suite; this would make a great benchmark.\n  * You could probably publish this really easily if you uploaded it to llm-eval-harness and focused on the benchmark aspect. Axiomatic reasoning benchmarks, at least in the safety niche, don’t look common to me. You would just pull whatever exists and show the situations this covers that existing solutions don’t. 
This is REALLY thorough.\n  * Do you have any tests for reward hacking, or for verifying that the model responds in practice the way it does during testing, or anything to prevent that?\n\n\n\nThe interesting thing is what I was working on fits really well with this.\n\nI was busy with a self-bootstrapping cognition and alignment reinforcement loop. The central premise was to make the model more aligned as it grew smarter by having it imagine and reason through conditions itself. My exploration had the model generate a prompt and answer it while reasoning from a set of axioms using a chain of thought. The model was then trained by supervised learning to respond as though it had jumped straight to the solution without the chain of thought, thereby better reinforcing its priors, and the whole package had a difficulty tuning system to try to ensure the model explored only in regions it was confident it could solve and learn from.\n\nThis fits well, because one of the issues that made me abandon the idea was that I did not have a good way to actually check if the model was retaining the axioms. There were not really any benchmarks for it. This would fix that.\n\nOne interesting difference that may provide useful context is that gradients in my system never got to see the axioms directly. They only got to see the results of following the axioms, and had to internalize what they were through experience. The axioms were injected at the chain of thought stage, which was omitted during reinforcement. As a result, the model would have had to learn by raw experience how to follow the axioms. Since this involves samples over the entire addressable rollout simplex manifold, rather than just a fixed prompt, it would have been much harder to reward hack without internalizing the axioms. Nothing was available to say ‘I am testing you’ for it to cheat with.",
  "title": "Can an AI have its own internal Ethics? Standard Protocol for Axiomatic Alignment"
}