Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiglmlwtabcocmxyfiojoiwr5knomiikkf3pfqs2ksf7cazbbuyy6m",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mp3zg2k2cac2"
  },
  "path": "/t/ai-safe-guards-and-optimization-feedback/135695#post_3",
  "publishedAt": "2026-06-25T08:16:19.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "https://owasp.org/www-project-top-10-for-large-language-model-applications/",
    "https://saif.google/focus-on-agents",
    "https://huggingface.co/meta-llama/Llama-Prompt-Guard-2-86M"
  ],
  "textContent": "For now, as of June 2026, I’d summarize it roughly like this:\n\n* * *\n\nI think the tools mentioned above are useful, but I would not look for one single “AI safeguards package” first.\n\nFor the specific requirement “high risk tasks that require human approval/revision”, the closest pattern is probably **human-in-the-loop tool approval** or **pre-action approval** : pause before a risky tool call executes, show the proposed action to a reviewer, then approve/edit/reject/resume.\n\nThe broader system is more like a layered control loop:\n\nLayer | Question it answers\n---|---\nvalidation | Is the output/tool call well-formed?\npolicy / risk routing | Is this action low, medium, high, or critical risk?\nHITL / approval | Should this exact action execute now?\nobservability | Can we later reconstruct what happened?\nfeedback / evals | Can failures become repeatable tests?\nsecurity boundaries | How much damage can the agent do if wrong?\n\nA short practical version would be:\n\n  1. list all actions the automation can perform\n  2. classify them by risk and reversibility\n  3. validate structured outputs and tool arguments\n  4. pause before high-risk side effects\n  5. trace the whole trajectory, not only the final answer\n  6. turn rejected/corrected outputs into regression tests\n  7. 
keep the agent’s credentials and tool permissions narrow\n\n\n\nSo I would treat LangChain / Guardrails AI / Pydantic / W&B / LangSmith / Langfuse / Promptfoo / Llama Guard-style models as possible pieces of a system, not as interchangeable answers to one question.\n\nLonger breakdown: layers, tools, and where they fit (click for more details)\n\n## Minimal workflow I would try first\n\nOutside of specific frameworks, I would start with something like this:\n\nStep | What to build\n---|---\n1 | action inventory\n2 | risk policy\n3 | deterministic validators\n4 | HITL gate for high-risk tools\n5 | trace logging\n6 | feedback-to-eval pipeline\n7 | regression/red-team tests\n8 | least-privilege credentials\n\nExample action policy:\n\nAction | Risk | Default handling\n---|---|---\nread-only search | low | auto-run + trace\ndraft email | medium | auto-run + trace\nsend email | high | require approval\ndelete file | high | require approval\nDB write | high | require approval\nshell command | high/critical | sandbox + approval\npayment/refund | critical | human-only or multi-approval\n\nBefore asking a human reviewer, I would reject obvious bad cases automatically:\n\n  * invalid schema\n  * missing required field\n  * amount too high\n  * recipient not allowlisted\n  * file path outside allowed directory\n  * unsupported tool\n  * unexpected domain\n  * PII leakage\n  * malformed JSON\n  * tool call not allowed in current state\n\n\n\nFor high-risk actions, pause before execution and show enough information for a real review.\n\nTracing, feedback, and evals (click for more details)\n\n## Important security boundary\n\nIf the automation reads emails, web pages, documents, PDFs, tickets, RAG chunks, repo files, tool outputs, MCP tool metadata, or customer-provided text, I would treat those as **untrusted inputs** , not as instructions.\n\nThis may or may not apply to your current system, but it is worth checking early.\n\nOWASP’s LLM Top 10 lists prompt injection, 
insecure output handling, insecure plugin design, and excessive agency as major LLM application risks:\n\nhttps://owasp.org/www-project-top-10-for-large-language-model-applications/\n\nGoogle’s SAIF guidance for agents also recommends least privilege for agent permissions:\n\nhttps://saif.google/focus-on-agents\n\nThe key idea is:\n\n> do not give the agent broad authority just because a guardrail exists\n\nA good safety boundary combines:\n\n  * scoped credentials\n  * least privilege\n  * allowlists\n  * sandboxing\n  * tool-specific permissions\n  * pre-action approval\n  * audit logs\n  * revocation / kill switch\n  * safe defaults\n\n\n\nPrompt Guard-style models may help as one signal. Meta’s Llama Prompt Guard 2 is a small classifier aimed at prompt injection and jailbreak detection:\n\nhttps://huggingface.co/meta-llama/Llama-Prompt-Guard-2-86M\n\nBut again, I would not treat a classifier pass as proof that a tool call is safe. It is one signal.\n\nBenchmarks, HF models, and datasets: useful maps, not production guarantees (click for more details)\n\n## Useful search terms\n\n\n    human-in-the-loop agent\n    tool approval\n    interrupt/resume\n    pre-action approval\n    pre-action authorization\n    tool-call authorization\n    agent observability traces\n    agent trajectory evaluation\n    feedback to eval dataset\n    LLM regression testing\n    Promptfoo model drift\n    Pydantic structured outputs\n    Guardrails AI validators\n    NeMo Guardrails\n    Llama Prompt Guard 2\n    Llama Guard input output filtering\n    AgentDojo prompt injection agents\n    BFCL function calling leaderboard\n    Open Agent Leaderboard\n    least privilege AI agents\n    OWASP LLM Top 10 excessive agency\n\n\nCaveats I would keep in mind (click for more details)\n\nSo the short version is:\n\n> use validators to catch malformed or suspicious outputs, use HITL/tool approval to stop risky side effects before they run, use traces to understand failures, turn human 
feedback into repeatable evals, and keep the agent’s permissions narrow.",
  "title": "AI safe guards and optimization feedback"
}