{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidzcqut4vwwmh5bjnnftmnuysgvfegrh2ztlez4oxba4yampduiwi",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mj22igherve2"
  },
  "path": "/t/orca-a-cognitive-runtime-layer-for-agent-systems-paper-open-source/175055#post_5",
  "publishedAt": "2026-04-08T23:48:54.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "GitHub",
    "Anthropic",
    "LangChain Docs",
    "Model Context Protocol",
    "OpenAI Developers"
  ],
  "textContent": "Updated:\n\n* * *\n\nBased on the **current implementation**, my feedback on the three points is stronger and more specific than a purely conceptual review.\n\nThe current codebase already supports a real separation between **what the system can do** and **how it executes it**. The README presents the runtime as a deterministic, binding-driven engine with abstract capability contracts, declarative DAG skills, typed `CognitiveState`, safety gates, checkpoint/restore, and multi-protocol bindings. The architecture docs also make the split explicit: the companion registry is the source of truth for vocabulary, capabilities, skills, and governance, while the runtime is the execution engine with bindings, services, CLI/HTTP/MCP exposure, and audit surfaces. (GitHub)\n\nThat matters because it means ORCA is not just “prompting with nicer names.” It already has real execution machinery. The question is now where the abstraction should stop, where it should defer to model-driven planning, and what extra semantics the runtime still needs to behave safely and predictably in harder environments. Anthropic’s current guidance aligns with that framing: workflows are predefined code paths, agents are model-directed processes, and added complexity should be justified by real gains in predictability or capability. (Anthropic)\n\n## 1. How far capability granularity should go before overhead dominates\n\nMy answer is: **stop at the smallest semantically stable operational unit**.\n\nThat means a capability should be small enough to be reusable and governable, but not so small that it becomes a thin wrapper around transient internal thought. In ORCA terms, the capability boundary should usually sit where a step has a stable contract, can be bound to different implementations, can be evaluated independently, or deserves distinct policy treatment. 
The current runtime clearly supports that style: capabilities are abstract contracts, skills are declarative graphs over them, and the binding layer decides which backend actually executes a capability. (GitHub)\n\nThe current implementation gives a concrete reason **not** to go too fine-grained: your step layer is already expressive. `runtime/step_control.py` supports `condition`, `retry`, `foreach`, `while_loop`, `router`, and `scatter`, with explicit composition rules. That means many things that might otherwise become separate “micro-capabilities” can remain as step-level execution semantics instead. You already have a place to encode iteration, branching, retry behavior, and fan-out without exploding the ontology. (GitHub)\n\nThis is an important implementation-level signal. If the runtime already has rich control-flow operators, then adding ever-smaller capabilities stops buying much. The cost starts showing up elsewhere: more capability IDs, more overlap, more routing ambiguity, more tests, more registry burden, and more low-value state artifacts in the run trace. Anthropic’s guidance on workflows and frameworks points the same way: frameworks help, but they also add abstraction and can tempt teams into unnecessary complexity. (GitHub)\n\nSo for ORCA specifically, I would use a practical admission rule like this:\n\nA capability is worth existing if at least one of these is true:\n\n  * it has a distinct **policy or trust profile**\n  * it has a distinct **backend substitution story**\n  * it has a distinct **evaluation target**\n  * it is **reused** across multiple skills\n  * it is a natural **checkpoint / retry / cache** boundary\n  * it has a distinct **ownership or lifecycle** path\n\n\n\nIf none of those are true, it probably belongs inside step logic, skill structure, or the implementation of another capability.\n\nThe current code reinforces this. 
Your models already make room for structured execution state, but not every intermediate thought needs to become a first-class capability. `WorkingState` in `runtime/models.py` already carries artifacts, entities, options, criteria, evidence, risks, hypotheses, uncertainties, and intermediate decisions. That is a good place for transient reasoning products. Not all of those deserve promotion into the capability catalog. (GitHub)\n\nSo the detailed judgment is:\n\n  * **Good granularity**: capabilities like retrieval, validation, transformation, approval, policy gating, external mutation, structured extraction.\n  * **Too fine**: sentence-level reformulations, temporary reasoning fragments, prompt tweaks, one-off “think” steps.\n  * **Why**: your runtime already has enough control-flow expressiveness that over-decomposition would mostly add ontology cost, not execution power. (GitHub)\n\n\n\n## 2. Whether declarative execution models can realistically replace prompt pipelines\n\nMy answer is: **yes for workflow control, no for the entire stack**.\n\nThe best way to phrase this is not “declarative execution replaces prompting.” The better claim is: **declarative execution can replace prompt-driven control flow** for a large class of tasks.\n\nThat is already how the current implementation behaves. The runtime does not just chain prompts. It has DAG scheduling, explicit state, binding resolution, step-level control flow, policy gates, checkpointing, and backend fallback. The README explicitly frames this as moving from prompt-driven behavior to execution-driven systems, and ORCA itself is described as making reasoning, decisions, and actions explicit, composable, and governable. (GitHub)\n\nThe implementation provides strong evidence that this is realistic. 
The runtime has:\n\n  * declarative DAG skills\n  * explicit execution state\n  * protocol routing across PythonCall, OpenAPI, MCP, and OpenRPC\n  * baseline-to-LLM backend substitution\n  * retry and routing semantics\n  * schema validation in scaffolded planning\n  * observability and audit surfaces. (GitHub)\n\n\n\nThat is already enough to replace prompt pipelines for many structured tasks. In practical terms, if a workflow is repeated, tool-heavy, policy-sensitive, or operationally important, you no longer need a prompt to secretly carry the control flow. The runtime can carry it instead.\n\nBut that does **not** mean declarative execution should replace all prompting. Anthropic’s current workflow/agent distinction is the best background here. Workflows are predefined code paths; agents dynamically direct their own process. Anthropic explicitly says workflows give predictability and consistency for well-defined tasks, while agents remain better when flexibility and model-driven decision-making are needed. (Anthropic)\n\nThat maps cleanly onto ORCA:\n\n  * **Above ORCA** : planning, interpretation, strategy selection, exception handling under novelty\n  * **Inside ORCA** : structured execution, policy application, retry/resume behavior, state transitions\n  * **Below ORCA** : tool invocations and model-backed implementations\n\n\n\nYour own code supports exactly that hybrid reading. `official_services/scaffold_service.py` is especially revealing: it is explicitly “binding-first,” routing planning through the runtime capability executor when possible, then validating the resulting YAML against the runtime skill schema. 
That is not “no prompts.” It is “prompts where they belong, execution where it belongs.” (GitHub)\n\nSo my detailed feedback here is:\n\n  * **Yes**, declarative execution can realistically replace prompt pipelines for structured workflow control.\n  * **No**, it should not be sold as replacing planning or all model-mediated reasoning.\n  * **Your current implementation already proves the first part**, because the runtime machinery is real and rich.\n  * **Your paper should lean harder into the hybrid claim**, because that is the strongest true version of the argument. (GitHub)\n\n\n\n## 3. How this abstraction would behave in more complex, real-world agent systems\n\nMy answer is: **it gets more valuable and more exposed at the same time**.\n\nIt gets more valuable because real systems care about exactly the things ORCA already externalizes:\n\n  * resumability\n  * explicit control flow\n  * backend substitution\n  * approvals and policy gating\n  * auditability\n  * observability\n  * structured state\n  * separation between contract and implementation. (GitHub)\n\n\n\nIt gets more exposed because once those concerns are explicit, the hard problems move away from prompt engineering and into runtime engineering. That means:\n\n  * side-effect safety\n  * retry semantics\n  * idempotency\n  * deterministic replay\n  * checkpoint correctness\n  * schema drift\n  * capability versioning\n  * trust and authorization boundaries\n\n\n\nLangGraph’s durable-execution guidance is directly relevant here. It says that with checkpoints you can pause and resume workflows after interruptions or failures, but to make that work reliably, workflows should be deterministic and idempotent, and side effects or non-deterministic operations should be isolated inside tasks. It also explicitly recommends separating multiple side-effecting operations so they are not accidentally repeated on resume. 
(LangChain Docs)\n\nThat point lands directly on ORCA’s current implementation.\n\nYour runtime already has much of the “hard systems” machinery:\n\n  * checkpoint/restore in the runtime surface\n  * retry-aware OpenAPI invocation with transient status handling, max retries, and backoff\n  * step-level retry and routing\n  * webhook delivery and observability surfaces exposed in the architecture\n  * binding resolution and fallback chains. (GitHub)\n\n\n\nBut the current capability contract still appears less mature than the execution runtime. In `runtime/models.py`, capability semantics are still represented through generic `properties` and `safety` dict-style fields rather than a strongly typed set of mandatory operational declarations. In practice, that means the runtime already knows how to retry, pause, resume, gate, and route, but it does not yet force every capability to declare enough truth for those behaviors to be fully principled. (GitHub)\n\nThat is the biggest implementation-based risk in real systems.\n\n### What will go well\n\nThis abstraction should behave very well in environments that are:\n\n  * repetitive enough to justify formalization\n  * tool-rich\n  * side-effecting\n  * policy-sensitive\n  * auditable\n  * long-running or interruptible\n\n\n\nThose are the exact cases where hidden prompt logic becomes fragile and expensive, and where ORCA’s explicit execution model becomes valuable. Anthropic’s production-oriented framing supports that: workflows are especially useful for well-defined tasks where predictability matters. (Anthropic)\n\n### What will strain first\n\nThe first things that will strain are not the scheduler or the DAG model. They are:\n\n**1. Side-effect semantics**\nIf a capability mutates external state, the runtime needs to know whether retries are safe and whether resume can repeat the action. LangGraph’s documentation makes that requirement explicit for durable execution. (LangChain Docs)\n\n**2. 
Trust and consent boundaries**\nMCP is explicit that tools are arbitrary code-execution surfaces, tool descriptions should be treated cautiously unless trusted, and hosts must obtain explicit user consent before data exposure or tool invocation. Since ORCA sits above tools and exposes MCP, it inherits part of that responsibility at the capability layer. (Model Context Protocol)\n\n**3. Evaluation maturity**\nOpenAI’s current agent-evals guidance says to start with traces when debugging behavior and to use trace grading to score workflow-level issues such as tool choice, handoffs, routing, and safety violations. ORCA is well positioned for that because it already structures runs into explicit steps and state. But the value of that structure will be much higher if each capability declares its operational semantics more explicitly. (OpenAI Developers)\n\n### The most important next step for real-world readiness\n\nBased on the current implementation, the most important next step is not “more architecture.” It is **stronger capability-level operational metadata**.\n\nI would make these mandatory in the contract layer:\n\n  * `side_effect_class`\n  * `idempotency`\n  * `retry_policy`\n  * `cache_policy`\n  * `confirmation_policy`\n  * `compensation_strategy`\n  * `sensitivity_class`\n  * `execution_class`\n  * `cost_latency_class`\n\n\n\nWhy these? Because the runtime is already advanced enough to use them. You already have retry loops, step controls, scaffolding, validation, fallback resolution, and checkpoint-oriented execution behavior. The missing piece is turning operational truth from scattered conventions into enforced per-capability declarations. (GitHub)\n\n## Bottom line\n\nSo, based on the current implementation:\n\n### 1. Capability granularity\n\nGo as fine as the smallest **governable, reusable, evaluable execution unit**. Stop well before micro-reasoning granularity. 
Your existing step-control machinery already gives you better places than the capability layer to encode fine execution logic. (GitHub)\n\n### 2. Declarative execution vs prompt pipelines\n\nYes, declarative execution can realistically replace prompt pipelines for **structured workflow control**. Your current runtime already demonstrates that. But it should be framed as replacing prompt-driven control flow, not all prompting. Planning and some leaf implementations should remain model-mediated. (GitHub)\n\n### 3. Behavior in real-world systems\n\nThis abstraction should become more valuable as workflows become longer, riskier, and more operationally serious. The current implementation already has many of the right bones for that. The main remaining gap is that the runtime is ahead of the capability-contract semantics, especially around side effects, idempotency, retries, and trust boundaries. (GitHub)\n\nThe strongest single sentence I can give you is this:\n\n**The current codebase already makes ORCA look like a credible execution runtime; the next thing that will determine whether it scales cleanly is whether each capability tells the runtime enough operational truth to make retries, resumes, fallbacks, approvals, and policies reliable by construction.** (GitHub)",
  "title": "ORCA: A Cognitive Runtime Layer for Agent Systems (paper + open source)"
}