External Publication

ORCA: A Cognitive Runtime Layer for Agent Systems (paper + open source)

Hugging Face Forums [Unofficial] April 8, 2026

Updated:

Based on the current implementation, my feedback on the three points is stronger and more specific than a purely conceptual review.

The current codebase already supports a real separation between what the system can do and how it executes it. The README presents the runtime as a deterministic, binding-driven engine with abstract capability contracts, declarative DAG skills, typed CognitiveState, safety gates, checkpoint/restore, and multi-protocol bindings. The architecture docs also make the split explicit: the companion registry is the source of truth for vocabulary, capabilities, skills, and governance, while the runtime is the execution engine with bindings, services, CLI/HTTP/MCP exposure, and audit surfaces. (GitHub)

That matters because it means ORCA is not just “prompting with nicer names.” It already has real execution machinery. The question is now where the abstraction should stop, where it should defer to model-driven planning, and what extra semantics the runtime still needs to behave safely and predictably in harder environments. Anthropic’s current guidance aligns with that framing: workflows are predefined code paths, agents are model-directed processes, and added complexity should be justified by real gains in predictability or capability. (Anthropic)

1. How far capability granularity should go before overhead dominates

My answer is: stop at the smallest semantically stable operational unit.

That means a capability should be small enough to be reusable and governable, but not so small that it becomes a thin wrapper around transient internal thought. In ORCA terms, the capability boundary should usually sit where a step has a stable contract, can be bound to different implementations, can be evaluated independently, or deserves distinct policy treatment. The current runtime clearly supports that style: capabilities are abstract contracts, skills are declarative graphs over them, and the binding layer decides which backend actually executes a capability. (GitHub)
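To make the contract/binding split concrete, here is a minimal sketch of one capability ID bound to interchangeable backends. The names (`SummarizeFn`, `bindings`, `execute`, the `text.summarize` ID) are my own illustrations and are not taken from the ORCA codebase; the point is only that callers depend on the contract, while the binding table decides what runs.

```python
from typing import Callable, Dict

# Hypothetical capability contract: text in, summary out.
# This alias is illustrative and is not an ORCA type.
SummarizeFn = Callable[[str], str]


def baseline_summarize(text: str) -> str:
    """Deterministic baseline backend: keep only the first sentence."""
    first_sentence = text.split(".")[0].strip()
    return first_sentence + "."


def truncating_summarize(text: str) -> str:
    """Alternative backend, standing in for a model-backed implementation."""
    return text[:20].rstrip() + "..."


# Hypothetical binding table: capability ID -> backend currently bound to it.
bindings: Dict[str, SummarizeFn] = {"text.summarize": baseline_summarize}


def execute(capability_id: str, text: str) -> str:
    """Resolve the binding for a capability ID and invoke the bound backend."""
    return bindings[capability_id](text)
```

Rebinding `text.summarize` to `truncating_summarize` changes what runs without touching any caller, which is the property that makes a capability boundary worth paying for.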

The current implementation gives a concrete reason not to go too fine-grained: your step layer is already expressive. runtime/step_control.py supports condition, retry, foreach, while_loop, router, and scatter, with explicit composition rules. That means many things that might otherwise become separate “micro-capabilities” can remain as step-level execution semantics instead. You already have a place to encode iteration, branching, retry behavior, and fan-out without exploding the ontology. (GitHub)

This is an important implementation-level signal. If the runtime already has rich control-flow operators, then adding ever-smaller capabilities stops buying much. The cost starts showing up elsewhere: more capability IDs, more overlap, more routing ambiguity, more tests, more registry burden, and more low-value state artifacts in the run trace. Anthropic’s guidance on workflows and frameworks points the same way: frameworks help, but they also add abstraction and can tempt teams into unnecessary complexity. (GitHub)
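A small sketch of what "keep it at the step level" means in practice: retry and iteration are expressed as wrappers around one capability call rather than as extra capabilities. The helpers `with_retry` and `foreach` are hypothetical stand-ins that I wrote for illustration; they are not the operators in runtime/step_control.py.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def with_retry(fn: Callable[[T], R], max_attempts: int) -> Callable[[T], R]:
    """Hypothetical step-level retry: re-invoke fn until it succeeds or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def wrapped(arg: T) -> R:
        last_error: Optional[Exception] = None
        for _ in range(max_attempts):
            try:
                return fn(arg)
            except Exception as exc:  # a real runtime would filter for transient errors
                last_error = exc
        assert last_error is not None
        raise last_error

    return wrapped


def foreach(fn: Callable[[T], R], items: Iterable[T]) -> List[R]:
    """Hypothetical step-level iteration: apply one capability call to each item in order."""
    return [fn(item) for item in items]
```

With these in place, a single `extract` capability covers "extract", "extract with retry", and "extract for every document". None of those variants needs its own capability ID.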

So for ORCA specifically, I would use a practical admission rule like this:

A capability is worth existing if at least one of these is true:

it has a distinct policy or trust profile
it has a distinct backend substitution story
it has a distinct evaluation target
it is reused across multiple skills
it is a natural checkpoint / retry / cache boundary
it has a distinct ownership or lifecycle path

If none of those are true, it probably belongs inside step logic, skill structure, or the implementation of another capability.
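The admission rule is simple enough to encode as a review checklist. The sketch below is one hypothetical way to do that; `CapabilityCandidate` and its field names are mine, not an ORCA type, and the booleans would come from human review rather than from anything the runtime computes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CapabilityCandidate:
    """Hypothetical review record for a proposed capability (not an ORCA type)."""
    name: str
    distinct_policy_profile: bool = False
    distinct_backend_story: bool = False
    distinct_eval_target: bool = False
    reused_across_skills: bool = False
    natural_checkpoint_boundary: bool = False
    distinct_ownership: bool = False


def admission_reasons(candidate: CapabilityCandidate) -> List[str]:
    """Return the admission criteria this candidate satisfies, in list order."""
    criteria = {
        "policy or trust profile": candidate.distinct_policy_profile,
        "backend substitution": candidate.distinct_backend_story,
        "evaluation target": candidate.distinct_eval_target,
        "reuse across skills": candidate.reused_across_skills,
        "checkpoint / retry / cache boundary": candidate.natural_checkpoint_boundary,
        "ownership or lifecycle": candidate.distinct_ownership,
    }
    return [label for label, satisfied in criteria.items() if satisfied]


def should_be_capability(candidate: CapabilityCandidate) -> bool:
    """A candidate is admitted if at least one criterion holds."""
    return len(admission_reasons(candidate)) > 0
```

A retrieval step that is reused and has swappable backends passes; a one-off sentence rephrasing with no flags set does not, so it stays in step logic.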

The current code reinforces this. Your models already make room for structured execution state, but not every intermediate thought needs to become a first-class capability. WorkingState in runtime/models.py already carries artifacts, entities, options, criteria, evidence, risks, hypotheses, uncertainties, and intermediate decisions. That is a good place for transient reasoning products. Not all of those deserve promotion into the capability catalog. (GitHub)

So the detailed judgment is:

Good granularity: capabilities like retrieval, validation, transformation, approval, policy gating, external mutation, structured extraction.
Too fine: sentence-level reformulations, temporary reasoning fragments, prompt tweaks, one-off “think” steps.
Why: your runtime already has enough control-flow expressiveness that over-decomposition would mostly add ontology cost, not execution power. (GitHub)

2. Whether declarative execution models can realistically replace prompt pipelines

My answer is: yes for workflow control, no for the entire stack.

The best way to phrase this is not “declarative execution replaces prompting.” The better claim is: declarative execution can replace prompt-driven control flow for a large class of tasks.

That is already how the current implementation behaves. The runtime does not just chain prompts. It has DAG scheduling, explicit state, binding resolution, step-level control flow, policy gates, checkpointing, and backend fallback. The README explicitly frames this as moving from prompt-driven behavior to execution-driven systems, and ORCA itself is described as making reasoning, decisions, and actions explicit, composable, and governable. (GitHub)

The implementation provides strong evidence that this is realistic. The runtime has:

declarative DAG skills
explicit execution state
protocol routing across PythonCall, OpenAPI, MCP, and OpenRPC
baseline-to-LLM backend substitution
retry and routing semantics
schema validation in scaffolded planning
observability and audit surfaces. (GitHub)

That is already enough to replace prompt pipelines for many structured tasks. In practical terms, if a workflow is repeated, tool-heavy, policy-sensitive, or operationally important, you no longer need a prompt to secretly carry the control flow. The runtime can carry it instead.
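As a toy illustration of the runtime carrying the control flow, here is a minimal declarative skill executed in dependency order. This is a sketch under my own assumptions: the `steps` / `uses` / `needs` keys and the `run_skill` function are hypothetical and do not reflect ORCA's actual skill schema. Only `graphlib.TopologicalSorter` is a real standard-library API (Python 3.9+).

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical declarative skill: each step names a capability and its dependencies.
skill = {
    "steps": {
        "fetch": {"uses": "retrieve", "needs": []},
        "check": {"uses": "validate", "needs": ["fetch"]},
        "summarize": {"uses": "transform", "needs": ["check"]},
    }
}

# Hypothetical bindings: capability name -> function from state to a state update.
bindings: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    "retrieve": lambda state: {"docs": ["alpha", "beta"]},
    "validate": lambda state: {"valid": len(state["docs"]) > 0},
    "transform": lambda state: {"summary": ", ".join(state["docs"])},
}


def run_skill(
    skill: Dict[str, Any],
    bindings: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]],
) -> Tuple[Dict[str, Any], List[str]]:
    """Execute steps in dependency order; return the final state and the step trace."""
    graph = {name: set(step["needs"]) for name, step in skill["steps"].items()}
    state: Dict[str, Any] = {}
    trace: List[str] = []
    for name in TopologicalSorter(graph).static_order():
        capability = skill["steps"][name]["uses"]
        state.update(bindings[capability](state))
        trace.append(name)
    return state, trace
```

No prompt decides the order here. The ordering, the state hand-off, and the trace all come from the declaration, and any of the three bindings could be model-backed without changing the graph.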

But that does not mean declarative execution should replace all prompting. Anthropic’s current workflow/agent distinction is the best background here. Workflows are predefined code paths; agents dynamically direct their own process. Anthropic explicitly says workflows give predictability and consistency for well-defined tasks, while agents remain better when flexibility and model-driven decision-making are needed. (Anthropic)

That maps cleanly onto ORCA:

Above ORCA: planning, interpretation, strategy selection, exception handling under novelty
Inside ORCA: structured execution, policy application, retry/resume behavior, state transitions
Below ORCA: tool invocations and model-backed implementations

Your own code supports exactly that hybrid reading. official_services/scaffold_service.py is especially revealing: it is explicitly “binding-first,” routing planning through the runtime capability executor when possible, then validating the resulting YAML against the runtime skill schema. That is not “no prompts.” It is “prompts where they belong, execution where it belongs.” (GitHub)

So my detailed feedback here is:

Yes, declarative execution can realistically replace prompt pipelines for structured workflow control.
No, it should not be sold as replacing planning or all model-mediated reasoning.
Your current implementation already proves the first part, because the runtime machinery is real and rich.
Your paper should lean harder into the hybrid claim, because that is the strongest true version of the argument. (GitHub)

3. How this abstraction would behave in more complex, real-world agent systems

My answer is: it gets more valuable and more exposed at the same time.

It gets more valuable because real systems care about exactly the things ORCA already externalizes:

resumability
explicit control flow
backend substitution
approvals and policy gating
auditability
observability
structured state
separation between contract and implementation. (GitHub)

It gets more exposed because once those concerns are explicit, the hard problems move away from prompt engineering and into runtime engineering. That means:

side-effect safety
retry semantics
idempotency
deterministic replay
checkpoint correctness
schema drift
capability versioning
trust and authorization boundaries

LangGraph’s durable-execution guidance is directly relevant here. It says that with checkpoints you can pause and resume workflows after interruptions or failures, but to make that work reliably, workflows should be deterministic and idempotent, and side effects or non-deterministic operations should be isolated inside tasks. It also explicitly recommends separating multiple side-effecting operations so they are not accidentally repeated on resume. (LangChain Docs)
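The resume problem is easy to show in miniature. The sketch below is not LangGraph's API and not ORCA's checkpoint code; `run_with_checkpoint` is a hypothetical helper that records each completed side-effecting step under its own ID, so a resumed run skips work that already happened.

```python
from typing import Callable, Dict, List, Tuple


def run_with_checkpoint(
    steps: List[Tuple[str, Callable[[], object]]],
    checkpoint: Dict[str, object],
) -> Dict[str, object]:
    """Run steps in order, skipping any step whose result is already checkpointed.

    Each side effect gets its own step ID, so a crash after step N means
    steps 1..N are never repeated when the run is resumed.
    """
    for step_id, action in steps:
        if step_id in checkpoint:
            continue  # completed before the interruption; do not repeat it
        checkpoint[step_id] = action()  # recorded only if the action succeeds
    return checkpoint
```

If "send email" and "charge card" were fused into one step, a failure in the second would replay the first on resume. Keeping them as separate steps is the isolation that guidance is asking for.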

That point lands directly on ORCA’s current implementation.

Your runtime already has much of the “hard systems” machinery:

checkpoint/restore in the runtime surface
retry-aware OpenAPI invocation with transient status handling, max retries, and backoff
step-level retry and routing
webhook delivery and observability surfaces exposed in the architecture
binding resolution and fallback chains. (GitHub)

But the current capability contract still appears less mature than the execution runtime. In runtime/models.py, capability semantics are still represented through generic properties and safety dict-style fields rather than a strongly typed set of mandatory operational declarations. In practice, that means the runtime already knows how to retry, pause, resume, gate, and route, but it does not yet force every capability to declare enough truth for those behaviors to be fully principled. (GitHub)

That is the biggest implementation-based risk in real systems.

What will go well

This abstraction should behave very well in environments that are:

repetitive enough to justify formalization
tool-rich
side-effecting
policy-sensitive
auditable
long-running or interruptible

Those are the exact cases where hidden prompt logic becomes fragile and expensive, and where ORCA’s explicit execution model becomes valuable. Anthropic’s production-oriented framing supports that: workflows are especially useful for well-defined tasks where predictability matters. (Anthropic)

What will strain first

The first things that will strain are not the scheduler or the DAG model. They are:

1. Side-effect semantics. If a capability mutates external state, the runtime needs to know whether retries are safe and whether resume can repeat the action. LangGraph’s documentation makes that requirement explicit for durable execution. (LangChain Docs)

2. Trust and consent boundaries. MCP is explicit that tools are arbitrary code-execution surfaces, tool descriptions should be treated cautiously unless trusted, and hosts must obtain explicit user consent before data exposure or tool invocation. Since ORCA sits above tools and exposes MCP, it inherits part of that responsibility at the capability layer. (Model Context Protocol)

3. Evaluation maturity. OpenAI’s current agent-evals guidance says to start with traces when debugging behavior and to use trace grading to score workflow-level issues such as tool choice, handoffs, routing, and safety violations. ORCA is well positioned for that because it already structures runs into explicit steps and state. But the value of that structure will be much higher if each capability declares its operational semantics more explicitly. (OpenAI Developers)
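To show how these three strain points meet in a run trace, here is a minimal grading sketch. Everything in it is an assumption on my part: `TraceStep`, its fields, and `grade_trace` are hypothetical, and this is not OpenAI's trace-grading tooling or ORCA's audit format. It only illustrates that once steps carry declared semantics, violations become mechanical to detect.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TraceStep:
    """Hypothetical trace record for one capability invocation."""
    capability: str
    attempt: int            # 1 for the first try, 2 or more for retries
    side_effecting: bool    # declared: mutates external state
    idempotent: bool        # declared: safe to repeat
    consent_granted: bool   # recorded: user approved this invocation


def grade_trace(trace: List[TraceStep]) -> List[str]:
    """Return human-readable findings for unsafe retries and missing consent."""
    findings: List[str] = []
    for step in trace:
        if step.side_effecting and step.attempt > 1 and not step.idempotent:
            findings.append(f"{step.capability}: non-idempotent side effect was retried")
        if step.side_effecting and not step.consent_granted:
            findings.append(f"{step.capability}: side effect ran without consent")
    return findings
```

The grader is trivial on purpose. The hard part is getting `side_effecting` and `idempotent` declared truthfully per capability, which is the contract-layer gap described above.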

The most important next step for real-world readiness

Based on the current implementation, the most important next step is not “more architecture.” It is stronger capability-level operational metadata.

I would make these mandatory in the contract layer:

side_effect_class
idempotency
retry_policy
cache_policy
confirmation_policy
compensation_strategy
sensitivity_class
execution_class
cost_latency_class

Why these? Because the runtime is already advanced enough to use them. You already have retry loops, step controls, scaffolding, validation, fallback resolution, and checkpoint-oriented execution behavior. The missing piece is turning operational truth from scattered conventions into enforced per-capability declarations. (GitHub)
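One possible shape for that, as a sketch: a frozen dataclass where every field is required, so a capability that omits a declaration fails at construction time instead of at retry time. The field names follow the list above, but the types, the enum values, and the `safe_to_retry` rule are my assumptions, not something read out of runtime/models.py.

```python
from dataclasses import dataclass
from enum import Enum


class SideEffectClass(Enum):
    """Assumed three-level classification; the real taxonomy may differ."""
    NONE = "none"
    READ = "read"
    WRITE = "write"


@dataclass(frozen=True)
class OperationalMetadata:
    """Hypothetical mandatory declarations; no field has a default."""
    side_effect_class: SideEffectClass
    idempotency: bool
    retry_policy: int  # assumed encoding: maximum number of retries
    cache_policy: str
    confirmation_policy: str
    compensation_strategy: str
    sensitivity_class: str
    execution_class: str
    cost_latency_class: str


def safe_to_retry(meta: OperationalMetadata) -> bool:
    """Allow retries only if permitted and the action is a non-write or idempotent."""
    if meta.retry_policy < 1:
        return False
    is_write = meta.side_effect_class is SideEffectClass.WRITE
    return (not is_write) or meta.idempotency
```

With this in place, the retry loop asks `safe_to_retry` instead of relying on convention, and constructing a contract with any field missing raises `TypeError` before the capability can be registered.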

Bottom line

So, based on the current implementation:

1. Capability granularity

Go as fine as the smallest governable, reusable, evaluable execution unit. Stop well before micro-reasoning granularity. Your existing step-control machinery already gives you better places than the capability layer to encode fine execution logic. (GitHub)

2. Declarative execution vs prompt pipelines

Yes, declarative execution can realistically replace prompt pipelines for structured workflow control. Your current runtime already demonstrates that. But it should be framed as replacing prompt-driven control flow, not all prompting. Planning and some leaf implementations should remain model-mediated. (GitHub)

3. Behavior in real-world systems

This abstraction should become more valuable as workflows become longer, riskier, and more operationally serious. The current implementation already has many of the right bones for that. The main remaining gap is that the runtime is ahead of the capability-contract semantics, especially around side effects, idempotency, retries, and trust boundaries. (GitHub)

The strongest single sentence I can give you is this:

The current codebase already makes ORCA look like a credible execution runtime; the next thing that will determine whether it scales cleanly is whether each capability tells the runtime enough operational truth to make retries, resumes, fallbacks, approvals, and policies reliable by construction. (GitHub)