ORCA: A Cognitive Runtime Layer for Agent Systems (paper + open source)
Since the last round of discussion, we ran a controlled experiment comparing single-prompt execution against ORCA’s multi-step skill orchestration on two tasks: structured decision-making and multi-step text processing. Each task ran on 10 inputs with the same model (gpt-4o-mini) and a fixed seed.
Here is the honest comparison:
| Dimension | Prompt-based | ORCA Structured |
|---|---|---|
| Latency | Lower (1 LLM call) | Higher (N sequential calls) |
| Traceability | None | Full step-level trace |
| Reusability | None | Full capability reuse |
| Maintainability | Low (monolithic) | High (declarative YAML) |
| Variability | Low | Low-moderate |
ORCA is not faster for simple one-off tasks. That’s not the point.
The point is what happens when you need to audit what your agent did, swap a backend without rewriting the workflow, reuse a step across 15 different skills, or resume a failed run from a checkpoint.
Prompt-based execution gives you none of that. Not because the prompt was bad, but because the architecture doesn’t support it.
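To make those four properties concrete, here is a minimal Python sketch of step-level orchestration. It is not ORCA's actual API: `Step`, `Run`, `execute`, and both backends are hypothetical names invented for illustration, and the backends are plain functions standing in for LLM calls. The sketch only assumes that a skill is an ordered list of named steps.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# All names here are hypothetical; this is an illustration, not ORCA's API.
Backend = Callable[[str], str]  # stands in for one LLM call


@dataclass(frozen=True)
class Step:
    name: str
    template: str  # prompt template with an {input} placeholder


@dataclass
class Run:
    outputs: Dict[str, str] = field(default_factory=dict)  # checkpoint: step -> output
    trace: List[dict] = field(default_factory=list)        # step-level audit log


def execute(steps: List[Step], backend: Backend, text: str,
            run: Optional[Run] = None) -> Run:
    """Run steps in order, recording a trace and skipping checkpointed steps."""
    if run is None:
        run = Run()
    current = text
    for step in steps:
        if step.name in run.outputs:
            # Already completed in an earlier attempt: reuse the stored output.
            current = run.outputs[step.name]
            run.trace.append({"step": step.name, "status": "resumed"})
            continue
        prompt = step.template.format(input=current)
        try:
            current = backend(prompt)
        except Exception as exc:
            # Record the failure and stop; earlier outputs stay checkpointed.
            run.trace.append({"step": step.name, "status": "failed",
                              "error": str(exc)})
            return run
        run.outputs[step.name] = current
        run.trace.append({"step": step.name, "status": "ok",
                          "prompt": prompt, "output": current})
    return run


# Steps are plain values, so several skills can share the same one.
EXTRACT = Step("extract", "extract: {input}")
SUMMARIZE = Step("summarize", "summarize: {input}")
skill_a = [EXTRACT, SUMMARIZE]
skill_b = [EXTRACT]  # reuses EXTRACT unchanged

_calls = {"n": 0}


def flaky_backend(prompt: str) -> str:
    """Stub backend that fails on its second call."""
    _calls["n"] += 1
    if _calls["n"] == 2:
        raise RuntimeError("backend timeout")
    return prompt.upper()


def stable_backend(prompt: str) -> str:
    """Stub backend that always succeeds."""
    return prompt.upper()


first = execute(skill_a, flaky_backend, "hello")               # fails at "summarize"
resumed = execute(skill_a, stable_backend, "hello", run=first)  # swap backend, resume
statuses = [entry["status"] for entry in resumed.trace]
```

In this sketch the first run fails at `summarize`, and the second run reuses the stored `extract` output and finishes on a different backend without touching the skill definition. `statuses` ends up as `["ok", "failed", "resumed", "ok"]`, which is the kind of record a single monolithic prompt has no place to produce.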
Full benchmark code and results are in the repo: run_benchmark.py