Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigyinykik4kjvios3qqcpmo5u5germxxllffkwpwrp4wlf6eo3qai",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mlcitaenbxa2"
  },
  "path": "/t/self-improving-agents-via-scheduled-reflection-anthropics-dreaming-architecture/175837#post_1",
  "publishedAt": "2026-05-07T23:27:32.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Anthropic: New in Claude Managed Agents",
    "Anthropic Engineering: Decoupling the Brain from the Hands"
  ],
  "textContent": "On May 6, 2026, Anthropic shipped three new capabilities for Managed Agents: Dreaming (research preview), Outcomes (public beta), and multi-agent orchestration (public beta). This note focuses on the architectural implications of Dreaming and Outcomes as a coupled feedback mechanism, with attention to what distinguishes this approach from existing memory and evaluation patterns.\n\n* * *\n\n## The Core Problem: Cross-Session Blind Spots\n\nStandard agent architectures are stateless by design. Each session begins from a fixed system prompt plus whatever context is explicitly passed. The agent has no visibility into patterns that emerge across sessions unless that visibility is explicitly engineered.\n\nExisting approaches to persistent agent memory fall into a few categories:\n\n  * **Explicit tool-based writes** : the agent calls `memory.write()` when instructed\n\n  * **End-of-session summarization** : a summary is generated and prepended to future sessions\n\n  * **RAG over interaction history** : past sessions are embedded and retrieved at query time\n\n\n\n\nAll of these are reactive. The agent records what it’s told to record, or retrieves what’s explicitly queried. None of them surface emergent patterns across sessions without human-defined retrieval logic.\n\nDreaming addresses a different problem: **proactive pattern extraction from session history without explicit instruction**.\n\n* * *\n\n## Dreaming: Architecture\n\n### What it is\n\nDreaming is a scheduled background process that runs between sessions. The agent reviews past conversation transcripts, identifies recurring patterns, and writes learnings into its memory store. 
The original session data is not modified.\n\nThree pattern types are targeted:\n\n  * Recurring errors of the same type\n\n  * Approaches that consistently produced good outcomes\n\n  * Edge cases the agent systematically missed\n\n\n\n\n### Why cross-session visibility matters\n\nA single session cannot observe cross-session patterns. A support agent making the same classification error 12 times in a month has no mechanism to notice this — each session starts fresh. Dreaming surfaces exactly this class of signal.\n\nThis is structurally similar to the role of sleep-phase memory consolidation in biological systems: individual experiences are processed in isolation during acquisition, but pattern extraction and long-term storage happen in a separate, offline phase.\n\n### Autonomy modes\n\n\n    Automatic:\n      analysis → direct write to memory store\n      (no human in the loop)\n\n    Human Review:\n      analysis → proposed memory updates\n      → human approval\n      → apply on confirmation\n\n\n\nThe choice between modes is an architectural decision about acceptable autonomy level. Automatic is appropriate for well-bounded domains with predictable error patterns. Human Review is appropriate where unintended learning could have significant downstream consequences.\n\n### Memory as accumulated deployment context\n\nAn agent that has been running for three months and an agent deployed today with an identical prompt are different systems. The former has three months of self-curated experience in its memory store. This is not model fine-tuning — it’s dynamically updated context specific to a particular deployment instance.\n\nThe implication: the competitive advantage of an agent is no longer solely its prompt or its base model. It’s the history of what it has learned from its own operation.\n\n* * *\n\n## Outcomes: Isolated Evaluation\n\n### The signal problem\n\nDreaming requires a quality signal. 
Without it, the agent cannot distinguish good outputs from bad when analyzing session history. Outcomes provides this signal through an **isolated evaluator**.\n\n### Architecture\n\n**Success rubric** : defined by the developer. Can include objective criteria (file structure, required fields, format compliance) or subjective criteria (editorial voice, brand consistency, writing style).\n\n**Isolated evaluator** : a separate Claude instance running in its own context window, isolated from the primary agent’s reasoning chain. This isolation is architecturally significant: the evaluator has no access to the agent’s chain-of-thought, preventing rationalization bias in evaluation.\n\n**Iteration loop** :\n\n\n    agent generates output\n        ↓\n    evaluator checks against rubric\n        ↓\n    pass → done\n    fail → evaluator identifies what needs to change\n        ↓\n    agent iterates\n        ↓\n    repeat until pass or max iterations\n\n\n\n### Performance numbers\n\nFrom Anthropic’s internal testing:\n\n  * Task success rates: **+10 percentage points** over standard prompting\n\n  * Structured file generation: .docx **+8.4%** , .pptx **+10.1%**\n\n  * Applicable to subjective quality dimensions: editorial voice, style, brand consistency\n\n\n\n\nThese numbers are from Anthropic’s internal benchmarks. Results on specific tasks will depend heavily on rubric quality and task characteristics.\n\n### The Dreaming + Outcomes coupling\n\n\n    Outcomes → identifies failures (what didn't work)\n    Dreaming → remembers failure patterns (why it didn't work)\n\n\n\nTogether they close the feedback loop without human intervention at each cycle. Outcomes is the exam; Dreaming is the error notebook. 
The combination enables a self-improvement loop that operates autonomously between sessions.\n\n* * *\n\n## Multi-Agent Orchestration: Topology\n\n### Structure\n\n\n    Coordinator agent (1 instance)\n        ├── Subagent 1 (independent context window)\n        ├── Subagent 2 (independent context window)\n        ├── ...\n        └── Subagent N (up to 20, shared filesystem)\n\n\n\n### Key constraints\n\n  * **Orchestration depth: 1 level.** Sub-subagents are not supported. This is a deliberate constraint that simplifies tracing and debugging.\n\n  * **Claude models only.** Orchestration, Dreaming, and Outcomes grading all run on Claude. Cross-provider routing is not supported at this layer.\n\n  * **Shared filesystem** as the coordination mechanism between subagents.\n\n  * **Full trace visibility** in Claude Console.\n\n  * Coordinator can send follow-up messages mid-workflow; subagents retain context between exchanges.\n\n\n\n\n### Infrastructure ownership\n\nAnthropic handles process management, failure recovery, context synchronization, and timeout handling. The developer defines what each agent does; Anthropic manages how it runs.\n\n### Reported results\n\n  * Harvey (legal AI): task completion rates increased approximately 6x\n\n  * Wisedocs (document verification): review speed improved 50% while maintaining quality standards\n\n  * Netflix: parallel batch analysis across hundreds of build logs\n\n  * Spiral by Every: Haiku coordinator + Opus writing subagents + Outcomes grader scoring against editorial principles\n\n\n\n\n* * *\n\n## The Complete Self-Improvement Loop\n\nThe three capabilities compose into a closed loop:\n\n\n    Task decomposition (orchestration)\n        ↓\n    Execution\n        ↓\n    Output evaluation (Outcomes)\n        ↓\n    Cross-session pattern extraction (Dreaming)\n        ↓\n    Applied to future sessions\n\n\n\nThis is the architectural shift from **stateless tool** to **accumulating system**. 
Each component addresses a distinct layer:\n\nLayer | Component | Function\n---|---|---\nExecution | Multi-agent orchestration | Parallel task decomposition and delegation\nEvaluation | Outcomes | Isolated quality grading against developer rubrics\nReflection | Dreaming | Scheduled cross-session pattern extraction\nNotification | Webhooks | Push notifications on task completion\n\n* * *\n\n## Limitations and Open Questions\n\n### Claude-only constraint\n\nAll components (orchestration, Dreaming, Outcomes grading) run exclusively on Claude models. Systems requiring model diversity for cost optimization, specialized capabilities, or latency requirements need to solve that routing layer separately.\n\n### Dreaming is research preview\n\nNot GA. Production integration planning should account for potential API changes.\n\n### Orchestration depth limit\n\nSub-subagents are not currently supported. This rules out complex hierarchical task decomposition that requires more than one level of delegation.\n\n### Autonomous memory update risks\n\nIn Automatic mode, the agent can learn in unintended directions. Human Review mode exists as a mitigation, but at scale, human review becomes a bottleneck.\n\n### Open questions not yet addressed in public documentation\n\n  1. **Dreaming schedule frequency** : configurable or fixed?\n\n  2. **History window** : how many past sessions are analyzed per cycle?\n\n  3. **Memory conflict resolution** : how are contradictions between new and existing memory entries handled?\n\n  4. **Multi-tenant isolation** : if one agent serves multiple clients, how is memory isolated per client?\n\n\n\n\nThese questions become critical at production scale.\n\n* * *\n\n## Pricing\n\nStandard Claude API token rates + **$0.08 per active session hour**. Idle time is free. 
Dreaming, Outcomes, and Webhooks carry no additional charges.\n\n* * *\n\n## Quick Reference\n\nFeature | Status | Function\n---|---|---\nDreaming | Research preview | Scheduled review of past sessions, pattern extraction, memory update\nOutcomes | Public beta | Automated output grading against developer-defined rubrics\nMulti-agent orchestration | Public beta | Coordinator + up to 20 parallel subagents, shared filesystem\nWebhooks | Public beta | Push notifications on agent task completion\nPricing | Live | $0.08/active session hour + standard token costs\n\n* * *\n\n_Sources:_\n\n  * Anthropic: New in Claude Managed Agents\n\n  * Anthropic Engineering: Decoupling the Brain from the Hands\n\n\n\n\n* * *\n\n_Author: Jessie — works on multi-model agent integration infrastructure at EvoLink._",
  "title": "Self-Improving Agents via Scheduled Reflection: Anthropic's Dreaming Architecture"
}