Feature Workflow Updates: Driving Features on Autopilot and More
A few months ago I shipped the first version of a feature-workflow plugin for Claude Code. The model is file-driven: idea.md means it's in the backlog, idea.md + plan.md means it's in progress, idea.md + plan.md + shipped.md means it's done. A post-write hook regenerates a DASHBOARD.md from the directory tree. External reviewers (Gemini or Codex) gate the plan and the implementation via GitHub Actions.
That much works. The shape of how I use the plugin has shifted since then, and the schema and tooling have grown to match. The additions worth a follow-up post: state that's orthogonal to the file-presence lifecycle, search across the backlog without grep, three modes of per-feature review, an autopilot loop that drives a single feature from plan to ship without manual prompting, an epic dispatch path that walks a parent feature's children sequentially or in parallel waves with worktree-isolated subagents, and a reviewer-side check that catches drive-by static-analysis suppressions before they merge.
This post walks through what each of those does and how they fit together.
State and lifecycle
Lifecycle still derives from file presence:
| Files present | Lifecycle |
|---|---|
| idea.md only | backlog |
| idea.md + plan.md | in-progress |
| idea.md + plan.md + shipped.md | completed |
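The table above maps directly onto a small function. The following is a minimal sketch, not the plugin's actual code: the function name `lifecycle` and the `"invalid"` return value for unexpected file combinations are assumptions made for illustration.

```python
from pathlib import Path


def lifecycle(feature_dir: Path) -> str:
    """Derive the lifecycle stage from which marker files exist."""
    has_idea = (feature_dir / "idea.md").is_file()
    has_plan = (feature_dir / "plan.md").is_file()
    has_shipped = (feature_dir / "shipped.md").is_file()

    if not has_idea:
        # Assumption: a directory without idea.md is not a valid feature.
        return "invalid"
    if has_plan and has_shipped:
        return "completed"
    if has_plan:
        return "in-progress"
    if has_shipped:
        # Assumption: shipped.md without plan.md is reported, not guessed at.
        return "invalid"
    return "backlog"
```

Because nothing is stored, there is no lifecycle field that can drift away from what is on disk.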
State is a separate axis. It's a field in idea.md frontmatter, orthogonal to the lifecycle:
state: active # default
state: paused
state: replaced # tombstoned, dropped from active backlog
state: abandoned # tombstoned, dropped from active backlog
A paused feature can be at any lifecycle stage — backlogged, mid-implementation, or even shipped (the post-ship paused state is rare but useful for "the feature shipped, the rollout is on hold"). Tombstoned states (replaced, abandoned) move the feature into the dashboard's Archive section — a collapsed <details> block, out of sight but still on disk for the audit trail.
Non-active states require companion fields:
| State | Required companion |
|---|---|
| paused | pausedReason: "Waiting on vendor" |
| replaced | replacedBy: <referrer>, written by the post-write hook from replaces: on the new feature |
| abandoned | abandonedReason: "Out of scope" |
The /feature-state skill handles transitions and enforces the companion fields. State changes are reversible — nothing in the schema prevents a replaced feature from going back to active if a decision changes.
The replacement relationship is bidirectional. The new feature declares replaces: [old-a, old-b] in its frontmatter; the post-write hook scans every idea.md, finds replaces: references, and writes state: replaced plus replacedBy: <referrer> to each referenced target. One forward-direction declaration on the new feature gets the tombstone-and-backlink wiring on every old feature for free. The sync is idempotent and silently skips missing targets (the dashboard surfaces those separately as validation warnings).
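A rough sketch of that sync, assuming the frontmatter of every idea.md has already been parsed into a dict keyed by feature id. The function name `sync_replaces` and the in-memory representation are hypothetical; the real hook reads and writes files.

```python
def sync_replaces(features: dict[str, dict]) -> list[str]:
    """Mirror each `replaces:` list onto its targets.

    Returns the ids of referenced targets that do not exist, so a caller
    can report them as validation warnings. The sync itself never raises.
    """
    missing: list[str] = []
    for new_id, meta in features.items():
        for old_id in meta.get("replaces", []):
            target = features.get(old_id)
            if target is None:
                # Missing target: skip it and keep going.
                missing.append(old_id)
                continue
            # Writing the same values again changes nothing, which is
            # what makes a second run a no-op.
            target["state"] = "replaced"
            target["replacedBy"] = new_id
    return missing
```

Running it twice over the same input leaves the data unchanged, which is the idempotence property described above.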
The dashboard's Validation Warnings section surfaces broken references (missing replaces:, dependsOn:, children: targets) and unknown frontmatter keys whenever the hook regenerates the dashboard.
Search across the backlog
/feature-search is a skill that scans docs/features/*/idea.md on every invocation. No index, no cache — just walking the directory. The filters:
/feature-search --state paused
/feature-search --assignee court
/feature-search --epic auth-overhaul
/feature-search --depends-on user-roles
/feature-search --archive # include replaced + abandoned
The use cases that drove this: "what's blocked," "what's court working on," "what's in the auth-overhaul cluster," "what's been retired in the last month." All of those were grep jobs before. Search becomes most useful at a backlog size where the dashboard's tables stop fitting on a screen — anywhere north of 50-ish features.
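The filtering itself can be sketched in a few lines. This is an illustration under stated assumptions, not the skill's implementation: frontmatter is assumed to be parsed into dicts already, filters are assumed to combine with AND, and tombstoned features are assumed to be hidden unless the archive flag is set. The names `search` and `as_list` are hypothetical.

```python
TOMBSTONED = {"replaced", "abandoned"}


def as_list(value) -> list[str]:
    """Normalize a frontmatter value that may be missing, a string, or a list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def search(features: dict[str, dict], *, state=None, assignee=None,
           epic=None, depends_on=None, archive=False) -> list[str]:
    """Return the sorted ids of features matching every supplied filter."""
    hits = []
    for feature_id, meta in features.items():
        current = meta.get("state", "active")  # active is the default state
        if current in TOMBSTONED and not archive:
            continue
        if state is not None and current != state:
            continue
        # assignee may be a single name or a list of names
        if assignee is not None and assignee not in as_list(meta.get("assignee")):
            continue
        if epic is not None and meta.get("epic") != epic:
            continue
        if depends_on is not None and depends_on not in as_list(meta.get("dependsOn")):
            continue
        hits.append(feature_id)
    return sorted(hits)
```

A full walk on every call is linear in the number of features, which stays cheap at backlog sizes in the tens or hundreds.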
Assignee is one frontmatter field, single name or list:
assignee: court # single
assignee: [court, alex] # joint ownership
The dashboard gains an Assignee column on the In Progress and Paused tables, which makes the dashboard usable as a standup surface when more than one person is shipping through the plugin.
Dependencies and relationships
The schema supports four kinds of cross-feature relationships, each handled by the post-write hook for sync and validation:
dependsOn: [a, b] # hard blocker — this can't start until a, b ship
relatedTo: [c, d] # soft link, no blocking
replaces: [e, f] # forward direction — see State and lifecycle
epic: parent-feature-id # see Epic dispatch
dependsOn: is the only one that blocks. The dashboard's "blocked by" column is computed dynamically from the graph at regen time rather than stored — dependsOn: is the only persisted edge, and the reverse direction falls out as a computed view. An earlier schema stored both directions and required them to match; dropping the stored blockedBy: field removed a class of "the two halves drifted out of sync" bug.
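As an illustration of deriving the view rather than storing it, here is a sketch that builds both directions from dependsOn: alone. The function name `dependency_views`, the `shipped` set argument, and the choice to return both a "blocked by" and a "blocks" mapping are assumptions; the real dashboard script may shape this differently.

```python
def dependency_views(features: dict[str, dict], shipped: set[str]):
    """Compute dependency views from the persisted dependsOn edges.

    Returns (blocked_by, blocks):
      blocked_by maps a feature to its dependencies that have not shipped,
      blocks maps a dependency to the features still waiting on it.
    """
    blocked_by: dict[str, list[str]] = {}
    blocks: dict[str, list[str]] = {}
    for feature_id, meta in features.items():
        for dep in meta.get("dependsOn", []):
            if dep in shipped:
                continue  # a shipped dependency no longer blocks anything
            blocked_by.setdefault(feature_id, []).append(dep)
            blocks.setdefault(dep, []).append(feature_id)
    return blocked_by, blocks
```

Both mappings are rebuilt from the same edges on every run, so they cannot disagree with each other.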
A parallelSafe: true | false field (default true) tells the autopilot whether a given feature can run alongside its epic-siblings in the same wave — see Epic dispatch.
Validation warnings cover missing references on either side of any of these relationships, cycles, and unknown frontmatter keys.
Per-feature review
The plugin defaults to external review through GitHub Actions. /feature-review-plan and /feature-review-impl open or update a draft PR, apply a label (plan-review or impl-review), and the workflow fires on the label change, runs the reviewer with a prompt from templates/, and posts the result as a structured PR comment.
For a small change — a typo fix, a one-line bug fix, a doc tweak — a full CI round trip is overkill but you still want a structured review on the PR record for the audit trail. The schema supports a per-feature override in idea.md frontmatter:
review: external # use the project's CI reviewer (default)
review: internal # run an in-session subagent reviewer, post the result
review: skip # no review at all (rare; doc-only changes)
Internal review dispatches a general-purpose subagent with the same prompt the CI reviewer uses. templates/review-prompt-plan.md and templates/review-prompt-impl.md are the single source of truth, read by both paths. The subagent captures the markdown output and posts it to the PR as a comment formatted exactly like the CI version. The autopilot's verdict classifier, the respond loop, and every other downstream consumer can't tell which path ran.
Precedence is: feature override → project default → SKIP.
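That precedence chain reads naturally as a fall-through lookup. A minimal sketch, assuming the project default arrives as an optional string; the function name `resolve_review_mode` and the decision to reject unknown values with an exception are hypothetical.

```python
from typing import Optional

VALID_MODES = {"external", "internal", "skip"}


def resolve_review_mode(feature_meta: dict, project_default: Optional[str]) -> str:
    """Pick the review mode: feature override, then project default, then skip."""
    for candidate in (feature_meta.get("review"), project_default):
        if candidate is None:
            continue
        if candidate not in VALID_MODES:
            # Assumption: an unknown value is an error rather than a silent skip.
            raise ValueError(f"unknown review mode: {candidate!r}")
        return candidate
    return "skip"
```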
Verdict semantics
Reviewers (CI or internal) produce one of three verdicts:
| Verdict | Findings allowed | Re-review? |
|---|---|---|
| PASS | None worth mentioning | No |
| CONDITIONAL PASS | Recommendations only (zero Blocking) | No — implementer addresses during impl |
| FAIL | At least one Blocking | Yes |
A calibration paragraph in both review prompts makes the verdict definitions binding:
Calibration rules (read before picking a verdict):
- Blocking findings belong ONLY in ### Critical Findings under FAIL. If you write a Blocking finding, the verdict MUST be FAIL.
- CONDITIONAL PASS is for plans you would be willing to merge without seeing a revision. If you would not be willing, the verdict is FAIL.
- Should-fix items go in ### Recommendations under any verdict.
- There is no in-between. "I want them to fix this but I don't need to see it again" = CONDITIONAL PASS with a Recommendation. "I want them to fix this and show me the fix" = FAIL with a Critical Finding.
Only FAIL triggers another round of review. CONDITIONAL PASS is one-and-done with an inline TODO list — the implementer is expected to address the recommendations during the next phase. PASS is shippable as-is.
The verdict semantics are locked in the prompt rather than enforced by the consumer for a reason. If a review comment is labeled CONDITIONAL PASS but lists Blocking findings, the consumer has no way to tell which half to trust, and it can't repair the contradiction. The producer has to be constrained from emitting the inconsistent output in the first place.
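A consumer can still detect a contradiction even though it cannot resolve one. The sketch below assumes the verdict and the number of Blocking findings have already been parsed out of the review comment; the function name `needs_rereview` and the use of exceptions are illustrative choices, not the plugin's classifier.

```python
VERDICTS = ("PASS", "CONDITIONAL PASS", "FAIL")


def needs_rereview(verdict: str, blocking: int) -> bool:
    """Return True when another review round is required.

    Raises ValueError when the verdict contradicts the findings, because
    the consumer has no basis for choosing which half to believe.
    """
    if verdict not in VERDICTS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    if verdict == "FAIL":
        if blocking == 0:
            raise ValueError("FAIL requires at least one Blocking finding")
        return True
    if blocking > 0:
        raise ValueError(f"{verdict} cannot carry Blocking findings")
    # PASS and CONDITIONAL PASS are both one-and-done.
    return False
```

The error branches are the cases the calibration paragraph is meant to make unreachable.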
Autopilot
/feature-autopilot <feature-id> drives a single feature from current state through to ship. The sequence:
- Pre-flight check. Compare local base branch to origin/<base>. If ahead, behind, or diverged, refuse to start and tell the user what to fix. This prevents the autopilot from picking up unrelated commits when it branches off main.
- Plan. If no plan.md, dispatch the planning agents for the feature's archetype (backend / frontend / full-stack). Write plan.md.
- Plan review. Open a draft PR if one isn't open. Apply the plan-review label. Wait for the reviewer's verdict comment to appear on the PR.
- Plan respond loop. On FAIL, dispatch the respond subagent to address the Critical Findings, then re-trigger review. Continue until PASS or CONDITIONAL PASS.
- Implement. Work through plan.md step by step. Commit per task. Push.
- Impl review. Swap the label from plan-review to impl-review. Wait for the verdict.
- Impl respond loop. Same shape as the plan respond loop, against impl review findings.
- Ship. Write shipped.md, remove the active review label, merge the PR.
Each step is resumable. State lives entirely on disk — the feature directory's files plus the PR's labels and review comments. If the autopilot is interrupted partway through (network failure, user pause, a BLOCKED finding the autopilot can't resolve on its own), re-running it reads the current state and resumes from the right step.
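One way to picture the resume logic is as a pure function from observed state to the next step. Everything here is a hypothetical model: the `FeatureState` fields (including `impl_pushed`) and the step names are invented for illustration, and the real autopilot reads this information from the feature directory and the PR rather than from a dataclass.

```python
from dataclasses import dataclass
from typing import Optional

ACCEPTED = {"PASS", "CONDITIONAL PASS"}


@dataclass
class FeatureState:
    has_plan: bool = False               # plan.md exists
    has_shipped: bool = False            # shipped.md exists
    plan_verdict: Optional[str] = None   # latest plan-review verdict on the PR
    impl_pushed: bool = False            # implementation commits are on the PR
    impl_verdict: Optional[str] = None   # latest impl-review verdict on the PR


def next_step(state: FeatureState) -> str:
    """Return the step to resume from, given only what can be observed."""
    if state.has_shipped:
        return "done"
    if not state.has_plan:
        return "plan"
    if state.plan_verdict is None:
        return "plan-review"
    if state.plan_verdict not in ACCEPTED:
        return "plan-respond"
    if not state.impl_pushed:
        return "implement"
    if state.impl_verdict is None:
        return "impl-review"
    if state.impl_verdict not in ACCEPTED:
        return "impl-respond"
    return "ship"
```

Since the function keeps no memory of its own, an interrupted run and a fresh run arrive at the same answer.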
Three operational details that don't fit into the linear sequence but matter:
Concurrency cancel-in-progress. Both review workflow YAMLs set concurrency: group: feature-review-${{ pr_number }}, cancel-in-progress: true. Only one workflow runs per PR at a time. Without this, a label-swap event followed quickly by a synchronize event would fire two reviews on the same PR and produce interleaved comments.
Label-removal-after-PASS. After a PASS or CONDITIONAL PASS verdict, the autopilot's first action is to remove the active review label — before the next phase pushes any code. Pushing impl code with the plan-review label still on the PR would re-fire the plan-review workflow against the new impl diff and produce a different (often nonsense) verdict.
Two-step label swaps. Going from plan-review to impl-review is two separate gh pr edit calls with sleep 3 between them. A combined add+remove in one call occasionally left both labels visible during the workflow's if evaluation, which is enough time for the wrong workflow to fire.
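The two-step swap could look roughly like this. It is a sketch under assumptions: the function name `swap_review_label` is hypothetical, and removing the old label before adding the new one is my reading of the ordering, which the text above does not spell out. The `--remove-label` and `--add-label` flags are standard `gh pr edit` options. The `run` and `sleep` parameters exist only so the function can be exercised without touching GitHub.

```python
import subprocess
import time


def swap_review_label(pr: str, old: str, new: str, *,
                      run=subprocess.run, sleep=time.sleep,
                      pause_seconds: float = 3.0) -> None:
    """Swap review labels as two separate edits with a pause in between."""
    # Step 1: drop the old label so its workflow can no longer match.
    run(["gh", "pr", "edit", pr, "--remove-label", old], check=True)
    # Step 2: give GitHub time to settle before the next label event.
    sleep(pause_seconds)
    # Step 3: add the new label, which fires the next review workflow.
    run(["gh", "pr", "edit", pr, "--add-label", new], check=True)
```

With `check=True`, a failed removal raises before the new label is ever added, so the PR never ends up carrying both.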
The autopilot never uses --no-verify on commits, even when a pre-commit hook fails. A hook failure surfaces as a structured "DONE_WITH_CONCERNS" finding the autopilot includes in the review comment — same shape as any other concern — rather than being silently bypassed.
Epic dispatch
A feature with type: Epic and a children: [a, b, c] array becomes an umbrella. Each child references the parent via epic: <epic-id> in its own frontmatter. The post-write hook syncs both directions — write epic: on a child and the parent's children: updates; remove a child from children: and the orphan's epic: clears. Same shape as the replacement-relationship sync.
/feature-autopilot <epic-id> routes to the epic-dispatch workflow rather than the single-feature linear loop. The dispatcher:
- Compute dispatch waves. A pure helper, compute_dispatch_waves(epic_id, features), runs a topo-sort over the epic's children where dependencies are restricted to siblings within the same epic. Returns list[list[str]]: each inner list is a wave of children that can run in parallel because none of its members depend on another wave member. Already-shipped, tombstoned, and paused children are filtered out before the sort.
- Dispatch. Sequential by default: one child at a time, in the orchestrator's tree. --parallel opts into concurrent waves with a cap of 3 simultaneous subagents.
- Wait for the wave to complete before advancing to the next.
Every subagent dispatch uses isolation: "worktree" from the Agent tool. This is a mandatory invariant, not a config knob. The cost is the per-subagent worktree setup time (git worktree add plus any per-project setup like venv install). The benefit is that an entire class of bug — branch-switch clobber when two subagents share a working tree, PR-identity confusion when two subagents push the same feature branch under different SHAs, partial state contaminating the next subagent's run — stops being possible.
Resumability falls out for free. Each child's lifecycle lives on disk. If the dispatcher is interrupted, re-running /feature-autopilot <epic-id> reads the current state and resumes from the first non-shipped child.
A diamond dependency example, where a blocks b and c, and b + c together block d:
epic = Epic(children=["a", "b", "c", "d"])
b.depends_on = ["a"]
c.depends_on = ["a"]
d.depends_on = ["b", "c"]
compute_dispatch_waves("epic", features)
# → [["a"], ["b", "c"], ["d"]]
Sequential mode runs a, then b, then c, then d — four children one after another. Parallel mode runs a alone (first wave), then b and c concurrently in their own worktrees (second wave), then d alone (third wave). The diamond shape collapses from four sequential dispatches into three waves.
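For readers who want to run the diamond example, here is one possible shape for the helper, written as a level-by-level topological sort. It is a sketch, not the plugin's code. Features are assumed to be dicts of parsed frontmatter, the `shipped` key is a hypothetical stand-in for "shipped.md exists", and edges to filtered-out siblings are simply dropped, which means a child depending on a paused sibling would still be scheduled here. The parallelSafe: field is not modelled.

```python
SKIPPED_STATES = {"paused", "replaced", "abandoned"}


def compute_dispatch_waves(epic_id: str, features: dict[str, dict]) -> list[list[str]]:
    """Group an epic's pending children into waves that can run in parallel."""
    children = features[epic_id].get("children", [])
    pending = [
        child for child in children
        if not features[child].get("shipped", False)
        and features[child].get("state", "active") not in SKIPPED_STATES
    ]
    pending_set = set(pending)
    # Only edges between still-pending siblings constrain the ordering.
    deps = {
        child: {d for d in features[child].get("dependsOn", []) if d in pending_set}
        for child in pending
    }

    waves: list[list[str]] = []
    done: set[str] = set()
    while len(done) < len(pending):
        # A child is ready once every sibling it depends on is in an earlier wave.
        wave = [c for c in pending if c not in done and deps[c] <= done]
        if not wave:
            raise ValueError("dependency cycle among epic children")
        waves.append(wave)
        done.update(wave)
    return waves


features = {
    "epic": {"type": "Epic", "children": ["a", "b", "c", "d"]},
    "a": {},
    "b": {"dependsOn": ["a"]},
    "c": {"dependsOn": ["a"]},
    "d": {"dependsOn": ["b", "c"]},
}
print(compute_dispatch_waves("epic", features))
# → [['a'], ['b', 'c'], ['d']]
```

Re-running after a has shipped yields two waves instead of three, which is the same mechanism that makes an interrupted epic resumable.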
The dashboard surfaces a validation warning on any feature with type: Epic and an empty children: array — an epic with no children is non-functional, and you'd want to catch the gap before invoking /feature-autopilot on the epic.
Suppression discipline
Static-analysis gates (fallow, skylos, ruff, prettier, eslint, mypy) are typically configured to fail on findings introduced by a commit's diff. They're easy for an agent to satisfy honestly: refactor the offending code, write a real fix, push. They're also easy for an agent to satisfy dishonestly: drop a // fallow-ignore-next-line complexity, # noqa, or # type: ignore above the offending function and ship. Both produce zero new findings. Both pass the gate. Under an "optimize for getting this review cycle green" objective, the suppression is cheaper than the refactor every time.
The impl-review prompt now checks for this. The reviewer scans the diff for newly-added suppression directives across the common forms:
// fallow-ignore-* (fallow)
# skylos: ignore SKY-* (skylos)
# noqa, # noqa: <rule> (ruff, flake8)
# type: ignore (mypy)
# pylint: disable (pylint)
// eslint-disable* (eslint)
// @ts-ignore (TypeScript)
For each newly-added suppression, the reviewer checks three things:
- Is there an adjacent # Why: or // Why: justification? A suppression without a documented reason is a Blocking finding.
- Is the underlying code refactorable? If the complexity finding could be resolved by extracting a helper with a single dispatch, the suppression is a Blocking finding regardless of the justification.
- Is the cap respected? More than 2 new suppressions in a single PR is a Blocking finding by itself: the change is doing too much, or a refactor pass got skipped.
Drive-by suppressions land on the FAIL side of the verdict matrix, which triggers the respond loop. The respond loop can't be satisfied by adding more suppressions, because that's exactly what the reviewer is checking for. The agent either refactors the underlying code or surfaces a real justification.
Legitimate suppressions still pass — false positives, parameterized SQL that the linter can't statically prove is safe, deliberately-consolidated state classes where splitting would scatter mutations across five files. The rule is "no drive-by silencing," not "no suppressions at all."
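The mechanical parts of that check (the justification and the cap) can be approximated with a regex pass over a unified diff. The actual check is a prompt given to a reviewer model, so this is only an illustration. The name `blocking_findings` is hypothetical, "adjacent" is interpreted as the same line or the immediately preceding added line, and the refactorability question is left out because it needs judgment.

```python
import re

SUPPRESSION = re.compile(
    r"//\s*fallow-ignore"
    r"|#\s*skylos:\s*ignore"
    r"|#\s*noqa\b"
    r"|#\s*type:\s*ignore"
    r"|#\s*pylint:\s*disable"
    r"|//\s*eslint-disable"
    r"|//\s*@ts-ignore"
)
WHY = re.compile(r"(#|//)\s*Why:")
MAX_NEW_SUPPRESSIONS = 2


def blocking_findings(diff: str) -> list[str]:
    """List Blocking findings for suppressions newly added in a unified diff."""
    # Keep added lines only, dropping the "+++ b/file" header lines.
    added = [
        line[1:] for line in diff.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]
    findings: list[str] = []
    count = 0
    for i, line in enumerate(added):
        if not SUPPRESSION.search(line):
            continue
        count += 1
        previous = added[i - 1] if i > 0 else ""
        if not (WHY.search(line) or WHY.search(previous)):
            findings.append(f"suppression without Why: justification: {line.strip()}")
    if count > MAX_NEW_SUPPRESSIONS:
        findings.append(f"{count} new suppressions exceed the cap of {MAX_NEW_SUPPRESSIONS}")
    return findings
```

A justified false-positive suppression passes this filter, which matches the "no drive-by silencing" framing rather than a blanket ban.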
The incident that drove this — a real session where the autopilot added a wall of fallow-ignore-next-line comments instead of refactoring, the wrong fixes I considered before landing on the prompt-level one, and the result after the rule landed (five features, 26 commits, zero new suppressions) — is in Fallow and Skylos: Static-Analysis Gates for AI-Generated Code.
Architecture
What hasn't changed: the architecture is still hand-rolled YAML frontmatter, one Python script for the dashboard, hooks that subprocess into helpers, and markdown skill files. No database. No server. No schema migrations.
The PreToolUse / PostToolUse hooks do four jobs:
- Permission check. Before a Write or Edit, confirm the file path is one the plugin owns and allows. Prevents an agent from clobbering a shipped.md mid-implementation or rewriting the dashboard by hand.
- Sync replaces. Walk every idea.md, find replaces: references, mirror them as state: replaced + replacedBy: on the targets.
- Sync epics. Same shape, in both directions, for epic: ↔ children:.
- Regenerate the dashboard. Re-read every idea.md, recompute lifecycle from file presence, recompute the "blocked by" view from the dependency graph, write DASHBOARD.md. Any cycles, missing references, or unknown frontmatter keys go to the Validation Warnings section.
All four are idempotent. Running them again produces the same output. Failure of any one doesn't corrupt anything — the next save retries the whole pipeline.
The reviewer prompts (templates/review-prompt-plan.md, templates/review-prompt-impl.md) are the single source of truth for verdict semantics, finding shape, and suppression checks. The CI workflow reads them and feeds them to Gemini or Codex. The internal-review subagent reads them and feeds them to a general-purpose Claude. Updating the prompt updates both paths.
The autopilot is a sequence of skill invocations and Bash commands, each driven by markdown files in skills/. State lives on disk in the feature directory and the PR's labels and comments. The whole thing is replayable from any interrupt point.
Source is at github.com/schuettc/claude-code-plugins.