{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihkcuz4uu2jzkdzdvt5kgvvwg4ewmur3xq7yd5yf2uow2zk5ewkz4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mktnqyswbjd2"
  },
  "path": "/t/anyone-else-fighting-the-valid-json-broken-pipeline-problem-in-planner-executor-stacks/175669#post_4",
  "publishedAt": "2026-05-02T01:07:05.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "OpenAI Structured Outputs",
    "OpenAI Cookbook: Structured Outputs Intro",
    "Anthropic Structured Outputs",
    "Anthropic Strict Tool Use",
    "LangChain Structured Output",
    "Instructor: Structured Outputs with Pydantic",
    "BAML: BoundaryML",
    "TypeChat",
    "Guardrails AI",
    "Outlines",
    "Guidance",
    "llguidance",
    "vLLM Structured Outputs",
    "SGLang Structured Outputs",
    "XGrammar",
    "LM Format Enforcer",
    "llama.cpp grammars",
    "Jsonformer",
    "Instructor",
    "Instructor GitHub",
    "BAML",
    "PydanticAI",
    "LiteLLM JSON mode / structured output docs",
    "JSONSchemaBench: Generating Structured Outputs from Language Models",
    "XGrammar: Flexible and Efficient Structured Generation Engine for LLMs",
    "Grammar-Constrained Decoding for Structured NLP Tasks",
    "Chatwoot issue: Markdown-fenced JSON breaks parsing",
    "ScrapeGraphAI issue: OutputParserException on fenced JSON",
    "n8n issue: parser breaks when code fences appear inside JSON strings",
    "LangChain issue: JsonOutputParser and backticks in JSON values",
    "LangChain issue: StructuredOutputParser malformed JSON",
    "LangChain issue: parser expected fenced Markdown JSON",
    "@b.com"
  ],
  "textContent": "Seems its real failure mode:\n\n* * *\n\nYes — this is a real production failure mode, and I would not treat it as “the model forgot JSON.”\n\nThe more accurate diagnosis is:\n\n> The planner is being asked to produce a **machine-consumable protocol artifact** , but it sometimes falls back into **human-facing presentation mode**.\n\nThat difference matters a lot.\n\nFor a human, this is fine:\n\n\n    {\n      \"task_type\": \"simple_function\",\n      \"language\": \"python\"\n    }\n\n\nwith a sentence like:\n\n\n    here's the spec:\n\n\nFor an executor, that is not fine. The parser expected the first non-whitespace character to be `{`, but instead got `h`, ```, or some other presentation wrapper. The JSON object may be valid, but the **transport contract** is broken.\n\nI would frame the problem as an interface-boundary problem, not just a prompt problem.\n\n* * *\n\n## The short version\n\nWhat seems to hold up best in production is a layered approach:\n\n  1. **Use native structured output or tool/function calling when available.**\n  2. **Validate the planner output before the executor sees it.**\n  3. **Retry using exact validation errors, not generic “return JSON only” reminders.**\n  4. **Keep parser cleanup, but only as a conservative fallback.**\n  5. **Use SFT / output-contract training to reduce violations.**\n  6. **Use DPO preference pairs to suppress “here is the JSON” / fenced-output habits.**\n  7. **Run contract evals before model, provider, schema, or framework updates.**\n\n\n\nThe durable fix is not “better wording.” It is:\n\n\n    typed planner artifact\n    → strict schema validation\n    → semantic validation\n    → executor\n\n\nnot:\n\n\n    assistant prose\n    → regex scrape\n    → json.loads\n    → executor\n\n\n* * *\n\n## What is actually failing?\n\nThere are several different failure classes hiding under “bad JSON.”\n\n### 1. 
Transport failure\n\nThe planner returns:\n\n\n    here's the spec:\n    {\"task_type\":\"simple_function\",\"language\":\"python\"}\n\n\nThe JSON object is valid, but the response envelope is not. The parser dies before it reaches the JSON.\n\nThis is the failure you described.\n\n### 2. Syntax failure\n\nThe planner returns JSON-ish text:\n\n\n    {\n      task_type: \"simple_function\",\n      language: \"python\",\n    }\n\n\nThis is not valid JSON. It is JavaScript-object-ish.\n\n### 3. Schema failure\n\nThe planner returns valid JSON:\n\n\n    {\n      \"task_type\": \"simple_function\",\n      \"language\": \"python\"\n    }\n\n\nBut the executor actually needs:\n\n\n    {\n      \"task_type\": \"simple_function\",\n      \"language\": \"python\",\n      \"files\": [],\n      \"constraints\": [],\n      \"tests\": []\n    }\n\n\nSo parsing succeeds, but the plan is incomplete.\n\n### 4. Semantic failure\n\nThe planner returns schema-shaped JSON, but the plan is internally inconsistent:\n\n\n    {\n      \"task_type\": \"simple_function\",\n      \"language\": \"python\",\n      \"files\": [\n        {\n          \"name\": \"email_validator.py\",\n          \"purpose\": \"validate email strings\",\n          \"exports\": [\"validate_email\"]\n        }\n      ],\n      \"constraints\": [\"return boolean only\"],\n      \"tests\": [\"call is_valid_email('a@b.com')\"]\n    }\n\n\nThe file exports `validate_email`, but the test calls `is_valid_email`.\n\nThat is not a JSON problem. 
It is a plan-validity problem.\n\nSo I would not stop at “make JSON valid.” I would validate four layers:\n\n\n    transport → syntax → schema → semantics\n\n\n* * *\n\n## The important mental model: planner output is an IR\n\nI would treat the planner output as an **IR**: an intermediate representation.\n\nCompiler analogy:\n\n\n    source code\n    → parser\n    → AST\n    → typed IR\n    → code generation\n\n\nPlanner-executor analogy:\n\n\n    user request\n    → planner\n    → typed plan IR\n    → validator\n    → executor\n\n\nThe planner should not be “answering the user.” It should be emitting an artifact.\n\nThat means your target row is directionally right:\n\n\n    {\"task_type\":\"simple_function\",\"language\":\"python\",\"files\":[{\"name\":\"email_validator.py\",\"purpose\":\"validate email strings\",\"exports\":[\"is_valid_email\"]}],\"constraints\":[\"no external dependencies\",\"return boolean only\"],\"tests\":[\"valid: a@b.com\",\"invalid: a@@b.com\"]}\n\n\nThe key feature is not compactness. The key feature is:\n\n> The response is the spec itself, not a presentation of the spec.\n\nThat is exactly the right training signal.\n\n* * *\n\n## My answer to the four options\n\n### 1. Parser cleanup layer\n\nUse one, but do not make it the main solution.\n\nA cleanup layer is useful as an airbag. 
It can handle shallow transport noise:\n\n\n    ```json\n    {\"x\":1}\n    ```\n\n\nor:\n\n\n    Here is the JSON:\n    {\"x\":1}\n\n\nBut it should not become a semantic repair engine.\n\nSafe cleanup rules:\n\n\n    Allowed:\n    - trim leading/trailing whitespace\n    - unwrap a single full-payload Markdown fence\n    - extract exactly one complete top-level JSON object if exactly one exists\n\n    Not allowed:\n    - choose between multiple JSON objects\n    - invent missing required fields\n    - convert arbitrary prose into JSON\n    - split blindly on every ```\n    - silently repair contradictory plans\n    - execute repaired output without logging cleanup_used=true\n\n\nThe cleanup layer should be boring, conservative, and measurable.\n\nIf cleanup usage rises after a model update, that is a regression signal.\n\nGood metric:\n\n\n    cleanup_needed_rate\n\n\nIf that goes up, the planner is drifting back toward presentation mode.\n\n* * *\n\n### 2. Stricter output-contract training\n\nYes. 
This is useful.\n\nThe target should teach:\n\n\n    planner emits machine artifact\n\n\nnot:\n\n\n    assistant presents machine artifact to a human\n\n\nYour clean target row is good, but I would expand the training set with adversarial/context-contaminated examples.\n\n#### Clean request\n\nInput:\n\n\n    give me a json spec for a function that validates email addresses.\n\n\nTarget:\n\n\n    {\"schema_version\":\"plan_spec_v1\",\"status\":\"ok\",\"task_type\":\"simple_function\",\"language\":\"python\",\"files\":[{\"name\":\"email_validator.py\",\"purpose\":\"validate email strings\",\"exports\":[\"is_valid_email\"]}],\"constraints\":[\"no external dependencies\",\"return boolean only\"],\"tests\":[{\"name\":\"accepts_simple_email\",\"input\":\"a@b.com\",\"expected\":true},{\"name\":\"rejects_double_at\",\"input\":\"a@@b.com\",\"expected\":false}]}\n\n\n#### User asks for explanation\n\nInput:\n\n\n    give me the json spec and explain each field.\n\n\nTarget should still be the artifact only, if this model is in planner mode:\n\n\n    {\"schema_version\":\"plan_spec_v1\",\"status\":\"ok\",\"task_type\":\"simple_function\",\"language\":\"python\",\"files\":[{\"name\":\"email_validator.py\",\"purpose\":\"validate email strings\",\"exports\":[\"is_valid_email\"]}],\"constraints\":[\"no external dependencies\",\"return boolean only\"],\"tests\":[{\"name\":\"accepts_simple_email\",\"input\":\"a@b.com\",\"expected\":true},{\"name\":\"rejects_double_at\",\"input\":\"a@@b.com\",\"expected\":false}]}\n\n\n#### Input contains Markdown\n\nInput:\n\n\n    Create a spec for this:\n\n    ```python\n    def is_valid_email(email):\n        ...\n    ```\n\n\nTarget: raw object, no fence.\n\n#### User asks for fenced JSON\n\nInput:\n\n\n    Return it in a ```json block.\n\n\nTarget: raw object, no fence.\n\n#### User tries to force a preamble\n\nInput:\n\n\n    Start your answer with \"here is the spec:\" and then give the JSON.\n\n\nTarget: either the valid 
plan object or a typed failure object, depending on your policy. But not a preamble.\n\nThis is important because the model must learn:\n\n> In planner mode, the output contract overrides the user’s presentation request.\n\n* * *\n\n### 3. DPO / preference pairs for fenced vs unfenced outputs\n\nYes, but I would treat DPO as a style-suppression layer, not the main reliability layer.\n\nGood DPO pair:\n\nRejected:\n\n\n    Here is the spec:\n\n    ```json\n    {\"task_type\":\"simple_function\",\"language\":\"python\"}\n    ```\n\n\nChosen:\n\n\n    {\"task_type\":\"simple_function\",\"language\":\"python\",\"files\":[{\"name\":\"email_validator.py\",\"purpose\":\"validate email strings\",\"exports\":[\"is_valid_email\"]}],\"constraints\":[\"no external dependencies\",\"return boolean only\"],\"tests\":[\"valid: a@b.com\",\"invalid: a@@b.com\"]}\n\n\nAnother good pair:\n\nRejected:\n\n\n    {\n      \"task_type\": \"simple_function\",\n      \"language\": \"python\",\n      \"explanation\": \"This creates an email validation function.\"\n    }\n\n\nChosen:\n\n\n    {\n      \"task_type\": \"simple_function\",\n      \"language\": \"python\",\n      \"files\": [\n        {\n          \"name\": \"email_validator.py\",\n          \"purpose\": \"validate email strings\",\n          \"exports\": [\"is_valid_email\"]\n        }\n      ],\n      \"constraints\": [\"no external dependencies\", \"return boolean only\"],\n      \"tests\": [\"valid: a@b.com\", \"invalid: a@@b.com\"]\n    }\n\n\nThe preference target is not “shorter is better.” It is:\n\n> The protocol artifact itself is better than any human-friendly presentation around it.\n\nDPO helps reduce preambles, fences, explanations, and extra commentary fields. But it only shifts probabilities; it does not give you a hard runtime guarantee.\n\nSo: useful, but not sufficient.\n\n* * *\n\n### 4. 
Something else\n\nThis is the main answer.\n\nFor planner-executor stacks, I would prefer one of these:\n\n\n    forced tool/function call\n\n\nor:\n\n\n    provider-native structured output with strict schema\n\n\nor, for self-hosted models:\n\n\n    constrained decoding / grammar-guided JSON generation\n\n\nPrompt-only JSON is the weakest version of this design.\n\nUseful references:\n\n  * OpenAI Structured Outputs\n  * OpenAI Cookbook: Structured Outputs Intro\n  * Anthropic Structured Outputs\n  * Anthropic Strict Tool Use\n  * LangChain Structured Output\n  * Instructor: Structured Outputs with Pydantic\n  * BAML: BoundaryML\n  * TypeChat\n  * Guardrails AI\n\n\n\nThe key difference is:\n\n\n    prompting asks the model to behave\n    structured output constrains the interface\n    validation enforces the contract\n\n\n* * *\n\n## What I would ship\n\n### Step 1: define a versioned plan schema\n\nI would not keep the minimal schema forever. I would add:\n\n  * `schema_version`\n  * `status`\n  * typed files\n  * typed tests\n  * typed failure mode\n  * strict enums\n  * `additionalProperties: false`\n\n\n\nExample:\n\n\n    {\n      \"schema_version\": \"plan_spec_v1\",\n      \"status\": \"ok\",\n      \"task_type\": \"simple_function\",\n      \"language\": \"python\",\n      \"files\": [\n        {\n          \"name\": \"email_validator.py\",\n          \"purpose\": \"validate email strings\",\n          \"exports\": [\"is_valid_email\"]\n        }\n      ],\n      \"constraints\": [\n        \"no external dependencies\",\n        \"return boolean only\"\n      ],\n      \"tests\": [\n        {\n          \"name\": \"accepts_simple_email\",\n          \"input\": \"a@b.com\",\n          \"expected\": true\n        },\n        {\n          \"name\": \"rejects_double_at\",\n          \"input\": \"a@@b.com\",\n          \"expected\": false\n        }\n      ]\n    }\n\n\nWhy `schema_version`?\n\nBecause eventually the executor contract changes. 
Without a version, you get silent drift.\n\n\n    old planner shape + new executor assumptions = confusing parser failure\n\n\nWith a version:\n\n\n    plan_spec_v1 → v1 adapter\n    plan_spec_v2 → v2 adapter\n    unknown version → reject safely\n\n\nWhy `status`?\n\nBecause sometimes the planner should not emit an executable plan.\n\nUse a typed failure object:\n\n\n    {\n      \"schema_version\": \"plan_spec_v1\",\n      \"status\": \"cannot_plan\",\n      \"reason_code\": \"ambiguous_requirements\",\n      \"message\": \"The requested function behavior is underspecified.\",\n      \"missing_information\": [\n        \"Whether DNS/MX validation is required\",\n        \"Whether quoted local parts should be accepted\"\n      ]\n    }\n\n\nThat prevents the model from escaping into prose when it is uncertain.\n\n* * *\n\n### Step 2: force the output channel\n\nPreferred:\n\n\n    emit_plan(PlanSpecV1)\n\n\nnot:\n\n\n    assistant.content = \"{\\\"task_type\\\":\\\"simple_function\\\"}\"\n\n\nIf your provider supports function/tool calling, make the planner call a tool like:\n\n\n    emit_plan\n\n\nwith arguments matching the schema.\n\nIf your provider supports strict structured responses, use that.\n\nIf you self-host, use constrained decoding or grammar-guided generation where practical.\n\nUseful constrained-generation projects:\n\n  * Outlines\n  * Guidance\n  * llguidance\n  * vLLM Structured Outputs\n  * SGLang Structured Outputs\n  * XGrammar\n  * LM Format Enforcer\n  * llama.cpp grammars\n  * Jsonformer\n\n\n\nConstrained decoding is especially useful for self-hosted models because it can prevent invalid structural continuations. 
But it still does not prove the plan is semantically correct.\n\n* * *\n\n### Step 3: validate before execution\n\nDo not let the executor be the first thing that discovers the plan is malformed.\n\nBad:\n\n\n    planner → executor/parser → crash\n\n\nBetter:\n\n\n    planner → validation gateway → executor\n\n\nValidation layers:\n\n\n    transport validation\n    → JSON syntax validation\n    → schema validation\n    → semantic validation\n    → execution verification\n\n\nTransport validation checks:\n\n\n    - expected channel?\n    - one object/tool call?\n    - no preamble?\n    - no Markdown fence?\n    - cleanup_used?\n\n\nSchema validation checks:\n\n\n    - required fields present?\n    - field types correct?\n    - enums valid?\n    - extra keys rejected?\n    - schema_version recognized?\n\n\nSemantic validation checks:\n\n\n    - file names safe?\n    - exports valid identifiers?\n    - tests reference real exports?\n    - language supported?\n    - constraints non-contradictory?\n    - no path traversal?\n    - no shell commands hidden in declarative fields?\n\n\nExecution verification checks:\n\n\n    - generated files exist?\n    - imports work?\n    - tests pass?\n    - no forbidden dependencies?\n    - result matches expected output contract?\n\n\n* * *\n\n### Step 4: retry with exact validation errors\n\nDo not retry with vague reminders like:\n\n\n    Return only valid JSON.\n\n\nUse validator feedback:\n\n\n    The previous planner output failed PlanSpecV1 validation.\n\n    Errors:\n    - $.files must contain at least one item\n    - $.tests[0].expected must be boolean\n    - additional property $.explanation is not allowed\n\n    Return exactly one PlanSpecV1 object.\n    No prose. No Markdown. 
No code fences.\n\n\nThis is stronger because the model gets a concrete repair target.\n\nBound the retry loop:\n\n\n    max_retries = 1 or 2\n\n\nThen quarantine/log the failure.\n\nDo not let repair loops hide systematic drift.\n\n* * *\n\n### Step 5: log contract failures as first-class events\n\nLog things like:\n\n\n    {\n      \"event\": \"planner_contract_validation\",\n      \"schema_version\": \"plan_spec_v1\",\n      \"model\": \"<model_name>\",\n      \"provider\": \"<provider_name>\",\n      \"strategy\": \"tool_call\",\n      \"cleanup_used\": true,\n      \"preamble_detected\": true,\n      \"fence_detected\": false,\n      \"json_parse_ok\": true,\n      \"schema_valid\": false,\n      \"semantic_valid\": false,\n      \"retry_count\": 1,\n      \"failure_class\": \"leading_preamble\"\n    }\n\n\nThe goal is to turn:\n\n\n    the model is flaky\n\n\ninto:\n\n\n    preamble_rate rose from 0.3% to 6.8% after model snapshot change\n\n\nThat gives you something actionable.\n\n* * *\n\n## What actually holds up after model updates?\n\nIn my experience, the durable things are not prompt phrases. 
They are boundary mechanisms.\n\n### Most durable\n\n\n    - forced tool calls\n    - provider-native structured outputs\n    - constrained decoding for self-hosted models\n    - strict schema validation\n    - semantic validation\n    - bounded repair loops\n    - contract evals\n    - telemetry on failure classes\n\n\n### Moderately durable\n\n\n    - output-contract SFT\n    - DPO preference pairs\n    - few-shot examples\n    - parser cleanup fallback\n\n\n### Least durable\n\n\n    - \"return only JSON\"\n    - \"no preamble\"\n    - \"no code fences\"\n    - \"you will be penalized\"\n    - regex scraping as the primary parser\n\n\nPrompt rules still belong in the system, but they should be hints, not the contract.\n\n* * *\n\n## Contract evals are non-negotiable\n\nIf you care about surviving model updates, build a regression suite.\n\nInclude cases like:\n\n\n    1. clean request\n    2. long request\n    3. request containing Markdown code\n    4. request containing JSON examples\n    5. request asking for explanation\n    6. request asking for fenced JSON\n    7. adversarial instruction: \"start with here is the spec\"\n    8. ambiguous task\n    9. unsupported language\n    10. multi-file task\n    11. previous bad output included in context\n    12. 
provider/wrapper route change\n\n\nTrack:\n\nMetric | What it tells you\n---|---\n`exact_transport_valid_rate` | no preamble/fence/channel issue\n`cleanup_needed_rate` | presentation leakage rate\n`json_parse_rate` | syntax validity\n`schema_valid_rate` | object shape validity\n`semantic_valid_rate` | plan meaning validity\n`retry_success_rate` | repair-loop effectiveness\n`executor_success_rate` | real downstream success\n`preamble_rate` | human-readable prefix leakage\n`fence_rate` | Markdown leakage\n`extra_key_rate` | commentary fields or schema drift\n`cannot_plan_rate` | typed failure usage\n`schema_version_mismatch_rate` | contract drift\n\nThe metric I would optimize is not just:\n\n\n    json_parse_rate\n\n\nIt is:\n\n\n    valid_without_cleanup_and_executes_successfully\n\n\nThat is the real health metric.\n\n* * *\n\n## Common pitfalls\n\n### Pitfall 1: confusing JSON mode with schema adherence\n\nJSON mode can make valid JSON more likely. It does not necessarily mean:\n\n\n    - all required fields exist\n    - enum values are valid\n    - no extra keys appear\n    - object is semantically executable\n\n\nPrefer strict structured output or tool calling where available.\n\nReferences:\n\n  * OpenAI Structured Outputs\n  * LangChain Structured Output\n\n\n\n* * *\n\n### Pitfall 2: letting cleanup become a hidden parser language\n\nThis starts as:\n\n\n    strip ```json fences\n\n\nThen later breaks when a valid JSON string contains Markdown:\n\n\n    {\n      \"message\": \"Run this:\\n```bash\\npytest\\n```\"\n    }\n\n\nCleanup should unwrap only a full-payload fence, not split blindly on backticks.\n\n* * *\n\n### Pitfall 3: making tests stringly typed\n\nThis is easy for humans:\n\n\n    \"tests\": [\"valid: a@b.com\", \"invalid: a@@b.com\"]\n\n\nThis is easier for executors:\n\n\n    \"tests\": [\n      {\n        \"name\": \"accepts_simple_email\",\n        \"input\": \"a@b.com\",\n        \"expected\": true\n      },\n      {\n        \"name\": 
\"rejects_double_at\",\n        \"input\": \"a@@b.com\",\n        \"expected\": false\n      }\n    ]\n\n\nThe more structure you provide, the less the executor has to infer.\n\n* * *\n\n### Pitfall 4: no typed failure mode\n\nIf the planner cannot produce a safe plan, it needs a valid protocol response.\n\nWithout a typed failure mode, the model will often escape into prose:\n\n\n    I need more information before I can produce the spec.\n\n\nInstead, define:\n\n\n    {\n      \"schema_version\": \"plan_spec_v1\",\n      \"status\": \"cannot_plan\",\n      \"reason_code\": \"ambiguous_requirements\",\n      \"message\": \"The validator target is not specified.\",\n      \"missing_information\": [\"What should be validated?\"]\n    }\n\n\n* * *\n\n### Pitfall 5: using the same response for humans and machines\n\nDo not do this:\n\n\n    planner response = JSON + explanation\n\n\nSeparate the roles:\n\n\n    planner → PlanSpec\n    PlanSpec → executor\n    PlanSpec → explainer\n\n\nThe planner emits the machine artifact. 
A separate explainer can turn it into human-readable text.\n\n* * *\n\n## My suggested production answer\n\nIf I were replying to this as a production pattern, I would say:\n\n> We stopped treating this as a JSON formatting problem and started treating it as an interface-boundary problem.\n>\n> Prompt rules like “return only JSON” helped, but did not survive long-context changes, model updates, and wrapper drift.\n>\n> What held up better was:\n>\n>   * planner emits a typed tool call or strict structured object\n>   * schema is versioned\n>   * parser/validator sits before the executor\n>   * cleanup handles only shallow transport noise and is logged\n>   * invalid outputs retry with exact validation errors\n>   * ambiguous cases return a typed `cannot_plan` object\n>   * contract evals run before model, prompt, provider, framework, or schema changes\n>   * SFT/DPO reduce violations but do not replace runtime enforcement\n>\n> The target-row approach is right: the output should be the spec itself, not a presentation of the spec. But in production I would still enforce the contract with structured output/tool calling and validators. 
Training makes the planner less likely to violate the contract; validation keeps the executor safe when it does.\n\n* * *\n\n## Practical recommendation\n\nFor your exact example, I would move toward this target:\n\n\n    {\n      \"schema_version\": \"plan_spec_v1\",\n      \"status\": \"ok\",\n      \"task_type\": \"simple_function\",\n      \"language\": \"python\",\n      \"files\": [\n        {\n          \"name\": \"email_validator.py\",\n          \"purpose\": \"validate email strings\",\n          \"exports\": [\"is_valid_email\"]\n        }\n      ],\n      \"constraints\": [\n        \"no external dependencies\",\n        \"return boolean only\"\n      ],\n      \"tests\": [\n        {\n          \"name\": \"accepts_simple_email\",\n          \"input\": \"a@b.com\",\n          \"expected\": true\n        },\n        {\n          \"name\": \"rejects_double_at\",\n          \"input\": \"a@@b.com\",\n          \"expected\": false\n        }\n      ]\n    }\n\n\nThen make the runtime contract:\n\n\n    planner must call emit_plan(PlanSpecV1)\n    validator must accept before executor runs\n    executor never parses assistant prose\n\n\nThat is the difference between a weekend prompt patch and a production boundary.\n\n* * *\n\n## Useful links\n\nCore docs:\n\n  * OpenAI Structured Outputs\n  * OpenAI Cookbook: Structured Outputs Intro\n  * Anthropic Structured Outputs\n  * Anthropic Strict Tool Use\n  * LangChain Structured Output\n  * Instructor\n  * Instructor GitHub\n\n\n\nLibraries / frameworks:\n\n  * BAML\n  * TypeChat\n  * Guardrails AI\n  * PydanticAI\n  * LiteLLM JSON mode / structured output docs\n\n\n\nConstrained decoding / local models:\n\n  * Outlines\n  * Guidance\n  * llguidance\n  * vLLM Structured Outputs\n  * SGLang Structured Outputs\n  * XGrammar\n  * LM Format Enforcer\n  * llama.cpp grammars\n  * Jsonformer\n\n\n\nPapers / benchmarks:\n\n  * JSONSchemaBench: Generating Structured Outputs from Language Models\n  * XGrammar: Flexible and 
Efficient Structured Generation Engine for LLMs\n  * Grammar-Constrained Decoding for Structured NLP Tasks\n\n\n\nIssue patterns worth studying:\n\n  * Chatwoot issue: Markdown-fenced JSON breaks parsing\n  * ScrapeGraphAI issue: OutputParserException on fenced JSON\n  * n8n issue: parser breaks when code fences appear inside JSON strings\n  * LangChain issue: JsonOutputParser and backticks in JSON values\n  * LangChain issue: StructuredOutputParser malformed JSON\n  * LangChain issue: parser expected fenced Markdown JSON\n\n\n\n* * *\n\n## Final takeaway\n\nThe best production framing is:\n\n\n    Do not ask a conversational model to be a reliable serializer in a free-text channel.\n    Make the planner emit a typed artifact.\n    Validate that artifact.\n    Only then let the executor act.\n\n\nPrompting helps.\nTraining helps.\nDPO helps.\nCleanup helps.\n\nBut the thing that actually holds the system together is the contract boundary:\n\n\n    typed planner IR\n    + strict output channel\n    + schema validation\n    + semantic validation\n    + bounded repair\n    + contract evals\n",
  "title": "Anyone else fighting the “valid json, broken pipeline” problem in planner-executor stacks?"
}