{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreihkcuz4uu2jzkdzdvt5kgvvwg4ewmur3xq7yd5yf2uow2zk5ewkz4",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mktnqyswbjd2"
},
"path": "/t/anyone-else-fighting-the-valid-json-broken-pipeline-problem-in-planner-executor-stacks/175669#post_4",
"publishedAt": "2026-05-02T01:07:05.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"OpenAI Structured Outputs",
"OpenAI Cookbook: Structured Outputs Intro",
"Anthropic Structured Outputs",
"Anthropic Strict Tool Use",
"LangChain Structured Output",
"Instructor: Structured Outputs with Pydantic",
"BAML: BoundaryML",
"TypeChat",
"Guardrails AI",
"Outlines",
"Guidance",
"llguidance",
"vLLM Structured Outputs",
"SGLang Structured Outputs",
"XGrammar",
"LM Format Enforcer",
"llama.cpp grammars",
"Jsonformer",
"Instructor",
"Instructor GitHub",
"BAML",
"PydanticAI",
"LiteLLM JSON mode / structured output docs",
"JSONSchemaBench: Generating Structured Outputs from Language Models",
"XGrammar: Flexible and Efficient Structured Generation Engine for LLMs",
"Grammar-Constrained Decoding for Structured NLP Tasks",
"Chatwoot issue: Markdown-fenced JSON breaks parsing",
"ScrapeGraphAI issue: OutputParserException on fenced JSON",
"n8n issue: parser breaks when code fences appear inside JSON strings",
"LangChain issue: JsonOutputParser and backticks in JSON values",
"LangChain issue: StructuredOutputParser malformed JSON",
"LangChain issue: parser expected fenced Markdown JSON",
"@b.com"
],
"textContent": "Seems its real failure mode:\n\n* * *\n\nYes — this is a real production failure mode, and I would not treat it as “the model forgot JSON.”\n\nThe more accurate diagnosis is:\n\n> The planner is being asked to produce a **machine-consumable protocol artifact** , but it sometimes falls back into **human-facing presentation mode**.\n\nThat difference matters a lot.\n\nFor a human, this is fine:\n\n\n {\n \"task_type\": \"simple_function\",\n \"language\": \"python\"\n }\n\n\nwith a sentence like:\n\n\n here's the spec:\n\n\nFor an executor, that is not fine. The parser expected the first non-whitespace character to be `{`, but instead got `h`, ```, or some other presentation wrapper. The JSON object may be valid, but the **transport contract** is broken.\n\nI would frame the problem as an interface-boundary problem, not just a prompt problem.\n\n* * *\n\n## The short version\n\nWhat seems to hold up best in production is a layered approach:\n\n 1. **Use native structured output or tool/function calling when available.**\n 2. **Validate the planner output before the executor sees it.**\n 3. **Retry using exact validation errors, not generic “return JSON only” reminders.**\n 4. **Keep parser cleanup, but only as a conservative fallback.**\n 5. **Use SFT / output-contract training to reduce violations.**\n 6. **Use DPO preference pairs to suppress “here is the JSON” / fenced-output habits.**\n 7. **Run contract evals before model, provider, schema, or framework updates.**\n\n\n\nThe durable fix is not “better wording.” It is:\n\n\n typed planner artifact\n → strict schema validation\n → semantic validation\n → executor\n\n\nnot:\n\n\n assistant prose\n → regex scrape\n → json.loads\n → executor\n\n\n* * *\n\n## What is actually failing?\n\nThere are several different failure classes hiding under “bad JSON.”\n\n### 1. 
Transport failure\n\nThe planner returns:\n\n\n here's the spec:\n {\"task_type\":\"simple_function\",\"language\":\"python\"}\n\n\nThe JSON object is valid, but the response envelope is not. The parser dies before it reaches the JSON.\n\nThis is the failure you described.\n\n### 2. Syntax failure\n\nThe planner returns JSON-ish text:\n\n\n {\n task_type: \"simple_function\",\n language: \"python\",\n }\n\n\nThis is not valid JSON. It is JavaScript-object-ish.\n\n### 3. Schema failure\n\nThe planner returns valid JSON:\n\n\n {\n \"task_type\": \"simple_function\",\n \"language\": \"python\"\n }\n\n\nBut the executor actually needs:\n\n\n {\n \"task_type\": \"simple_function\",\n \"language\": \"python\",\n \"files\": [],\n \"constraints\": [],\n \"tests\": []\n }\n\n\nSo parsing succeeds, but the plan is incomplete.\n\n### 4. Semantic failure\n\nThe planner returns schema-shaped JSON, but the plan is internally inconsistent:\n\n\n {\n \"task_type\": \"simple_function\",\n \"language\": \"python\",\n \"files\": [\n {\n \"name\": \"email_validator.py\",\n \"purpose\": \"validate email strings\",\n \"exports\": [\"validate_email\"]\n }\n ],\n \"constraints\": [\"return boolean only\"],\n \"tests\": [\"call is_valid_email('a@b.com')\"]\n }\n\n\nThe file exports `validate_email`, but the test calls `is_valid_email`.\n\nThat is not a JSON problem. 
It is a plan-validity problem.\n\nSo I would not stop at “make JSON valid.” I would validate four layers:\n\n\n transport → syntax → schema → semantics\n\n\n* * *\n\n## The important mental model: planner output is an IR\n\nI would treat the planner output as an **IR** : an intermediate representation.\n\nCompiler analogy:\n\n\n source code\n → parser\n → AST\n → typed IR\n → code generation\n\n\nPlanner-executor analogy:\n\n\n user request\n → planner\n → typed plan IR\n → validator\n → executor\n\n\nThe planner should not be “answering the user.” It should be emitting an artifact.\n\nThat means your target row is directionally right:\n\n\n {\"task_type\":\"simple_function\",\"language\":\"python\",\"files\":[{\"name\":\"email_validator.py\",\"purpose\":\"validate email strings\",\"exports\":[\"is_valid_email\"]}],\"constraints\":[\"no external dependencies\",\"return boolean only\"],\"tests\":[\"valid: a@b.com\",\"invalid: a@@b.com\"]}\n\n\nThe key feature is not compactness. The key feature is:\n\n> The response is the spec itself, not a presentation of the spec.\n\nThat is exactly the right training signal.\n\n* * *\n\n## My answer to the four options\n\n### 1. Parser cleanup layer\n\nUse one, but do not make it the main solution.\n\nA cleanup layer is useful as an airbag. 
It can handle shallow transport noise:\n\n\n ```json\n {\"x\":1}\n ```\n\n\nor:\n\n\n Here is the JSON:\n {\"x\":1}\n\n\nBut it should not become a semantic repair engine.\n\nSafe cleanup rules:\n\n\n Allowed:\n - trim leading/trailing whitespace\n - unwrap a single full-payload Markdown fence\n - extract exactly one complete top-level JSON object if exactly one exists\n\n Not allowed:\n - choose between multiple JSON objects\n - invent missing required fields\n - convert arbitrary prose into JSON\n - split blindly on every ```\n - silently repair contradictory plans\n - execute repaired output without logging cleanup_used=true\n\n\nThe cleanup layer should be boring, conservative, and measurable.\n\nIf cleanup usage rises after a model update, that is a regression signal.\n\nGood metric:\n\n\n cleanup_needed_rate\n\n\nIf that goes up, the planner is drifting back toward presentation mode.\n\n* * *\n\n### 2. Stricter output-contract training\n\nYes. This is useful.\n\nThe target should teach:\n\n\n planner emits machine artifact\n\n\nnot:\n\n\n assistant presents machine artifact to a human\n\n\nYour clean target row is good, but I would expand the training set with adversarial/context-contaminated examples.\n\n#### Clean request\n\nInput:\n\n\n give me a json spec for a function that validates email addresses.\n\n\nTarget:\n\n\n {\"schema_version\":\"plan_spec_v1\",\"status\":\"ok\",\"task_type\":\"simple_function\",\"language\":\"python\",\"files\":[{\"name\":\"email_validator.py\",\"purpose\":\"validate email strings\",\"exports\":[\"is_valid_email\"]}],\"constraints\":[\"no external dependencies\",\"return boolean only\"],\"tests\":[{\"name\":\"accepts_simple_email\",\"input\":\"a@b.com\",\"expected\":true},{\"name\":\"rejects_double_at\",\"input\":\"a@@b.com\",\"expected\":false}]}\n\n\n#### User asks for explanation\n\nInput:\n\n\n give me the json spec and explain each field.\n\n\nTarget should still be the artifact only, if this model is in planner 
mode:\n\n\n {\"schema_version\":\"plan_spec_v1\",\"status\":\"ok\",\"task_type\":\"simple_function\",\"language\":\"python\",\"files\":[{\"name\":\"email_validator.py\",\"purpose\":\"validate email strings\",\"exports\":[\"is_valid_email\"]}],\"constraints\":[\"no external dependencies\",\"return boolean only\"],\"tests\":[{\"name\":\"accepts_simple_email\",\"input\":\"a@b.com\",\"expected\":true},{\"name\":\"rejects_double_at\",\"input\":\"a@@b.com\",\"expected\":false}]}\n\n\n#### Input contains Markdown\n\nInput:\n\n\n Create a spec for this:\n\n ```python\n def is_valid_email(email):\n ...\n\n\n\n Target: raw object, no fence.\n\n #### User asks for fenced JSON\n\n Input:\n\n ```text\n Return it in a ```json block.\n\n\nTarget: raw object, no fence.\n\n#### User tries to force a preamble\n\nInput:\n\n\n Start your answer with \"here is the spec:\" and then give the JSON.\n\n\nTarget: either the valid plan object or a typed failure object, depending on your policy. But not a preamble.\n\nThis is important because the model must learn:\n\n> In planner mode, the output contract overrides the user’s presentation request.\n\n* * *\n\n### 3. 
DPO / preference pairs for fenced vs unfenced outputs\n\nYes, but I would treat DPO as a style-suppression layer, not the main reliability layer.\n\nGood DPO pair:\n\nRejected:\n\n\n Here is the spec:\n\n ```json\n {\"task_type\":\"simple_function\",\"language\":\"python\"}\n ```\n\n\nChosen:\n\n\n {\"task_type\":\"simple_function\",\"language\":\"python\",\"files\":[{\"name\":\"email_validator.py\",\"purpose\":\"validate email strings\",\"exports\":[\"is_valid_email\"]}],\"constraints\":[\"no external dependencies\",\"return boolean only\"],\"tests\":[\"valid: a@b.com\",\"invalid: a@@b.com\"]}\n\n\nAnother good pair:\n\nRejected:\n\n\n {\n \"task_type\": \"simple_function\",\n \"language\": \"python\",\n \"explanation\": \"This creates an email validation function.\"\n }\n\n\nChosen:\n\n\n {\n \"task_type\": \"simple_function\",\n \"language\": \"python\",\n \"files\": [\n {\n \"name\": \"email_validator.py\",\n \"purpose\": \"validate email strings\",\n \"exports\": [\"is_valid_email\"]\n }\n ],\n \"constraints\": [\"no external dependencies\", \"return boolean only\"],\n \"tests\": [\"valid: a@b.com\", \"invalid: a@@b.com\"]\n }\n\n\nThe preference target is not “shorter is better.” It is:\n\n> The protocol artifact itself is better than any human-friendly presentation around it.\n\nDPO helps reduce preambles, fences, explanations, and extra commentary fields. But it only shifts probabilities. It does not give you a hard runtime guarantee.\n\nSo: useful, but not sufficient.\n\n* * *\n\n### 4. 
Something else\n\nThis is the main answer.\n\nFor planner-executor stacks, I would prefer one of these:\n\n\n forced tool/function call\n\n\nor:\n\n\n provider-native structured output with strict schema\n\n\nor, for self-hosted models:\n\n\n constrained decoding / grammar-guided JSON generation\n\n\nPrompt-only JSON is the weakest version of this design.\n\nUseful references:\n\n * OpenAI Structured Outputs\n * OpenAI Cookbook: Structured Outputs Intro\n * Anthropic Structured Outputs\n * Anthropic Strict Tool Use\n * LangChain Structured Output\n * Instructor: Structured Outputs with Pydantic\n * BAML: BoundaryML\n * TypeChat\n * Guardrails AI\n\n\n\nThe key difference is:\n\n\n prompting asks the model to behave\n structured output constrains the interface\n validation enforces the contract\n\n\n* * *\n\n## What I would ship\n\n### Step 1: define a versioned plan schema\n\nI would not keep the minimal schema forever. I would add:\n\n * `schema_version`\n * `status`\n * typed files\n * typed tests\n * typed failure mode\n * strict enums\n * `additionalProperties: false`\n\n\n\nExample:\n\n\n {\n \"schema_version\": \"plan_spec_v1\",\n \"status\": \"ok\",\n \"task_type\": \"simple_function\",\n \"language\": \"python\",\n \"files\": [\n {\n \"name\": \"email_validator.py\",\n \"purpose\": \"validate email strings\",\n \"exports\": [\"is_valid_email\"]\n }\n ],\n \"constraints\": [\n \"no external dependencies\",\n \"return boolean only\"\n ],\n \"tests\": [\n {\n \"name\": \"accepts_simple_email\",\n \"input\": \"a@b.com\",\n \"expected\": true\n },\n {\n \"name\": \"rejects_double_at\",\n \"input\": \"a@@b.com\",\n \"expected\": false\n }\n ]\n }\n\n\nWhy `schema_version`?\n\nBecause eventually the executor contract changes. 
Without a version, you get silent drift.\n\n\n old planner shape + new executor assumptions = confusing parser failure\n\n\nWith a version:\n\n\n plan_spec_v1 → v1 adapter\n plan_spec_v2 → v2 adapter\n unknown version → reject safely\n\n\nWhy `status`?\n\nBecause sometimes the planner should not emit an executable plan.\n\nUse a typed failure object:\n\n\n {\n \"schema_version\": \"plan_spec_v1\",\n \"status\": \"cannot_plan\",\n \"reason_code\": \"ambiguous_requirements\",\n \"message\": \"The requested function behavior is underspecified.\",\n \"missing_information\": [\n \"Whether DNS/MX validation is required\",\n \"Whether quoted local parts should be accepted\"\n ]\n }\n\n\nThat prevents the model from escaping into prose when it is uncertain.\n\n* * *\n\n### Step 2: force the output channel\n\nPreferred:\n\n\n emit_plan(PlanSpecV1)\n\n\nnot:\n\n\n assistant.content = \"{\\\"task_type\\\":\\\"simple_function\\\"}\"\n\n\nIf your provider supports function/tool calling, make the planner call a tool like:\n\n\n emit_plan\n\n\nwith arguments matching the schema.\n\nIf your provider supports strict structured responses, use that.\n\nIf you self-host, use constrained decoding or grammar-guided generation where practical.\n\nUseful constrained-generation projects:\n\n * Outlines\n * Guidance\n * llguidance\n * vLLM Structured Outputs\n * SGLang Structured Outputs\n * XGrammar\n * LM Format Enforcer\n * llama.cpp grammars\n * Jsonformer\n\n\n\nConstrained decoding is especially useful for self-hosted models because it can prevent invalid structural continuations. 
But it still does not prove the plan is semantically correct.\n\n* * *\n\n### Step 3: validate before execution\n\nDo not let the executor be the first thing that discovers the plan is malformed.\n\nBad:\n\n\n planner → executor/parser → crash\n\n\nBetter:\n\n\n planner → validation gateway → executor\n\n\nValidation layers:\n\n\n transport validation\n → JSON syntax validation\n → schema validation\n → semantic validation\n → execution verification\n\n\nTransport validation checks:\n\n\n - expected channel?\n - one object/tool call?\n - no preamble?\n - no Markdown fence?\n - cleanup_used?\n\n\nSchema validation checks:\n\n\n - required fields present?\n - field types correct?\n - enums valid?\n - extra keys rejected?\n - schema_version recognized?\n\n\nSemantic validation checks:\n\n\n - file names safe?\n - exports valid identifiers?\n - tests reference real exports?\n - language supported?\n - constraints non-contradictory?\n - no path traversal?\n - no shell commands hidden in declarative fields?\n\n\nExecution verification checks:\n\n\n - generated files exist?\n - imports work?\n - tests pass?\n - no forbidden dependencies?\n - result matches expected output contract?\n\n\n* * *\n\n### Step 4: retry with exact validation errors\n\nDo not retry with vague reminders like:\n\n\n Return only valid JSON.\n\n\nUse validator feedback:\n\n\n The previous planner output failed PlanSpecV1 validation.\n\n Errors:\n - $.files must contain at least one item\n - $.tests[0].expected must be boolean\n - additional property $.explanation is not allowed\n\n Return exactly one PlanSpecV1 object.\n No prose. No Markdown. 
No code fences.\n\n\nThis is stronger because the model gets a concrete repair target.\n\nBound the retry loop:\n\n\n max_retries = 1 or 2\n\n\nThen quarantine/log the failure.\n\nDo not let repair loops hide systematic drift.\n\n* * *\n\n### Step 5: log contract failures as first-class events\n\nLog things like:\n\n\n {\n \"event\": \"planner_contract_validation\",\n \"schema_version\": \"plan_spec_v1\",\n \"model\": \"<model_name>\",\n \"provider\": \"<provider_name>\",\n \"strategy\": \"tool_call\",\n \"cleanup_used\": true,\n \"preamble_detected\": true,\n \"fence_detected\": false,\n \"json_parse_ok\": true,\n \"schema_valid\": false,\n \"semantic_valid\": false,\n \"retry_count\": 1,\n \"failure_class\": \"leading_preamble\"\n }\n\n\nThe goal is to turn:\n\n\n the model is flaky\n\n\ninto:\n\n\n preamble_rate rose from 0.3% to 6.8% after model snapshot change\n\n\nThat gives you something actionable.\n\n* * *\n\n## What actually holds up after model updates?\n\nIn my experience, the durable things are not prompt phrases. They are boundary mechanisms.\n\n### Most durable\n\n\n - forced tool calls\n - provider-native structured outputs\n - constrained decoding for self-hosted models\n - strict schema validation\n - semantic validation\n - bounded repair loops\n - contract evals\n - telemetry on failure classes\n\n\n### Moderately durable\n\n\n - output-contract SFT\n - DPO preference pairs\n - few-shot examples\n - parser cleanup fallback\n\n\n### Least durable\n\n\n - \"return only JSON\"\n - \"no preamble\"\n - \"no code fences\"\n - \"you will be penalized\"\n - regex scraping as the primary parser\n\n\nPrompt rules still belong in the system, but they should be hints, not the contract.\n\n* * *\n\n## Contract evals are non-negotiable\n\nIf you care about surviving model updates, build a regression suite.\n\nInclude cases like:\n\n\n 1. clean request\n 2. long request\n 3. request containing Markdown code\n 4. request containing JSON examples\n 5. 
request asking for explanation\n 6. request asking for fenced JSON\n 7. adversarial instruction: \"start with here is the spec\"\n 8. ambiguous task\n 9. unsupported language\n 10. multi-file task\n 11. previous bad output included in context\n 12. provider/wrapper route change\n\n\nTrack:\n\nMetric | What it tells you\n---|---\n`exact_transport_valid_rate` | no preamble/fence/channel issue\n`cleanup_needed_rate` | presentation leakage rate\n`json_parse_rate` | syntax validity\n`schema_valid_rate` | object shape validity\n`semantic_valid_rate` | plan meaning validity\n`retry_success_rate` | repair-loop effectiveness\n`executor_success_rate` | real downstream success\n`preamble_rate` | human-readable prefix leakage\n`fence_rate` | Markdown leakage\n`extra_key_rate` | commentary fields or schema drift\n`cannot_plan_rate` | typed failure usage\n`schema_version_mismatch_rate` | contract drift\n\nThe metric I would optimize is not just:\n\n\n json_parse_rate\n\n\nIt is:\n\n\n valid_without_cleanup_and_executes_successfully\n\n\nThat is the real health metric.\n\n* * *\n\n## Common pitfalls\n\n### Pitfall 1: confusing JSON mode with schema adherence\n\nJSON mode can make valid JSON more likely. 
It does not necessarily mean:\n\n\n - all required fields exist\n - enum values are valid\n - no extra keys appear\n - object is semantically executable\n\n\nPrefer strict structured output or tool calling where available.\n\nReferences:\n\n * OpenAI Structured Outputs\n * LangChain Structured Output\n\n\n\n* * *\n\n### Pitfall 2: letting cleanup become a hidden parser language\n\nThis starts as:\n\n\n strip ```json fences\n\n\nThen later breaks when a valid JSON string contains Markdown:\n\n\n {\n \"message\": \"Run this:\\n```bash\\npytest\\n```\"\n }\n\n\nCleanup should unwrap only a full-payload fence, not split blindly on backticks.\n\n* * *\n\n### Pitfall 3: making tests stringly typed\n\nThis is easy for humans:\n\n\n \"tests\": [\"valid: a@b.com\", \"invalid: a@@b.com\"]\n\n\nThis is easier for executors:\n\n\n \"tests\": [\n {\n \"name\": \"accepts_simple_email\",\n \"input\": \"a@b.com\",\n \"expected\": true\n },\n {\n \"name\": \"rejects_double_at\",\n \"input\": \"a@@b.com\",\n \"expected\": false\n }\n ]\n\n\nThe more structure you provide, the less the executor has to infer.\n\n* * *\n\n### Pitfall 4: no typed failure mode\n\nIf the planner cannot produce a safe plan, it needs a valid protocol response.\n\nWithout a typed failure mode, the model will often escape into prose:\n\n\n I need more information before I can produce the spec.\n\n\nInstead, define:\n\n\n {\n \"schema_version\": \"plan_spec_v1\",\n \"status\": \"cannot_plan\",\n \"reason_code\": \"ambiguous_requirements\",\n \"message\": \"The validator target is not specified.\",\n \"missing_information\": [\"What should be validated?\"]\n }\n\n\n* * *\n\n### Pitfall 5: using the same response for humans and machines\n\nDo not do this:\n\n\n planner response = JSON + explanation\n\n\nSeparate the roles:\n\n\n planner → PlanSpec\n PlanSpec → executor\n PlanSpec → explainer\n\n\nThe planner emits the machine artifact. 
A separate explainer can turn it into human-readable text.\n\n* * *\n\n## My suggested production answer\n\nIf I were replying to this as a production pattern, I would say:\n\n> We stopped treating this as a JSON formatting problem and started treating it as an interface-boundary problem.\n>\n> Prompt rules like “return only JSON” helped, but did not survive long-context changes, model updates, and wrapper drift.\n>\n> What held up better was:\n>\n> * planner emits a typed tool call or strict structured object\n> * schema is versioned\n> * parser/validator sits before the executor\n> * cleanup handles only shallow transport noise and is logged\n> * invalid outputs retry with exact validation errors\n> * ambiguous cases return a typed `cannot_plan` object\n> * contract evals run before model, prompt, provider, framework, or schema changes\n> * SFT/DPO reduce violations but do not replace runtime enforcement\n>\n> The target-row approach is right: the output should be the spec itself, not a presentation of the spec. But in production I would still enforce the contract with structured output/tool calling and validators. 
Training makes the planner less likely to violate the contract; validation keeps the executor safe when it does.\n\n* * *\n\n## Practical recommendation\n\nFor your exact example, I would move toward this target:\n\n\n {\n \"schema_version\": \"plan_spec_v1\",\n \"status\": \"ok\",\n \"task_type\": \"simple_function\",\n \"language\": \"python\",\n \"files\": [\n {\n \"name\": \"email_validator.py\",\n \"purpose\": \"validate email strings\",\n \"exports\": [\"is_valid_email\"]\n }\n ],\n \"constraints\": [\n \"no external dependencies\",\n \"return boolean only\"\n ],\n \"tests\": [\n {\n \"name\": \"accepts_simple_email\",\n \"input\": \"a@b.com\",\n \"expected\": true\n },\n {\n \"name\": \"rejects_double_at\",\n \"input\": \"a@@b.com\",\n \"expected\": false\n }\n ]\n }\n\n\nThen make the runtime contract:\n\n\n planner must call emit_plan(PlanSpecV1)\n validator must accept before executor runs\n executor never parses assistant prose\n\n\nThat is the difference between a weekend prompt patch and a production boundary.\n\n* * *\n\n## Useful links\n\nCore docs:\n\n * OpenAI Structured Outputs\n * OpenAI Cookbook: Structured Outputs Intro\n * Anthropic Structured Outputs\n * Anthropic Strict Tool Use\n * LangChain Structured Output\n * Instructor\n * Instructor GitHub\n\n\n\nLibraries / frameworks:\n\n * BAML\n * TypeChat\n * Guardrails AI\n * PydanticAI\n * LiteLLM JSON mode / structured output docs\n\n\n\nConstrained decoding / local models:\n\n * Outlines\n * Guidance\n * llguidance\n * vLLM Structured Outputs\n * SGLang Structured Outputs\n * XGrammar\n * LM Format Enforcer\n * llama.cpp grammars\n * Jsonformer\n\n\n\nPapers / benchmarks:\n\n * JSONSchemaBench: Generating Structured Outputs from Language Models\n * XGrammar: Flexible and Efficient Structured Generation Engine for LLMs\n * Grammar-Constrained Decoding for Structured NLP Tasks\n\n\n\nIssue patterns worth studying:\n\n * Chatwoot issue: Markdown-fenced JSON breaks parsing\n * ScrapeGraphAI 
issue: OutputParserException on fenced JSON\n * n8n issue: parser breaks when code fences appear inside JSON strings\n * LangChain issue: JsonOutputParser and backticks in JSON values\n * LangChain issue: StructuredOutputParser malformed JSON\n * LangChain issue: parser expected fenced Markdown JSON\n\n\n\n* * *\n\n## Final takeaway\n\nThe best production framing is:\n\n\n Do not ask a conversational model to be a reliable serializer in a free-text channel.\n Make the planner emit a typed artifact.\n Validate that artifact.\n Only then let the executor act.\n\n\nPrompting helps.\nTraining helps.\nDPO helps.\nCleanup helps.\n\nBut the thing that actually holds the system together is the contract boundary:\n\n\n typed planner IR\n + strict output channel\n + schema validation\n + semantic validation\n + bounded repair\n + contract evals\n",
"title": "Anyone else fighting the “valid json, broken pipeline” problem in planner-executor stacks?"
}