{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreicn7pnh2gkv4bdrqcgqqn7uz2om2u2bunmoxu4vplc6l4s6qbhfnq",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mmv52vg2xp52"
},
"path": "/t/need-generative-model-high-quality-description-generation/176230#post_5",
"publishedAt": "2026-05-28T02:07:18.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"OpenRouter structured outputs",
"OpenRouter API reference",
"Anthropic: Building Effective Agents",
"OpenAI: Working with evals",
"OpenAI: Evaluation best practices",
"Google Search: AI-generated content guidance",
"Google Search spam policies: scaled content abuse",
"Google Search: helpful, reliable, people-first content",
"Google Search update: March 2024",
"pgvector",
"AWS: Transactional outbox pattern",
"Amazon SQS at-least-once delivery",
"Amazon SQS queue types",
"OpenAI Evals GitHub",
"Promptfoo",
"OpenRouter: Gemma 4 26B A4B",
"Hugging Face: google/gemma-4-26B-A4B-it",
"Google Gemma 4 model card",
"Hugging Face: Qwen/Qwen3.6-27B",
"Qwen blog: Qwen3.6-27B",
"Hugging Face: Qwen/Qwen3.6-35B-A3B",
"Hugging Face: mistralai/Mistral-Small-4-119B-2603",
"OpenRouter: Mistral Small 4",
"Hugging Face: CohereLabs/command-a-plus-05-2026-w4a4",
"Cohere: Introducing Command A+",
"Hugging Face PEFT LoRA docs",
"Hugging Face PEFT quantization / QLoRA guide",
"TRL SFTTrainer",
"TRL DPOTrainer"
],
"textContent": "Oh. If your existing production stack is already mostly settled, you can safely treat my earlier vLLM comments as just a from-scratch architecture example and skip that part. The more important point is this: if you use **raw** LLM responses directly, it is hard to keep quality stable at scale. In many cases, the basic pattern is to put a layer between the model output and the published page — usually by having the model produce structured output first:\n\n* * *\n\n## Short version\n\nIf I were building this, I would not start by replacing your backend.\n\nYou already have:\n\n * PostgreSQL\n * Java / Spring Boot\n * React / Node.js\n * AWS hosting\n * a production app\n * fixed operator-page UI sections\n * an API-based plan using OpenRouter or similar providers\n\n\n\nSo I would keep that stack and add an asynchronous **profile content lifecycle** around it.\n\nThe core flow would be:\n\n\n operator data\n ↓\n normalized facts\n ↓\n fact pack\n ↓\n structured generation\n ↓\n validation report\n ↓\n duplicate / SEO quality checks\n ↓\n repair or review\n ↓\n public-unverified / verified / published content\n\n\nThe model writes prose.\nThe application owns facts, consistency, validation, duplicate detection, and publishing decisions.\n\nThat distinction matters. Even a very capable LLM can produce good-looking but invalid text if the raw output is used directly. I would treat model output as a draft, not as the production artifact.\n\nUseful references:\n\n * OpenRouter structured outputs\n * OpenRouter API reference\n * Anthropic: Building Effective Agents\n * OpenAI: Working with evals\n * OpenAI: Evaluation best practices\n * Google Search: AI-generated content guidance\n * Google Search spam policies: scaled content abuse\n\n\n\n* * *\n\n## 1. Keep the backend, make the LLM provider an adapter\n\nI would not move to a new backend unless there is a strong reason.\n\nSpring Boot can remain the source of truth. 
PostgreSQL can store raw operator data, generation jobs, generated versions, validation results, review states, and publication states.\n\nThe LLM provider should be an adapter:\n\n\n interface ProfileGenerationClient {\n GeneratedProfile generate(ProfileFactPack factPack, GenerationConfig config);\n }\n\n\nInitial implementation:\n\n\n OpenRouterProfileGenerationClient\n\n\nPossible future implementations:\n\n\n DirectProviderClient\n InternalFineTunedModelClient\n SelfHostedModelClient\n\n\nFor every generation, I would store metadata:\n\n\n {\n \"provider\": \"openrouter\",\n \"requested_model\": \"<MODEL_ID>\",\n \"resolved_model\": \"<RESOLVED_MODEL_IF_AVAILABLE>\",\n \"prompt_version\": \"profile_prompt_v7\",\n \"schema_version\": \"operator_profile_schema_v3\",\n \"fact_pack_version\": \"fact_pack_v2\",\n \"temperature\": 0.3,\n \"max_tokens\": 1200,\n \"input_hash\": \"<INPUT_HASH>\",\n \"fact_pack_hash\": \"<FACT_PACK_HASH>\",\n \"output_hash\": \"<OUTPUT_HASH>\"\n }\n\n\nWithout this, it becomes difficult to debug quality changes later.\n\n* * *\n\n## 2. Start with the content contract\n\nBefore prompt engineering, I would define the exact output contract.\n\nSince your UI is fixed, the model should not return arbitrary prose. 
It should return structured content for your fixed sections.\n\nExample:\n\n\n {\n \"bio\": \"...\",\n \"services_offered\": [\n {\n \"name\": \"...\",\n \"description\": \"...\",\n \"source_fact_ids\": [\"skill_12\", \"category_3\"]\n }\n ],\n \"service_areas\": [\n {\n \"name\": \"Austin, TX\",\n \"source_fact_ids\": [\"location_primary\"]\n }\n ],\n \"faqs\": [\n {\n \"question\": \"...\",\n \"answer\": \"...\",\n \"source_fact_ids\": [\"skill_12\", \"rate_1\"]\n }\n ],\n \"seo\": {\n \"title\": \"...\",\n \"meta_description\": \"...\"\n },\n \"claims_used\": [\n {\n \"claim\": \"The operator provides appliance repair in Austin, TX.\",\n \"source_fact_ids\": [\"category_3\", \"location_primary\"]\n }\n ],\n \"unsupported_claims\": [],\n \"risk_flags\": []\n }\n\n\nThe important part is `source_fact_ids`.\n\nThe model should not only write text. It should say which input facts support the generated claim. That makes downstream validation much easier.\n\nOpenRouter structured outputs can help enforce the response shape:\n\n * OpenRouter structured outputs\n\n\n\nBut structured output is not the same as factual output.\n\nThis can be valid JSON and still be business-invalid:\n\n\n {\n \"bio\": \"Austin-based certified appliance repair specialist with same-day service.\",\n \"claims_used\": [\"certified\", \"same-day service\"],\n \"unsupported_claims\": []\n }\n\n\nIf the operator did not provide certification or availability facts, that content should be rejected even if the JSON is valid.\n\nSo I would split validation into:\n\n\n JSON/schema validation:\n checks shape\n\n business validation:\n checks factuality, forbidden claims, duplicates, SEO risk, and publishability\n\n\n* * *\n\n## 3. 
Build a fact pack before generation\n\nI would not send the raw operator record directly to the model.\n\nConvert raw operator data into a **fact pack** first.\n\nExample:\n\n\n {\n \"operator_id\": \"op_123\",\n \"allowed_facts\": [\n {\n \"id\": \"service_primary\",\n \"type\": \"service\",\n \"value\": \"appliance repair\"\n },\n {\n \"id\": \"location_primary\",\n \"type\": \"location\",\n \"value\": \"Austin, TX\"\n },\n {\n \"id\": \"experience_years\",\n \"type\": \"experience\",\n \"value\": 7\n },\n {\n \"id\": \"skill_1\",\n \"type\": \"skill\",\n \"value\": \"washer repair\"\n }\n ],\n \"forbidden_claims\": [\n \"licensed\",\n \"insured\",\n \"certified\",\n \"top-rated\",\n \"best\",\n \"guaranteed\",\n \"same-day service\",\n \"24/7 emergency service\",\n \"5-star reviews\"\n ],\n \"missing_fact_classes\": [\n \"insurance\",\n \"certifications\",\n \"reviews\",\n \"availability\",\n \"service_radius\"\n ],\n \"content_limits\": {\n \"max_bio_words\": 140,\n \"max_faq_count\": 2,\n \"allow_faq\": true\n }\n }\n\n\nMissing data should become explicit constraints.\n\nFor example:\n\n\n insurance = null\n\n\nshould become:\n\n\n Do not claim insured.\n\n\nAnd:\n\n\n reviews_summary = null\n\n\nshould become:\n\n\n Do not claim highly reviewed, 5-star, top-rated, or customer-loved.\n\n\nThe model should not decide what missing data means. The application should decide.\n\n* * *\n\n## 4. Use a multi-step generation flow\n\nI would avoid this:\n\n\n input row → one prompt → final paragraph → publish\n\n\nThat is fragile at scale.\n\nI would use a workflow:\n\n\n 1. Normalize operator input\n 2. Build fact pack\n 3. Decide content policy\n 4. Generate content plan\n 5. Validate content plan\n 6. Generate structured profile JSON\n 7. Validate schema\n 8. Validate factuality\n 9. Validate forbidden claims\n 10. Validate SEO/content quality\n 11. Check duplicate / near-duplicate risk\n 12. Repair or regenerate\n 13. Decide publishing state\n 14. 
Store content version + validation report\n\n\nThis is close to the workflow patterns described by Anthropic, especially prompt chaining and evaluator-optimizer:\n\n * Anthropic: Building Effective Agents\n\n\n\nThe model should not own the whole workflow.\n\nThe model can write the words.\nThe application should decide what is allowed, what is invalid, what needs review, and what can be published.\n\n* * *\n\n## 5. Insert a content-plan step\n\nBefore final content generation, I would ask for a plan.\n\nExample:\n\n\n {\n \"bio_plan\": {\n \"angle\": \"practical local appliance repair help\",\n \"facts_to_use\": [\n \"service_primary\",\n \"location_primary\",\n \"experience_years\",\n \"skill_1\"\n ],\n \"facts_to_avoid\": [\n \"insurance\",\n \"certifications\",\n \"reviews\",\n \"availability\"\n ]\n },\n \"faq_plan\": [\n {\n \"question_type\": \"service_scope\",\n \"source_fact_ids\": [\"service_primary\", \"skill_1\"]\n }\n ],\n \"skip_sections\": [\n {\n \"section\": \"certifications\",\n \"reason\": \"no certification facts were provided\"\n }\n ]\n }\n\n\nThen validate the plan before generating final copy.\n\nIf the plan already includes:\n\n\n certified technician\n same-day service\n top-rated\n 5-star reviews\n\n\nand those facts are not in the fact pack, reject the plan before the final content is generated.\n\n* * *\n\n## 6. 
Store validation reports\n\nFor every generated profile, I would store a validation report.\n\nExample:\n\n\n {\n \"schema\": {\n \"status\": \"pass\",\n \"errors\": []\n },\n \"factuality\": {\n \"status\": \"fail\",\n \"unsupported_claims\": [\n {\n \"claim\": \"insured\",\n \"reason\": \"insurance was not present in the fact pack\"\n }\n ]\n },\n \"forbidden_claims\": {\n \"status\": \"fail\",\n \"violations\": [\"insured\"]\n },\n \"seo_quality\": {\n \"status\": \"warn\",\n \"issues\": [\n \"FAQ answer is generic\",\n \"bio uses low-specificity wording\"\n ]\n },\n \"duplication\": {\n \"status\": \"pass\",\n \"nearest_profile_id\": \"op_987\",\n \"similarity\": 0.78\n },\n \"decision\": \"repair\"\n }\n\n\nThis report is useful for:\n\n * debugging failed generations\n * explaining why a profile went to review\n * improving prompts\n * comparing models\n * building future evals\n * creating future fine-tuning or preference data\n\n\n\nWithout validation reports, you only have “the model wrote something.”\nWith validation reports, you have a system you can improve.\n\n* * *\n\n## 7. Separate generated, public-unverified, verified, and published\n\nI would not use `READY` to mean “trusted.”\n\nI would separate these states:\n\nState | Meaning\n---|---\n`GENERATED_READY` | Generated and passed automated checks\n`PUBLIC_UNVERIFIED` | Publicly visible, but not yet verified by manual review or proof\n`VERIFIED` | Important operator facts have been verified\n`REVIEW_REQUIRED` | Should not be auto-published\n`PUBLISHED` | Currently rendered on the live page\n\nThe key distinction:\n\n\n generated != verified\n\n\nYour idea of generating quickly and adding a verified tag later is reasonable. I would just make that distinction explicit in the data model and UI.\n\nA profile can be generated in 1–2 minutes and shown as public-unverified.\nIt can become verified later after proof, human review, or platform verification.\n\n* * *\n\n## 8. 
Use risk-based review, not full human-in-the-loop\n\nI would not review every generated profile before publication unless the category is sensitive or legally risky.\n\nFull human-in-the-loop can be too slow for onboarding.\n\nInstead:\n\n\n Auto-publish as PUBLIC_UNVERIFIED if:\n - schema is valid\n - no unsupported claims\n - no forbidden claims\n - duplicate score is low\n - fact density is sufficient\n - no suspicious operator patterns\n - no high-risk service category\n\n\nSend to review if:\n\n\n REVIEW_REQUIRED if:\n - unsupported claims were detected\n - forbidden claims were detected\n - duplicate similarity is high\n - sparse input produced long output\n - repeated repair attempts failed\n - operator data looks suspicious\n - service category is high risk\n\n\nThis keeps onboarding fast while still protecting quality.\n\n* * *\n\n## 9. Treat SEO quality as a policy, not a prompt phrase\n\nI would avoid making the main instruction:\n\n\n Write SEO-friendly content.\n\n\nThat can produce filler, keyword stuffing, and city/service boilerplate.\n\nI would define the target as:\n\n\n useful, fact-grounded, operator-specific, non-duplicative content\n\n\nRelevant Google references:\n\n * Google Search: AI-generated content guidance\n * Google Search spam policies: scaled content abuse\n * Google Search: helpful, reliable, people-first content\n * Google Search update: March 2024\n\n\n\nThe risk is not “AI wrote it.”\nThe risk is generating many low-value, near-duplicate, weakly grounded pages.\n\nSEO/content quality gate:\n\n\n - Does this profile contain enough operator-specific facts?\n - Are service areas supported by input data?\n - Are FAQs grounded in actual facts?\n - Is the title/meta keyword-stuffed?\n - Is this page too similar to other city/service pages?\n - Is this sparse profile being inflated into a long page?\n - Should this page be short, noindex, or review-required until more facts are collected?\n\n\nMost important rule:\n\n\n Sparse inputs 
should produce short profiles, not inflated pages.\n\n\nIf the operator only provides a city and one service, do not generate a long bio and five FAQs. That creates both hallucination risk and SEO risk.\n\n* * *\n\n## 10. Measure uniqueness instead of asking for it\n\nI would not rely on this instruction:\n\n\n Write a unique description.\n\n\nI would measure uniqueness.\n\nLayer | Check\n---|---\n1 | normalized text hash\n2 | repeated phrase / sentence pattern\n3 | n-gram overlap\n4 | embedding similarity\n5 | same-city + same-service comparison\n6 | operator-data duplicate detection\n\nSince you already use PostgreSQL, pgvector is a practical option for vector similarity search.\n\nExample:\n\n\n SELECT\n id,\n operator_id,\n service,\n city,\n embedding <=> <QUERY_EMBEDDING> AS cosine_distance\n FROM operator_profile_versions\n WHERE service = <SERVICE>\n AND city = <CITY>\n ORDER BY embedding <=> <QUERY_EMBEDDING>\n LIMIT 10;\n\n\nPossible policy:\n\n\n if exact_hash_match:\n reject\n\n if ngram_overlap > threshold:\n regenerate\n\n if embedding_similarity > threshold and same_city_same_service:\n review_required\n\n if operator_data_duplicate_score > threshold:\n block_or_manual_review\n\n\nThe thresholds should come from your own data.\n\nKey idea:\n\n\n Uniqueness should be a measured property, not a prompt instruction.\n\n\n* * *\n\n## 11. Make async generation reliable\n\nYour workflow has this shape:\n\n\n 1. Save operator record in Postgres\n 2. Push generation job to queue\n\n\nThat creates a classic dual-write problem.\n\nThe DB write can succeed while queue publish fails. Or queue publish can happen twice. 
Or the worker can receive the same job more than once.\n\nI would use the transactional outbox pattern:\n\n * AWS: Transactional outbox pattern\n\n\n\nFlow:\n\n\n Spring Boot transaction:\n - save operator record\n - insert generation_job\n - insert outbox_event\n\n Outbox publisher:\n - reads unpublished outbox rows\n - sends message to SQS or worker queue\n - marks outbox row as published\n\n Worker:\n - consumes job\n - checks idempotency key\n - builds fact pack\n - generates content\n - validates content\n - writes content version + validation report\n\n\nIf you use SQS Standard queues, design for at-least-once delivery. AWS documents that messages may be delivered more than once and consumers should be idempotent:\n\n * Amazon SQS at-least-once delivery\n * Amazon SQS queue types\n\n\n\nJob payload:\n\n\n {\n \"job_id\": \"<JOB_ID>\",\n \"operator_id\": \"<OPERATOR_ID>\",\n \"input_hash\": \"<INPUT_HASH>\",\n \"fact_pack_hash\": \"<FACT_PACK_HASH>\",\n \"prompt_version\": \"<PROMPT_VERSION>\",\n \"schema_version\": \"<SCHEMA_VERSION>\",\n \"attempt_number\": 1,\n \"idempotency_key\": \"<IDEMPOTENCY_KEY>\"\n }\n\n\n* * *\n\n## 12. 
Build private evals before choosing the model\n\nPublic leaderboards are useful for discovery, but they do not measure your exact task.\n\nI would create an offline eval set:\n\n\n 100-300 real or representative operator records\n\n\nInclude difficult cases:\n\n\n - rich operator data\n - sparse operator data\n - same city + same service\n - missing rate\n - missing experience\n - missing insurance\n - missing certifications\n - no reviews\n - ambiguous service area\n - bot-like duplicate registrations\n\n\nEvaluate models and prompts on:\n\n\n schema_pass_rate\n unsupported_claim_rate\n forbidden_claim_rate\n required_fact_inclusion_rate\n duplicate_risk_rate\n sparse_profile_inflation_rate\n faq_grounding_rate\n repair_attempts_per_accepted_output\n human_acceptance_rate\n latency\n accepted_output_cost\n\n\nReferences:\n\n * OpenAI: Working with evals\n * OpenAI: Evaluation best practices\n * OpenAI Evals GitHub\n * Promptfoo\n\n\n\nDo not choose based on five nice-looking examples.\n\nChoose based on accepted-output cost:\n\n\n accepted_output_cost =\n first_generation_cost\n + repair_generation_cost\n + validation_cost\n + duplicate_regeneration_cost\n + human_review_cost (if triggered)\n\n\nA model that is cheaper per call may still cost more per accepted output if it triggers more repairs and reviews.\n\n* * *\n\n## 13. 
Model shortlist I would test\n\nI would still avoid choosing the model from public vibes alone.\n\nBut if I had to build an initial shortlist, I would test models that cover different tradeoffs:\n\nCandidate | Why test it\n---|---\n`google/gemma-4-26B-A4B-it` | First practical candidate; strong size/performance profile\n`google/gemma-4-31B-it` | Gemma-family quality ceiling\n`Qwen/Qwen3.6-27B` | Dense 27B challenger\n`Qwen/Qwen3.6-35B-A3B` | Efficient MoE challenger\n`mistralai/Mistral-Small-4-119B-2603` | Heavier quality comparison\n`CohereLabs/command-a-plus-05-2026-w4a4` | Enterprise/business prose comparison\n`moonshotai/Kimi-K2-Instruct-0905` | Upper-bound comparison\n`meta-llama/Llama-3.3-70B-Instruct` | Stable baseline\n\n### Why Gemma 4 should be included\n\nI would definitely include the Gemma 4 family, especially:\n\n\n google/gemma-4-26B-A4B-it\n google/gemma-4-31B-it\n\n\n`google/gemma-4-26B-A4B-it` is interesting because it is a Mixture-of-Experts model. OpenRouter describes it as 25.2B total parameters with only 3.8B active per token, 256K context, structured output support, function calling, reasoning mode, and Apache 2.0 licensing:\n\n * OpenRouter: Gemma 4 26B A4B\n * Hugging Face: google/gemma-4-26B-A4B-it\n * Google Gemma 4 model card\n\n\n\nFor this task, I would treat it as the first practical candidate.\n\nI would use:\n\n\n Gemma 4 26B A4B:\n first model to try\n strong size/performance candidate\n good API-evaluation candidate\n\n Gemma 4 31B:\n quality ceiling inside Gemma 4\n useful to check whether A4B loses anything important\n\n\n### Why Qwen3.6 should be included\n\nI would also test:\n\n\n Qwen/Qwen3.6-27B\n Qwen/Qwen3.6-35B-A3B\n\n\n`Qwen/Qwen3.6-27B` is a strong dense comparison point. 
Its model card says the artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, and similar runtimes:\n\n * Hugging Face: Qwen/Qwen3.6-27B\n * Qwen blog: Qwen3.6-27B\n\n\n\nI would test it for:\n\n\n - instruction following\n - JSON/schema stability\n - factual discipline\n - natural business prose\n - repair rate\n\n\n`Qwen/Qwen3.6-35B-A3B` is also worth testing as an efficient MoE-style challenger:\n\n * Hugging Face: Qwen/Qwen3.6-35B-A3B\n\n\n\n### Why Mistral Small 4 and Command A+ might be useful\n\nI would include Mistral Small 4 if budget and latency allow it:\n\n * Hugging Face: mistralai/Mistral-Small-4-119B-2603\n * OpenRouter: Mistral Small 4\n\n\n\nI would include it not because every profile needs heavy reasoning, but to see whether a stronger model reduces:\n\n\n - unsupported claims\n - repair attempts\n - generic filler\n - duplicate-like phrasing\n - awkward FAQ output\n\n\nI would also test Command A+ if business/enterprise prose is important:\n\n * Hugging Face: CohereLabs/command-a-plus-05-2026-w4a4\n * Cohere: Introducing Command A+\n\n\n\nCommand A+ is interesting as an enterprise/business prose comparison, not necessarily as the first production choice.\n\n### How I would choose\n\nI would not choose based on first-generation prose quality alone.\n\nFor each model, I would measure:\n\nMetric | Meaning\n---|---\n`schema_pass_rate` | Does it follow the JSON contract?\n`unsupported_claim_rate` | Does it invent facts?\n`forbidden_claim_rate` | Does it output banned claims?\n`duplicate_risk_rate` | Does it produce near-duplicate text?\n`sparse_profile_inflation_rate` | Does it inflate weak input?\n`repair_attempts_per_accepted_output` | How often does it need fixing?\n`human_acceptance_rate` | Do reviewers accept it?\n`accepted_output_cost` | True cost after validation/repair/review\n\nMy initial practical bet would be:\n\n\n Start with:\n google/gemma-4-26B-A4B-it\n Qwen/Qwen3.6-27B\n 
mistralai/Mistral-Small-4-119B-2603\n\n Then expand to:\n google/gemma-4-31B-it\n Qwen/Qwen3.6-35B-A3B\n CohereLabs/command-a-plus-05-2026-w4a4\n\n\nIf Gemma 4 26B A4B gives strong validation pass rates and low repair rates, I would favor it as the first production candidate because of its size/performance profile.\n\nIf Qwen3.6 follows constraints better, I would choose Qwen.\n\nIf Mistral Small 4 dramatically reduces unsupported claims and repair attempts, I would consider paying more for it.\n\nThe model decision should come after the pipeline exists, because the pipeline defines what “good” means.\n\n* * *\n\n## 14. Improve operator input UX\n\nIf the operator data is weak, the model has only two safe choices:\n\n\n write short content\n or ask for more data\n\n\nThe unsafe choice is:\n\n\n inflate sparse data into a long profile\n\n\nSo I would improve the onboarding form.\n\nCollect structured fields like:\n\n\n - primary service\n - secondary services\n - city / service area\n - years of experience\n - license / certification\n - insurance\n - languages\n - availability\n - rate / price range\n - specialties\n - customer type\n - examples of work\n - short self-written note\n - proof fields for verified claims\n\n\nThen use a fact density score:\n\nFact density | Content policy\n---|---\nhigh | full profile, services, FAQ, SEO title/meta\nmedium | shorter bio, limited FAQ\nlow | short profile only, ask for more facts, maybe public-unverified or noindex\n\nThis may improve SEO quality more than changing the model.\n\nThe best way to make useful pages is to collect useful facts.\n\n* * *\n\n## 15. 
Use content versioning\n\nDo not overwrite generated content in place.\n\nPossible tables:\n\n\n operator\n operator_profile\n operator_profile_version\n generation_job\n generation_outbox\n generation_validation_report\n profile_embedding\n manual_review_task\n operator_edit\n\n\nEach generated version should store:\n\n\n operator_id\n profile_version_id\n generated_json\n published_json\n validation_report\n source_fact_hash\n prompt_version\n schema_version\n provider\n model\n generation_params\n created_at\n published_at\n verified_at\n\n\nThis matters because:\n\n * the operator may edit AI content\n * the platform may verify claims later\n * a new model may regenerate content\n * reviewers may approve or reject changes\n * you need rollback\n * edits become useful future eval/fine-tuning data\n\n\n\n* * *\n\n## 16. Do not start with fine-tuning\n\nFine-tuning can help later, but I would not start there.\n\nFirst build:\n\n\n - content schema\n - fact pack\n - validators\n - duplicate checks\n - private evals\n - validation reports\n - review states\n\n\nOnly after that would I consider fine-tuning.\n\nLater, you can use:\n\n\n operator facts\n + generated output\n + validation report\n + operator edits\n + reviewer decisions\n\n\nto create:\n\n\n SFT data:\n fact pack → good structured profile JSON\n\n Preference data:\n chosen good output vs rejected bad output\n\n Verifier data:\n fact pack + generated profile → validation report\n\n\nIf you fine-tune, I would start with LoRA/QLoRA rather than full fine-tuning:\n\n * Hugging Face PEFT LoRA docs\n * Hugging Face PEFT quantization / QLoRA guide\n * TRL SFTTrainer\n * TRL DPOTrainer\n\n\n\nBut that is a later phase.\n\n* * *\n\n## Practical build order\n\n### Phase 1: Offline prototype\n\n\n 1. Collect 100-300 representative operator records\n 2. Define content schema\n 3. Define fact pack schema\n 4. Define forbidden claims\n 5. Generate outputs with 2-4 models\n 6. Validate schema\n 7. 
Validate factuality\n 8. Check duplicate risk\n 9. Human-review 30-50 outputs\n 10. Tune prompt/schema/validators\n\n\n### Phase 2: MVP generation pipeline\n\n\n 1. Add generation_job table\n 2. Add content version table\n 3. Add validation report table\n 4. Add outbox table\n 5. Add worker\n 6. Add OpenRouter adapter\n 7. Add structured output\n 8. Add schema validation\n 9. Add basic fact/forbidden-claim checks\n 10. Add repair loop\n\n\n### Phase 3: SEO and duplicate safety\n\n\n 1. Add fact density scoring\n 2. Add sparse profile policy\n 3. Add n-gram duplicate checks\n 4. Add embeddings\n 5. Add pgvector similarity search\n 6. Add same-city/service duplicate policy\n 7. Add noindex/review-required rules for weak pages\n\n\n### Phase 4: Review and verification\n\n\n 1. Add PUBLIC_UNVERIFIED state\n 2. Add REVIEW_REQUIRED state\n 3. Add VERIFIED state\n 4. Add reviewer UI\n 5. Add operator edit UI\n 6. Store edits and review decisions\n\n\n### Phase 5: Model and tuning improvements\n\n\n 1. Run private evals regularly\n 2. Compare models by accepted-output cost\n 3. Add best-of-N generation if needed\n 4. Build verifier/reward model if useful\n 5. 
Consider LoRA/QLoRA or DPO after enough data exists\n\n\n* * *\n\n## What I would avoid\n\nI would avoid:\n\n\n input row → one prompt → final paragraph → publish\n\n\nI would avoid treating `READY` as trusted.\n\nI would avoid writing long content for sparse operators.\n\nI would avoid asking the model to make pages unique without measuring duplication.\n\nI would avoid making “SEO-friendly” the main instruction.\n\nI would avoid fine-tuning before you have evals and validation data.\n\nI would avoid coupling business logic directly to one LLM provider.\n\n* * *\n\n## Final summary\n\nIf I were building this from scratch, I would build a system that controls whether generated content is:\n\n\n allowed\n grounded\n distinct\n useful\n publishable\n reviewable\n verifiable\n versioned\n\n\nThe LLM is only the prose-generation component.\n\nMy first priorities would be:\n\n\n 1. content contract\n 2. fact pack\n 3. structured output\n 4. validation report\n 5. duplicate scoring\n 6. SEO/content quality policy\n 7. public-unverified vs verified states\n 8. private evals\n 9. reliable async jobs\n 10. operator input improvement\n 11. model comparison by accepted-output cost\n\n\nThe central rule:\n\n> The model can write the words, but the application should own the truth, consistency, publishing policy, and quality gates.",
"title": "Need generative model, high-quality description generation"
}