Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihghqhbg2u3auidn7nuax4jmpiscttsu554kuhkhl2xskrkwnwsyy",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mmst6diyuih2"
  },
  "path": "/t/need-generative-model-high-quality-description-generation/176230#post_2",
  "publishedAt": "2026-05-27T04:17:09.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "vLLM OpenAI-compatible server",
    "vLLM structured outputs",
    "Anthropic: Building Effective Agents",
    "OpenAI: Evaluation best practices",
    "OpenAI: Working with evals",
    "OpenAI Cookbook: Eval driven system design",
    "Google Search: AI-generated content guidance",
    "Google Search spam policies: scaled content abuse",
    "OpenAI Evals GitHub",
    "Your AI Product Needs Evals",
    "Creating a LLM-as-a-Judge That Drives Business Results",
    "Building LLM applications for production",
    "Hamel Husain: LLM-as-a-Judge",
    "Pydantic AI output docs",
    "Instructor",
    "Guardrails AI",
    "pgvector",
    "Hamel Husain: Your AI Product Needs Evals",
    "Promptfoo",
    "Hugging Face Hub: download files and pin revisions",
    "Google Search: helpful, reliable, people-first content",
    "Google Search update: March 2024",
    "Ahrefs: Programmatic SEO",
    "Ahrefs: Duplicate content",
    "Celery documentation",
    "Temporal",
    "Temporal Python SDK error handling"
  ],
  "textContent": "If you don’t insist on having the LLM complete everything by itself, this may be simpler:\n\n* * *\n\n## Short answer\n\nI would treat this less as a “find the perfect generative model” problem and more as a **pipeline design** problem.\n\nFor this kind of description-generation task, I would probably use **vLLM** as the inference backend, run one or more reputable Hugging Face models behind it, and put most of the engineering effort into the surrounding pipeline:\n\n  1. normalize the structured input,\n  2. build a fact pack,\n  3. generate a content plan,\n  4. generate the description,\n  5. validate factuality,\n  6. validate style and banned claims,\n  7. check near-duplicates,\n  8. repair or regenerate,\n  9. log model/prompt/schema versions and eval results.\n\n\n\nThe model still matters, of course. But if the task is to generate thousands of profile/location/service descriptions, the main risk is usually not “the paragraph is not poetic enough.” The main risks are:\n\n  * unsupported facts,\n  * generic filler,\n  * near-duplicate pages,\n  * unsafe claims,\n  * SEO-thin pages,\n  * inability to compare model/prompt changes later.\n\n\n\nSo I would keep the model swappable and make the pipeline the main product.\n\nUseful references:\n\n  * vLLM OpenAI-compatible server\n  * vLLM structured outputs\n  * Anthropic: Building Effective Agents\n  * OpenAI: Evaluation best practices\n  * OpenAI: Working with evals\n  * OpenAI Cookbook: Eval driven system design\n  * Google Search: AI-generated content guidance\n  * Google Search spam policies: scaled content abuse\n\n\n\n* * *\n\n## Why I would not optimize only for “the best model”\n\nThere are many decent open models on Hugging Face now. 
Some Qwen, Llama, Mistral, Gemma, and Command-family models can produce good profile or marketing prose.\n\nBut for this use case, a better model alone does not solve the main operational problems.\n\nA stronger model may still:\n\n  * hallucinate credentials,\n  * add unsupported service areas,\n  * overstate experience,\n  * invent availability,\n  * invent review quality,\n  * produce generic SEO-ish filler,\n  * repeat similar sentence structures across thousands of pages,\n  * silently change behavior after a model, prompt, or runtime update,\n  * produce good-looking text that fails a business rule.\n\n\n\nThat is why I would avoid a pure “model shootout” approach.\n\nA model shootout is still useful, but only after defining task-specific evals. General benchmark strength is not the same as quality on this exact task.\n\nOpenAI’s eval guidance is useful here because it frames evals as a way to test AI systems despite generative variability:\n\n  * OpenAI: Evaluation best practices\n  * OpenAI: Working with evals\n  * OpenAI Evals GitHub\n\n\n\nHamel Husain’s writing is also useful from a practical engineering point of view:\n\n  * Your AI Product Needs Evals\n  * Creating a LLM-as-a-Judge That Drives Business Results\n\n\n\nChip Huyen’s production LLM article is also a good reference for the idea that LLM applications should be tested as systems, not just prompts:\n\n  * Building LLM applications for production\n\n\n\nThe short version:\n\n> Do not ask “which model writes the nicest description?” first.\n>  Ask “which pipeline reliably turns structured facts into useful, factual, non-duplicative descriptions?”\n\n* * *\n\n## Proposed backend shape\n\nI would use this architecture:\n\n\n    Admin / API\n      ↓\n    FastAPI\n      ↓\n    Postgres\n      ↓\n    Celery or Temporal\n      ↓\n    Workers\n      ├─ normalize_input\n      ├─ build_fact_pack\n      ├─ generate_content_plan\n      ├─ generate_description\n      ├─ fact_check\n      ├─ style_check\n    
  ├─ duplicate_check\n      ├─ repair_or_regenerate\n      └─ publish_or_export\n            ↓\n    vLLM OpenAI-compatible server\n            ↓\n    HF model weights\n\n\nSuggested starting stack:\n\n\n    Inference:\n      vLLM\n\n    API:\n      FastAPI\n\n    Database:\n      Postgres\n\n    Vector similarity:\n      pgvector\n\n    Queue / jobs:\n      Celery + Redis for MVP\n      Temporal later if workflows become complex\n\n    Validation:\n      Pydantic\n      Instructor or similar structured-output helper\n\n    Storage:\n      S3 / R2 / MinIO\n\n    Monitoring:\n      structured logs\n      token/latency/cost counters\n      eval dashboards\n\n\nWhy vLLM?\n\nvLLM gives you an OpenAI-compatible HTTP server, which makes it easier to keep your application code stable while swapping the underlying HF model:\n\n  * vLLM OpenAI-compatible server\n\n\n\nIt also supports structured outputs, which is useful if you want the model to return a schema like this:\n\n\n    {\n      \"content_plan\": {\n        \"angle\": \"experienced bilingual local technician\",\n        \"paragraphs\": [\n          \"Introduce the service and location\",\n          \"Mention supported skills and experience\",\n          \"Close with practical customer benefit\"\n        ]\n      },\n      \"included_facts\": [\n        \"Austin, TX\",\n        \"7 years of experience\",\n        \"washer repair\",\n        \"dryer repair\",\n        \"$85/hour\"\n      ],\n      \"unsupported_claims\": [],\n      \"final_description\": \"<generated description>\"\n    }\n\n\nReference:\n\n  * vLLM structured outputs\n\n\n\nThe point is not that structured output magically guarantees truth. It does not. 
The point is that it gives the rest of your application something inspectable.\n\n* * *\n\n## Why a pipeline fits this task better than one-shot generation\n\nThis task is a good match for a fixed workflow.\n\nAnthropic’s “Building Effective Agents” post is useful here because it separates relatively deterministic **workflows** from more open-ended **agents**. In particular, it describes:\n\n  * prompt chaining,\n  * routing,\n  * parallelization,\n  * orchestrator-workers,\n  * evaluator-optimizer.\n\n\n\nReference:\n\n  * Anthropic: Building Effective Agents\n\n\n\nFor this problem, I would use something closer to **prompt chaining** and **evaluator-optimizer**, not a fully autonomous agent.\n\nA simple generation pipeline might look like this:\n\n\n    Raw row\n      ↓\n    Normalized facts\n      ↓\n    Fact pack\n      ↓\n    Content plan\n      ↓\n    Draft description\n      ↓\n    Factuality check\n      ↓\n    Style / banned-claim check\n      ↓\n    Duplicate check\n      ↓\n    Repair or regenerate\n      ↓\n    Approved output\n\n\nThat is easier to test than a giant prompt that says:\n\n\n    Write a unique, high-quality, SEO-friendly, factual local service description.\n\n\nThe giant prompt may work for 20 examples. 
It is much less safe for 10,000+ examples.\n\n* * *\n\n## Step 1: Normalize the input first\n\nBefore calling the LLM, normalize the input into a strict schema.\n\nExample:\n\n\n    {\n      \"profile_id\": \"<PROFILE_ID>\",\n      \"service\": \"appliance repair\",\n      \"city\": \"Austin\",\n      \"state\": \"TX\",\n      \"rate\": {\n        \"amount\": 85,\n        \"currency\": \"USD\",\n        \"unit\": \"hour\"\n      },\n      \"experience_years\": 7,\n      \"skills\": [\n        \"washer repair\",\n        \"dryer repair\",\n        \"refrigerator diagnostics\"\n      ],\n      \"languages\": [\n        \"English\",\n        \"Spanish\"\n      ],\n      \"certifications\": [],\n      \"insurance\": null,\n      \"reviews_summary\": null\n    }\n\n\nThis is not just cleanup. It prevents the model from guessing what missing fields mean.\n\nFor example:\n\n  * if `certifications` is empty, do not allow “certified”;\n  * if `insurance` is null, do not allow “insured”;\n  * if `reviews_summary` is null, do not allow “highly reviewed” or “5-star”;\n  * if no availability is provided, do not allow “same-day service”;\n  * if no service radius is provided, do not invent nearby cities.\n\n\n\nThe LLM should receive not only the raw facts but also the allowed and forbidden claims.\n\n* * *\n\n## Step 2: Build a fact pack\n\nI would explicitly build a fact pack before writing.\n\nExample:\n\n\n    {\n      \"allowed_claims\": [\n        \"The provider offers appliance repair in Austin, TX.\",\n        \"The provider has 7 years of experience.\",\n        \"The provider handles washer repair, dryer repair, and refrigerator diagnostics.\",\n        \"The provider speaks English and Spanish.\",\n        \"The listed rate is $85/hour.\"\n      ],\n      \"forbidden_claims\": [\n        \"licensed\",\n        \"insured\",\n        \"certified\",\n        \"top-rated\",\n        \"best in Austin\",\n        \"guaranteed same-day service\",\n        \"5-star 
reviews\",\n        \"background checked\",\n        \"family-owned\",\n        \"emergency service\"\n      ],\n      \"missing_fields\": [\n        \"certifications\",\n        \"insurance\",\n        \"reviews\",\n        \"availability\",\n        \"service_radius\"\n      ]\n    }\n\n\nThis makes the generation task much easier:\n\n> Write a description using only these allowed claims.\n>  Do not use any forbidden claims.\n>  Omit missing facts naturally.\n\nThis is also useful for auditing later.\n\nIf a generated page says “insured”, you can check whether `insured` was ever present in the fact pack. If it was not, the output is invalid.\n\n* * *\n\n## Step 3: Generate a content plan before final prose\n\nInstead of asking for the final description immediately, ask the model to make a small plan.\n\nExample output:\n\n\n    {\n      \"angle\": \"practical local appliance repair help\",\n      \"paragraph_plan\": [\n        {\n          \"goal\": \"Introduce service, location, and main skills\",\n          \"facts_to_use\": [\"service\", \"city\", \"state\", \"skills\"]\n        },\n        {\n          \"goal\": \"Mention experience and rate without sounding salesy\",\n          \"facts_to_use\": [\"experience_years\", \"rate\"]\n        },\n        {\n          \"goal\": \"Close with a customer-oriented sentence\",\n          \"facts_to_use\": [\"languages\"]\n        }\n      ],\n      \"style_constraints\": [\n        \"professional\",\n        \"plainspoken\",\n        \"no exaggerated marketing claims\",\n        \"no unsupported credentials\"\n      ]\n    }\n\n\nThis intermediate step gives you something to validate before prose generation.\n\nIf the plan already includes “certified technician” but the fact pack has no certification, reject the plan before generating the final text.\n\n* * *\n\n## Step 4: Generate the description\n\nThen generate the actual description.\n\nExample prompt shape:\n\n\n    You write local service marketplace profile 
descriptions.\n\n    Use ONLY the facts in FACT_PACK.\n    Do not invent credentials, awards, insurance, guarantees, reviews, availability, service radius, or ranking claims.\n    If a fact is missing, omit it naturally.\n\n    Write in a warm, professional, human style.\n    Avoid clichés such as:\n    - dedicated professional\n    - top-notch\n    - go-to expert\n    - best in the area\n    - unparalleled service\n    - committed to excellence\n\n    Return JSON matching OUTPUT_SCHEMA.\n\n    FACT_PACK:\n    <FACT_PACK>\n\n    CONTENT_PLAN:\n    <CONTENT_PLAN>\n\n    OUTPUT_SCHEMA:\n    <OUTPUT_SCHEMA>\n\n\nThis is more controllable than:\n\n\n    Write a high-quality profile description.\n\n\n* * *\n\n## Step 5: Validate factuality\n\nAfter generating the description, validate it.\n\nI would start with a combination of:\n\n  1. deterministic checks,\n  2. schema checks,\n  3. LLM-based claim checking,\n  4. sampled human review.\n\n\n\nExample deterministic check:\n\n\n    BANNED_PHRASES = [\n        \"licensed\",\n        \"insured\",\n        \"certified\",\n        \"top-rated\",\n        \"best\",\n        \"guaranteed\",\n        \"same-day\",\n        \"5-star\",\n        \"award-winning\",\n    ]\n\n    def banned_phrase_check(text: str, allowed_claims: list[str]) -> list[str]:\n        violations = []\n        lower_text = text.lower()\n\n        for phrase in BANNED_PHRASES:\n            if phrase in lower_text and not any(phrase in claim.lower() for claim in allowed_claims):\n                violations.append(phrase)\n\n        return violations\n\n\nExample LLM verifier output:\n\n\n    {\n      \"status\": \"fail\",\n      \"unsupported_claims\": [\n        {\n          \"claim\": \"offers same-day service\",\n          \"reason\": \"availability was not present in the input facts\"\n        }\n      ],\n      \"missing_required_facts\": [],\n      \"recommended_action\": \"repair\"\n    }\n\n\nThis is where an evaluator-optimizer pattern becomes 
useful:\n\n  * writer generates,\n  * verifier checks,\n  * repair model fixes only the invalid parts,\n  * final validator runs again.\n\n\n\nUseful references:\n\n  * Anthropic: Building Effective Agents\n  * Hamel Husain: LLM-as-a-Judge\n  * Pydantic AI output docs\n  * Instructor\n  * Guardrails AI\n\n\n\nImportant caveat: do not blindly trust an LLM judge. Use it as one signal. For critical rules, use deterministic checks too.\n\n* * *\n\n## Step 6: Validate style\n\nThe style checker should not only ask “is this good writing?”\n\nIt should check task-specific failure modes:\n\n  * Does it sound like generic SEO filler?\n  * Does it repeat common marketing clichés?\n  * Is it too similar to the template?\n  * Does it overpromise?\n  * Does it mention unavailable facts?\n  * Is it useful to a real customer?\n\n\n\nExample style checker output:\n\n\n    {\n      \"status\": \"fail\",\n      \"issues\": [\n        {\n          \"type\": \"cliche\",\n          \"span\": \"dedicated professional\",\n          \"reason\": \"overused generic phrase\"\n        },\n        {\n          \"type\": \"thin_content\",\n          \"span\": \"provides quality service for all your needs\",\n          \"reason\": \"generic phrase that adds no profile-specific value\"\n        }\n      ],\n      \"recommended_action\": \"repair\"\n    }\n\n\nRepair prompt:\n\n\n    Revise the description to remove the listed style issues.\n    Do not add new facts.\n    Preserve all valid factual claims.\n    Do not change city, state, service, rate, years of experience, skills, or languages.\n\n    DESCRIPTION:\n    <DESCRIPTION>\n\n    STYLE_ISSUES:\n    <STYLE_ISSUES>\n\n\n* * *\n\n## Step 7: Check duplicates and near-duplicates\n\nFor 10,000+ generated pages, exact duplicates are not the only problem.\n\nYou also need to catch near-duplicates like:\n\n  * same paragraph structure,\n  * same opening line with only city/service swapped,\n  * same conclusion sentence,\n  * same generic 
claims,\n  * same semantic content in different words.\n\n\n\nI would use multiple layers:\n\n\n    Layer 1:\n      normalized text hash\n\n    Layer 2:\n      n-gram overlap\n\n    Layer 3:\n      embedding similarity\n\n    Layer 4:\n      same city + same service group comparison\n\n    Layer 5:\n      sampled human review\n\n\nFor embedding similarity, pgvector is a practical starting point because it lets you store vectors alongside normal Postgres data.\n\nReference:\n\n  * pgvector\n\n\n\nExample table:\n\n\n    CREATE TABLE profile_outputs (\n        id BIGSERIAL PRIMARY KEY,\n        profile_id TEXT NOT NULL,\n        service TEXT NOT NULL,\n        city TEXT NOT NULL,\n        state TEXT NOT NULL,\n        output_text TEXT NOT NULL,\n        embedding vector(768),\n        model_repo TEXT NOT NULL,\n        model_revision TEXT,\n        prompt_version TEXT NOT NULL,\n        schema_version TEXT NOT NULL,\n        created_at TIMESTAMPTZ DEFAULT now()\n    );\n\n\nExample duplicate check:\n\n\n    SELECT\n        id,\n        profile_id,\n        service,\n        city,\n        output_text,\n        embedding <=> <QUERY_EMBEDDING> AS cosine_distance\n    FROM profile_outputs\n    WHERE service = <SERVICE>\n      AND city = <CITY>\n    ORDER BY embedding <=> <QUERY_EMBEDDING>\n    LIMIT 10;\n\n\nThe exact thresholds need empirical tuning. 
For example:\n\n\n    if exact_hash_match:\n        reject\n\n    if ngram_overlap > 0.65:\n        regenerate\n\n    if embedding_similarity > 0.92 and same_service:\n        regenerate_with_new_angle\n\n    if embedding_similarity > 0.86 and same_city_same_service:\n        send_to_review\n\n\nThe thresholds above are placeholders, not universal constants.\n\n* * *\n\n## Step 8: Use evals before model selection\n\nI would select models only after defining task-specific evals.\n\nExample eval set:\n\nEval | What it catches | Example failure\n---|---|---\nSchema validity | Invalid JSON or missing fields | no `final_description` field\nFactuality | Claims not in input | “insured” when not provided\nRequired facts | Important facts omitted | city or service missing\nForbidden claims | Risky words or claims | “best”, “certified”, “guaranteed”\nStyle | Generic filler | “for all your needs”\nDuplication | Too similar to existing pages | same paragraph pattern\nHelpfulness | Thin or useless page | no concrete differentiating facts\n\nModel comparison should then look like:\n\n\n    Model A:\n      factuality pass: 94%\n      schema pass: 98%\n      duplicate fail: 12%\n      style fail: 18%\n      average repair attempts: 0.7\n      average latency: 1.2s\n      average output tokens: 170\n\n    Model B:\n      factuality pass: 91%\n      schema pass: 99%\n      duplicate fail: 6%\n      style fail: 10%\n      average repair attempts: 0.5\n      average latency: 1.8s\n      average output tokens: 165\n\n\nThis is much more useful than:\n\n\n    Model A sounds better than Model B in a few examples.\n\n\nUseful references:\n\n  * OpenAI: Evaluation best practices\n  * OpenAI: Working with evals\n  * OpenAI Cookbook: Eval driven system design\n  * Hamel Husain: Your AI Product Needs Evals\n  * Promptfoo\n\n\n\n* * *\n\n## Step 9: Keep model/prompt/schema versions\n\nSave enough metadata to reproduce or debug each output.\n\nMinimum metadata:\n\n\n    {\n      \"profile_id\": 
\"<PROFILE_ID>\",\n      \"output_id\": \"<OUTPUT_ID>\",\n      \"model_repo\": \"<MODEL_REPO>\",\n      \"model_revision\": \"<MODEL_REVISION>\",\n      \"runtime\": \"vLLM\",\n      \"runtime_version\": \"<VLLM_VERSION>\",\n      \"prompt_version\": \"<PROMPT_VERSION>\",\n      \"schema_version\": \"<SCHEMA_VERSION>\",\n      \"temperature\": 0.4,\n      \"top_p\": 0.9,\n      \"max_tokens\": 500,\n      \"input_hash\": \"<INPUT_HASH>\",\n      \"fact_pack_hash\": \"<FACT_PACK_HASH>\",\n      \"created_at\": \"<TIMESTAMP>\"\n    }\n\n\nFor HF models, I would also pin the model revision or commit when testing and recording results.\n\nReference:\n\n  * Hugging Face Hub: download files and pin revisions\n\n\n\nThis matters because otherwise you cannot answer:\n\n  * Did quality change because the model changed?\n  * Did the prompt change?\n  * Did the input data change?\n  * Did the validation rules change?\n  * Did the runtime change?\n  * Which outputs need regeneration?\n\n\n\n* * *\n\n## Step 10: Be careful with SEO / programmatic content\n\nIf this is for many local-service pages, do not think only about naturalness.\n\nThink about usefulness and uniqueness.\n\nGoogle’s guidance is important here. 
Google says generative AI can be useful for research and structuring content, but using generative AI or similar tools to generate many pages without adding value for users may violate its scaled content abuse policy.\n\nReferences:\n\n  * Google Search: AI-generated content guidance\n  * Google Search spam policies: scaled content abuse\n  * Google Search: helpful, reliable, people-first content\n  * Google Search update: March 2024\n\n\n\nSo I would not frame the pipeline as:\n\n\n    Generate lots of unique-looking pages.\n\n\nI would frame it as:\n\n\n    Generate useful profile descriptions from real structured facts,\n    reject unsupported claims,\n    detect thin/duplicative pages,\n    and avoid publishing pages that do not add user value.\n\n\nFor programmatic SEO context, these are useful:\n\n  * Ahrefs: Programmatic SEO\n  * Ahrefs: Duplicate content\n\n\n\nFor local-service profile pages, the page should ideally have real differentiators, not just paraphrased boilerplate:\n\n  * service category,\n  * city and state,\n  * actual skills,\n  * years of experience,\n  * real rate information,\n  * real credentials if available,\n  * real languages,\n  * real availability if available,\n  * real review summary if available,\n  * real examples of work if available.\n\n\n\nIf most rows do not contain enough differentiating data, the pipeline should not hide that problem with fluent prose. It should flag those rows as low-information.\n\n* * *\n\n## Suggested implementation path\n\nI would start small.\n\n### Phase 1: Offline evaluation\n\nTake 100–300 representative rows.\n\nInclude edge cases:\n\n  * missing rate,\n  * missing experience,\n  * many skills,\n  * only one skill,\n  * no certifications,\n  * has certification,\n  * multiple languages,\n  * high-overlap rows,\n  * same city and service,\n  * sparse profiles.\n\n\n\nRun 2–4 candidate HF models behind vLLM.\n\nDo not judge only by reading samples. 
Run evals.\n\nOutputs from this phase:\n\n\n    - prompt v1\n    - fact schema v1\n    - output schema v1\n    - validation rules v1\n    - duplicate thresholds v0\n    - model comparison table\n    - human review notes\n\n\n### Phase 2: MVP backend\n\nBuild:\n\n\n    FastAPI\n    Postgres\n    pgvector\n    Celery + Redis\n    vLLM\n    Pydantic / Instructor\n\n\nCelery is a reasonable MVP queue because it is a mature distributed task queue:\n\n  * Celery documentation\n\n\n\nPostgres + pgvector is enough for initial metadata + vector similarity:\n\n  * pgvector\n\n\n\n### Phase 3: Add repair loops and review queues\n\nAdd statuses like:\n\n\n    pending\n    generating\n    validating\n    repairing\n    duplicate_check\n    review_required\n    approved\n    rejected\n    published\n\n\nAdd separate queues:\n\n\n    generation\n    validation\n    embedding\n    repair\n    export\n\n\nAdd max attempt counts:\n\n\n    max_generation_attempts: 3\n    max_repair_attempts: 2\n    human_review_after: 2 failed repair attempts\n\n\n### Phase 4: Move to durable workflows if needed\n\nIf the workflow becomes more complex, Temporal may be a better fit than Celery for the whole process.\n\nTemporal is useful when you need durable execution, retries, and recovery across long-running workflows:\n\n  * Temporal\n  * Temporal Python SDK error handling\n\n\n\nI would not necessarily start with Temporal if the team wants a quick MVP. 
But if human review, partial reruns, repair loops, and auditability become central, Temporal becomes attractive.\n\n* * *\n\n## Example pipeline contract\n\nA useful contract is:\n\n\n    The model is allowed to write prose.\n    The application owns facts, rules, validation, retries, and publishing.\n\n\nThat means:\n\n  * the model does not decide whether a claim is allowed;\n  * the model does not decide whether a page is publishable;\n  * the model does not decide whether two pages are too similar;\n  * the model does not silently change the data contract;\n  * the model does not erase metadata needed for debugging.\n\n\n\nThe app should own those things.\n\n* * *\n\n## Example prompt template\n\n\n    SYSTEM:\n    You write local service marketplace profile descriptions.\n\n    Hard rules:\n    - Use only the facts in FACT_PACK.\n    - Do not invent credentials, insurance, certifications, awards, reviews, rankings, guarantees, service areas, availability, or business history.\n    - If a fact is missing, omit it naturally.\n    - Avoid generic SEO filler.\n    - Avoid clichés.\n    - Keep the description useful to a real customer comparing providers.\n\n    Return JSON matching OUTPUT_SCHEMA.\n\n    FACT_PACK:\n    <FACT_PACK>\n\n    CONTENT_PLAN:\n    <CONTENT_PLAN>\n\n    OUTPUT_SCHEMA:\n    <OUTPUT_SCHEMA>\n\n\nExample output schema:\n\n\n    {\n      \"type\": \"object\",\n      \"properties\": {\n        \"final_description\": {\n          \"type\": \"string\"\n        },\n        \"included_facts\": {\n          \"type\": \"array\",\n          \"items\": {\"type\": \"string\"}\n        },\n        \"unsupported_claims\": {\n          \"type\": \"array\",\n          \"items\": {\"type\": \"string\"}\n        },\n        \"style_notes\": {\n          \"type\": \"array\",\n          \"items\": {\"type\": \"string\"}\n        }\n      },\n      \"required\": [\n        \"final_description\",\n        \"included_facts\",\n        \"unsupported_claims\"\n      
]\n    }\n\n\n* * *\n\n## Example validator contract\n\n\n    {\n      \"factuality\": {\n        \"status\": \"pass\",\n        \"unsupported_claims\": []\n      },\n      \"forbidden_claims\": {\n        \"status\": \"pass\",\n        \"violations\": []\n      },\n      \"style\": {\n        \"status\": \"fail\",\n        \"issues\": [\n          \"Contains generic phrase: 'for all your needs'\"\n        ]\n      },\n      \"duplicate\": {\n        \"status\": \"pass\",\n        \"nearest_output_id\": \"<OUTPUT_ID>\",\n        \"similarity\": 0.78\n      },\n      \"recommended_action\": \"repair\"\n    }\n\n\nThis kind of object is much easier to debug than a plain paragraph.\n\n* * *\n\n## Model choice\n\nFor the writer model, I would shortlist a few reputable HF models that run well under vLLM and evaluate them with the above pipeline.\n\nI would not choose based only on public chat benchmarks.\n\nI would choose based on:\n\n  * schema pass rate,\n  * factuality pass rate,\n  * repair rate,\n  * duplicate rate,\n  * style pass rate,\n  * latency,\n  * throughput,\n  * cost,\n  * operational stability.\n\n\n\nThe best model for this pipeline is the one that produces the highest rate of valid, useful, non-duplicative outputs after the full validation pipeline, not necessarily the one that writes the most impressive one-off paragraph.\n\n* * *\n\n## What I would avoid\n\nI would avoid this:\n\n\n    One API endpoint:\n      input row → prompt → final paragraph → publish\n\n\nIt is too hard to debug and too easy to scale mistakes.\n\nI would also avoid:\n\n\n    Pick a strong model and trust the prompt.\n\n\nPrompts are important, but prompts are not enforcement.\n\nI would avoid publishing all generated outputs automatically before you have at least:\n\n  * factuality validation,\n  * banned-claim checks,\n  * duplicate checks,\n  * evals,\n  * sampled human review,\n  * versioned logs.\n\n\n\n* * *\n\n## Practical minimal version\n\nIf you want a minimal 
version, I would build this first:\n\n\n    1. CSV or database rows\n    2. normalize into Pydantic schema\n    3. create fact pack\n    4. call vLLM writer model\n    5. validate JSON output\n    6. run banned-phrase checks\n    7. run LLM factuality verifier\n    8. embed final text\n    9. check nearest neighbors in pgvector\n    10. save output + validation metadata\n    11. export approved rows\n\n\nThis is already much safer than one-shot generation.\n\n* * *\n\n## Final recommendation\n\nI would use vLLM as the serving layer and keep HF models interchangeable.\n\nThen I would invest most of the effort in:\n\n  * input normalization,\n  * fact packs,\n  * structured outputs,\n  * validation,\n  * repair loops,\n  * duplicate detection,\n  * evals,\n  * audit logs,\n  * conservative publishing rules.\n\n\n\nThat makes the system more robust than trying to find one magic model.\n\nThe model matters, but the pipeline matters more.\n\nA good model inside a weak pipeline will still hallucinate, duplicate, and drift.\n\nA decent model inside a strong pipeline can be measured, repaired, compared, and replaced.",
  "title": "Need generative model, high-quality description generation"
}