External Publication
Visit Post

Need generative model, high-quality description generation

Hugging Face Forums [Unofficial] May 27, 2026
Source

If you don’t insist on having the LLM complete everything by itself, this may be simpler:


Short answer

I would treat this less as a “find the perfect generative model” problem and more as a pipeline design problem.

For this kind of description-generation task, I would probably use vLLM as the inference backend, run one or more reputable Hugging Face models behind it, and put most of the engineering effort into the surrounding pipeline:

  1. normalize the structured input,
  2. build a fact pack,
  3. generate a content plan,
  4. generate the description,
  5. validate factuality,
  6. validate style and banned claims,
  7. check near-duplicates,
  8. repair or regenerate,
  9. log model/prompt/schema versions and eval results.

The model still matters, of course. But if the task is to generate thousands of profile/location/service descriptions, the main risk is usually not “the paragraph is not poetic enough.” The main risks are:

  • unsupported facts,
  • generic filler,
  • near-duplicate pages,
  • unsafe claims,
  • SEO-thin pages,
  • inability to compare model/prompt changes later.

So I would keep the model swappable and make the pipeline the main product.

Useful references:

  • vLLM OpenAI-compatible server
  • vLLM structured outputs
  • Anthropic: Building Effective Agents
  • OpenAI: Evaluation best practices
  • OpenAI: Working with evals
  • OpenAI Cookbook: Eval driven system design
  • Google Search: AI-generated content guidance
  • Google Search spam policies: scaled content abuse

Why I would not optimize only for “the best model”

There are many decent open models on Hugging Face now. Some Qwen, Llama, Mistral, Gemma, and Command-family models can produce good profile or marketing prose.

But for this use case, a better model alone does not solve the main operational problems.

A stronger model may still:

  • hallucinate credentials,
  • add unsupported service areas,
  • overstate experience,
  • invent availability,
  • invent review quality,
  • produce generic SEO-ish filler,
  • repeat similar sentence structures across thousands of pages,
  • silently change behavior after a model, prompt, or runtime update,
  • produce good-looking text that fails a business rule.

That is why I would avoid a pure “model shootout” approach.

A model shootout is still useful, but only after defining task-specific evals. General benchmark strength is not the same as quality on this exact task.

OpenAI’s eval guidance is useful here because it frames evals as a way to test AI systems despite generative variability:

  • OpenAI: Evaluation best practices
  • OpenAI: Working with evals
  • OpenAI Evals GitHub

Hamel Husain’s writing is also useful from a practical engineering point of view:

  • Your AI Product Needs Evals
  • Creating a LLM-as-a-Judge That Drives Business Results

Chip Huyen’s production LLM article is also a good reference for the idea that LLM applications should be tested as systems, not just prompts:

  • Building LLM applications for production

The short version:

Do not ask “which model writes the nicest description?” first. Ask “which pipeline reliably turns structured facts into useful, factual, non-duplicative descriptions?”


Proposed backend shape

I would use this architecture:

Admin / API
  ↓
FastAPI
  ↓
Postgres
  ↓
Celery or Temporal
  ↓
Workers
  ├─ normalize_input
  ├─ build_fact_pack
  ├─ generate_content_plan
  ├─ generate_description
  ├─ fact_check
  ├─ style_check
  ├─ duplicate_check
  ├─ repair_or_regenerate
  └─ publish_or_export
        ↓
vLLM OpenAI-compatible server
        ↓
HF model weights

Suggested starting stack:

Inference:
  vLLM

API:
  FastAPI

Database:
  Postgres

Vector similarity:
  pgvector

Queue / jobs:
  Celery + Redis for MVP
  Temporal later if workflows become complex

Validation:
  Pydantic
  Instructor or similar structured-output helper

Storage:
  S3 / R2 / MinIO

Monitoring:
  structured logs
  token/latency/cost counters
  eval dashboards

Why vLLM?

vLLM gives you an OpenAI-compatible HTTP server, which makes it easier to keep your application code stable while swapping the underlying HF model:

  • vLLM OpenAI-compatible server

It also supports structured outputs, which is useful if you want the model to return a schema like this:

{
  "content_plan": {
    "angle": "experienced bilingual local technician",
    "paragraphs": [
      "Introduce the service and location",
      "Mention supported skills and experience",
      "Close with practical customer benefit"
    ]
  },
  "included_facts": [
    "Austin, TX",
    "7 years of experience",
    "washer repair",
    "dryer repair",
    "$85/hour"
  ],
  "unsupported_claims": [],
  "final_description": "<generated description>"
}

Reference:

  • vLLM structured outputs

The point is not that structured output magically guarantees truth. It does not. The point is that it gives the rest of your application something inspectable.


Why a pipeline fits this task better than one-shot generation

This task is a good match for a fixed workflow.

Anthropic’s “Building Effective Agents” post is useful here because it separates relatively deterministic workflows from more open-ended agents. In particular, it describes:

  • prompt chaining,
  • routing,
  • parallelization,
  • orchestrator-workers,
  • evaluator-optimizer.

Reference:

  • Anthropic: Building Effective Agents

For this problem, I would use something closer to prompt chaining and evaluator-optimizer , not a fully autonomous agent.

A simple generation pipeline might look like this:

Raw row
  ↓
Normalized facts
  ↓
Fact pack
  ↓
Content plan
  ↓
Draft description
  ↓
Factuality check
  ↓
Style / banned-claim check
  ↓
Duplicate check
  ↓
Repair or regenerate
  ↓
Approved output

That is easier to test than a giant prompt that says:

Write a unique, high-quality, SEO-friendly, factual local service description.

The giant prompt may work for 20 examples. It is much less safe for 10,000+ examples.


Step 1: Normalize the input first

Before calling the LLM, normalize the input into a strict schema.

Example:

{
  "profile_id": "<PROFILE_ID>",
  "service": "appliance repair",
  "city": "Austin",
  "state": "TX",
  "rate": {
    "amount": 85,
    "currency": "USD",
    "unit": "hour"
  },
  "experience_years": 7,
  "skills": [
    "washer repair",
    "dryer repair",
    "refrigerator diagnostics"
  ],
  "languages": [
    "English",
    "Spanish"
  ],
  "certifications": [],
  "insurance": null,
  "reviews_summary": null
}

This is not just cleanup. It prevents the model from guessing what missing fields mean.

For example:

  • if certifications is empty, do not allow “certified”;
  • if insurance is null, do not allow “insured”;
  • if reviews_summary is null, do not allow “highly reviewed” or “5-star”;
  • if no availability is provided, do not allow “same-day service”;
  • if no service radius is provided, do not invent nearby cities.

The LLM should receive not only the raw facts but also the allowed and forbidden claims.


Step 2: Build a fact pack

I would explicitly build a fact pack before writing.

Example:

{
  "allowed_claims": [
    "The provider offers appliance repair in Austin, TX.",
    "The provider has 7 years of experience.",
    "The provider handles washer repair, dryer repair, and refrigerator diagnostics.",
    "The provider speaks English and Spanish.",
    "The listed rate is $85/hour."
  ],
  "forbidden_claims": [
    "licensed",
    "insured",
    "certified",
    "top-rated",
    "best in Austin",
    "guaranteed same-day service",
    "5-star reviews",
    "background checked",
    "family-owned",
    "emergency service"
  ],
  "missing_fields": [
    "certifications",
    "insurance",
    "reviews",
    "availability",
    "service_radius"
  ]
}

This makes the generation task much easier:

Write a description using only these allowed claims. Do not use any forbidden claims. Omit missing facts naturally.

This is also useful for auditing later.

If a generated page says “insured”, you can check whether insured was ever present in the fact pack. If it was not, the output is invalid.


Step 3: Generate a content plan before final prose

Instead of asking for the final description immediately, ask the model to make a small plan.

Example output:

{
  "angle": "practical local appliance repair help",
  "paragraph_plan": [
    {
      "goal": "Introduce service, location, and main skills",
      "facts_to_use": ["service", "city", "state", "skills"]
    },
    {
      "goal": "Mention experience and rate without sounding salesy",
      "facts_to_use": ["experience_years", "rate"]
    },
    {
      "goal": "Close with a customer-oriented sentence",
      "facts_to_use": ["languages"]
    }
  ],
  "style_constraints": [
    "professional",
    "plainspoken",
    "no exaggerated marketing claims",
    "no unsupported credentials"
  ]
}

This intermediate step gives you something to validate before prose generation.

If the plan already includes “certified technician” but the fact pack has no certification, reject the plan before generating the final text.


Step 4: Generate the description

Then generate the actual description.

Example prompt shape:

You write local service marketplace profile descriptions.

Use ONLY the facts in FACT_PACK.
Do not invent credentials, awards, insurance, guarantees, reviews, availability, service radius, or ranking claims.
If a fact is missing, omit it naturally.

Write in a warm, professional, human style.
Avoid clichés such as:
- dedicated professional
- top-notch
- go-to expert
- best in the area
- unparalleled service
- committed to excellence

Return JSON matching OUTPUT_SCHEMA.

FACT_PACK:
<FACT_PACK>

CONTENT_PLAN:
<CONTENT_PLAN>

OUTPUT_SCHEMA:
<OUTPUT_SCHEMA>

This is more controllable than:

Write a high-quality profile description.

Step 5: Validate factuality

After generating the description, validate it.

I would start with a combination of:

  1. deterministic checks,
  2. schema checks,
  3. LLM-based claim checking,
  4. sampled human review.

Example deterministic check:

BANNED_PHRASES = [
    "licensed",
    "insured",
    "certified",
    "top-rated",
    "best",
    "guaranteed",
    "same-day",
    "5-star",
    "award-winning",
]

def banned_phrase_check(text: str, allowed_claims: list[str]) -> list[str]:
    violations = []
    lower_text = text.lower()

    for phrase in BANNED_PHRASES:
        if phrase in lower_text and not any(phrase in claim.lower() for claim in allowed_claims):
            violations.append(phrase)

    return violations

Example LLM verifier output:

{
  "status": "fail",
  "unsupported_claims": [
    {
      "claim": "offers same-day service",
      "reason": "availability was not present in the input facts"
    }
  ],
  "missing_required_facts": [],
  "recommended_action": "repair"
}

This is where an evaluator-optimizer pattern becomes useful:

  • writer generates,
  • verifier checks,
  • repair model fixes only the invalid parts,
  • final validator runs again.

Useful references:

  • Anthropic: Building Effective Agents
  • Hamel Husain: LLM-as-a-Judge
  • Pydantic AI output docs
  • Instructor
  • Guardrails AI

Important caveat: do not blindly trust an LLM judge. Use it as one signal. For critical rules, use deterministic checks too.


Step 6: Validate style

The style checker should not only ask “is this good writing?”

It should check task-specific failure modes:

  • Does it sound like generic SEO filler?
  • Does it repeat common marketing clichés?
  • Is it too similar to the template?
  • Does it overpromise?
  • Does it mention unavailable facts?
  • Is it useful to a real customer?

Example style checker output:

{
  "status": "fail",
  "issues": [
    {
      "type": "cliche",
      "span": "dedicated professional",
      "reason": "overused generic phrase"
    },
    {
      "type": "thin_content",
      "span": "provides quality service for all your needs",
      "reason": "generic phrase that adds no profile-specific value"
    }
  ],
  "recommended_action": "repair"
}

Repair prompt:

Revise the description to remove the listed style issues.
Do not add new facts.
Preserve all valid factual claims.
Do not change city, state, service, rate, years of experience, skills, or languages.

DESCRIPTION:
<DESCRIPTION>

STYLE_ISSUES:
<STYLE_ISSUES>

Step 7: Check duplicates and near-duplicates

For 10,000+ generated pages, exact duplicates are not the only problem.

You also need to catch near-duplicates like:

  • same paragraph structure,
  • same opening line with only city/service swapped,
  • same conclusion sentence,
  • same generic claims,
  • same semantic content in different words.

I would use multiple layers:

Layer 1:
  normalized text hash

Layer 2:
  n-gram overlap

Layer 3:
  embedding similarity

Layer 4:
  same city + same service group comparison

Layer 5:
  sampled human review

For embedding similarity, pgvector is a practical starting point because it lets you store vectors alongside normal Postgres data.

Reference:

  • pgvector

Example table:

CREATE TABLE profile_outputs (
    id BIGSERIAL PRIMARY KEY,
    profile_id TEXT NOT NULL,
    service TEXT NOT NULL,
    city TEXT NOT NULL,
    state TEXT NOT NULL,
    output_text TEXT NOT NULL,
    embedding vector(768),
    model_repo TEXT NOT NULL,
    model_revision TEXT,
    prompt_version TEXT NOT NULL,
    schema_version TEXT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now()
);

Example duplicate check:

SELECT
    id,
    profile_id,
    service,
    city,
    output_text,
    embedding <=> <QUERY_EMBEDDING> AS cosine_distance
FROM profile_outputs
WHERE service = <SERVICE>
  AND city = <CITY>
ORDER BY embedding <=> <QUERY_EMBEDDING>
LIMIT 10;

The exact thresholds need empirical tuning. For example:

if exact_hash_match:
    reject

if ngram_overlap > 0.65:
    regenerate

if embedding_similarity > 0.92 and same_service:
    regenerate_with_new_angle

if embedding_similarity > 0.86 and same_city_same_service:
    send_to_review

The thresholds above are placeholders, not universal constants.


Step 8: Use evals before model selection

I would select models only after defining task-specific evals.

Example eval set:

Eval What it catches Example failure
Schema validity Invalid JSON or missing fields no final_description field
Factuality Claims not in input “insured” when not provided
Required facts Important facts omitted city or service missing
Forbidden claims Risky words or claims “best”, “certified”, “guaranteed”
Style Generic filler “for all your needs”
Duplication Too similar to existing pages same paragraph pattern
Helpfulness Thin or useless page no concrete differentiating facts

Model comparison should then look like:

Model A:
  factuality pass: 94%
  schema pass: 98%
  duplicate fail: 12%
  style fail: 18%
  average repair attempts: 0.7
  average latency: 1.2s
  average output tokens: 170

Model B:
  factuality pass: 91%
  schema pass: 99%
  duplicate fail: 6%
  style fail: 10%
  average repair attempts: 0.5
  average latency: 1.8s
  average output tokens: 165

This is much more useful than:

Model A sounds better than Model B in a few examples.

Useful references:

  • OpenAI: Evaluation best practices
  • OpenAI: Working with evals
  • OpenAI Cookbook: Eval driven system design
  • Hamel Husain: Your AI Product Needs Evals
  • Promptfoo

Step 9: Keep model/prompt/schema versions

Save enough metadata to reproduce or debug each output.

Minimum metadata:

{
  "profile_id": "<PROFILE_ID>",
  "output_id": "<OUTPUT_ID>",
  "model_repo": "<MODEL_REPO>",
  "model_revision": "<MODEL_REVISION>",
  "runtime": "vLLM",
  "runtime_version": "<VLLM_VERSION>",
  "prompt_version": "<PROMPT_VERSION>",
  "schema_version": "<SCHEMA_VERSION>",
  "temperature": 0.4,
  "top_p": 0.9,
  "max_tokens": 500,
  "input_hash": "<INPUT_HASH>",
  "fact_pack_hash": "<FACT_PACK_HASH>",
  "created_at": "<TIMESTAMP>"
}

For HF models, I would also pin the model revision or commit when testing and recording results.

Reference:

  • Hugging Face Hub: download files and pin revisions

This matters because otherwise you cannot answer:

  • Did quality change because the model changed?
  • Did the prompt change?
  • Did the input data change?
  • Did the validation rules change?
  • Did the runtime change?
  • Which outputs need regeneration?

Step 10: Be careful with SEO / programmatic content

If this is for many local-service pages, do not think only about naturalness.

Think about usefulness and uniqueness.

Google’s guidance is important here. Google says generative AI can be useful for research and structuring content, but using generative AI or similar tools to generate many pages without adding value for users may violate its scaled content abuse policy.

References:

  • Google Search: AI-generated content guidance
  • Google Search spam policies: scaled content abuse
  • Google Search: helpful, reliable, people-first content
  • Google Search update: March 2024

So I would not frame the pipeline as:

Generate lots of unique-looking pages.

I would frame it as:

Generate useful profile descriptions from real structured facts,
reject unsupported claims,
detect thin/duplicative pages,
and avoid publishing pages that do not add user value.

For programmatic SEO context, these are useful:

  • Ahrefs: Programmatic SEO
  • Ahrefs: Duplicate content

For local-service profile pages, the page should ideally have real differentiators, not just paraphrased boilerplate:

  • service category,
  • city and state,
  • actual skills,
  • years of experience,
  • real rate information,
  • real credentials if available,
  • real languages,
  • real availability if available,
  • real review summary if available,
  • real examples of work if available.

If most rows do not contain enough differentiating data, the pipeline should not hide that problem with fluent prose. It should flag those rows as low-information.


Suggested implementation path

I would start small.

Phase 1: Offline evaluation

Take 100–300 representative rows.

Include edge cases:

  • missing rate,
  • missing experience,
  • many skills,
  • only one skill,
  • no certifications,
  • has certification,
  • multiple languages,
  • high-overlap rows,
  • same city and service,
  • sparse profiles.

Run 2–4 candidate HF models behind vLLM.

Do not judge only by reading samples. Run evals.

Outputs from this phase:

- prompt v1
- fact schema v1
- output schema v1
- validation rules v1
- duplicate thresholds v0
- model comparison table
- human review notes

Phase 2: MVP backend

Build:

FastAPI
Postgres
pgvector
Celery + Redis
vLLM
Pydantic / Instructor

Celery is a reasonable MVP queue because it is a mature distributed task queue:

  • Celery documentation

Postgres + pgvector is enough for initial metadata + vector similarity:

  • pgvector

Phase 3: Add repair loops and review queues

Add statuses like:

pending
generating
validating
repairing
duplicate_check
review_required
approved
rejected
published

Add separate queues:

generation
validation
embedding
repair
export

Add max attempt counts:

max_generation_attempts: 3
max_repair_attempts: 2
human_review_after: 2 failed repair attempts

Phase 4: Move to durable workflows if needed

If the workflow becomes more complex, Temporal may be a better fit than Celery for the whole process.

Temporal is useful when you need durable execution, retries, and recovery across long-running workflows:

  • Temporal
  • Temporal Python SDK error handling

I would not necessarily start with Temporal if the team wants a quick MVP. But if human review, partial reruns, repair loops, and auditability become central, Temporal becomes attractive.


Example pipeline contract

A useful contract is:

The model is allowed to write prose.
The application owns facts, rules, validation, retries, and publishing.

That means:

  • the model does not decide whether a claim is allowed;
  • the model does not decide whether a page is publishable;
  • the model does not decide whether two pages are too similar;
  • the model does not silently change the data contract;
  • the model does not erase metadata needed for debugging.

The app should own those things.


Example prompt template

SYSTEM:
You write local service marketplace profile descriptions.

Hard rules:
- Use only the facts in FACT_PACK.
- Do not invent credentials, insurance, certifications, awards, reviews, rankings, guarantees, service areas, availability, or business history.
- If a fact is missing, omit it naturally.
- Avoid generic SEO filler.
- Avoid clichés.
- Keep the description useful to a real customer comparing providers.

Return JSON matching OUTPUT_SCHEMA.

FACT_PACK:
<FACT_PACK>

CONTENT_PLAN:
<CONTENT_PLAN>

OUTPUT_SCHEMA:
<OUTPUT_SCHEMA>

Example output schema:

{
  "type": "object",
  "properties": {
    "final_description": {
      "type": "string"
    },
    "included_facts": {
      "type": "array",
      "items": {"type": "string"}
    },
    "unsupported_claims": {
      "type": "array",
      "items": {"type": "string"}
    },
    "style_notes": {
      "type": "array",
      "items": {"type": "string"}
    }
  },
  "required": [
    "final_description",
    "included_facts",
    "unsupported_claims"
  ]
}

Example validator contract

{
  "factuality": {
    "status": "pass",
    "unsupported_claims": []
  },
  "forbidden_claims": {
    "status": "pass",
    "violations": []
  },
  "style": {
    "status": "fail",
    "issues": [
      "Contains generic phrase: 'for all your needs'"
    ]
  },
  "duplicate": {
    "status": "pass",
    "nearest_output_id": "<OUTPUT_ID>",
    "similarity": 0.78
  },
  "recommended_action": "repair"
}

This kind of object is much easier to debug than a plain paragraph.


Model choice

For the writer model, I would shortlist a few reputable HF models that run well under vLLM and evaluate them with the above pipeline.

I would not choose based only on public chat benchmarks.

I would choose based on:

  • schema pass rate,
  • factuality pass rate,
  • repair rate,
  • duplicate rate,
  • style pass rate,
  • latency,
  • throughput,
  • cost,
  • operational stability.

The best model for this pipeline is the one that produces the highest rate of valid, useful, non-duplicative outputs after the full validation pipeline, not necessarily the one that writes the most impressive one-off paragraph.


What I would avoid

I would avoid this:

One API endpoint:
  input row → prompt → final paragraph → publish

It is too hard to debug and too easy to scale mistakes.

I would also avoid:

Pick a strong model and trust the prompt.

Prompts are important, but prompts are not enforcement.

I would avoid publishing all generated outputs automatically before you have at least:

  • factuality validation,
  • banned-claim checks,
  • duplicate checks,
  • evals,
  • sampled human review,
  • versioned logs.

Practical minimal version

If you want a minimal version, I would build this first:

1. CSV or database rows
2. normalize into Pydantic schema
3. create fact pack
4. call vLLM writer model
5. validate JSON output
6. run banned-phrase checks
7. run LLM factuality verifier
8. embed final text
9. check nearest neighbors in pgvector
10. save output + validation metadata
11. export approved rows

This is already much safer than one-shot generation.


Final recommendation

I would use vLLM as the serving layer and keep HF models interchangeable.

Then I would invest most of the effort in:

  • input normalization,
  • fact packs,
  • structured outputs,
  • validation,
  • repair loops,
  • duplicate detection,
  • evals,
  • audit logs,
  • conservative publishing rules.

That makes the system more robust than trying to find one magic model.

The model matters, but the pipeline matters more.

A good model inside a weak pipeline will still hallucinate, duplicate, and drift.

A decent model inside a strong pipeline can be measured, repaired, compared, and replaced.

Discussion in the ATmosphere

Loading comments...