
Need generative model, high-quality description generation

Hugging Face Forums [Unofficial] May 28, 2026

Oh. If your existing production stack is already mostly settled, you can safely treat my earlier vLLM comments as just a from-scratch architecture example and skip that part. The more important point is this: if you use raw LLM responses directly, it is hard to keep quality stable at scale. In many cases, the basic pattern is to put a layer between the model output and the published page, usually by having the model produce structured output first.


Short version

If I were building this, I would not start by replacing your backend.

You already have:

  • PostgreSQL
  • Java / Spring Boot
  • React / Node.js
  • AWS hosting
  • a production app
  • fixed operator-page UI sections
  • an API-based plan using OpenRouter or similar providers

So I would keep that stack and add an asynchronous profile content lifecycle around it.

The core flow would be:

operator data
  ↓
normalized facts
  ↓
fact pack
  ↓
structured generation
  ↓
validation report
  ↓
duplicate / SEO quality checks
  ↓
repair or review
  ↓
public-unverified / verified / published content

The model writes prose. The application owns facts, consistency, validation, duplicate detection, and publishing decisions.

That distinction matters. Even a very capable LLM can produce good-looking but invalid text if the raw output is used directly. I would treat model output as a draft, not as the production artifact.

Useful references:

  • OpenRouter structured outputs
  • OpenRouter API reference
  • Anthropic: Building Effective Agents
  • OpenAI: Working with evals
  • OpenAI: Evaluation best practices
  • Google Search: AI-generated content guidance
  • Google Search spam policies: scaled content abuse

1. Keep the backend, make the LLM provider an adapter

I would not move to a new backend unless there is a strong reason.

Spring Boot can remain the source of truth. PostgreSQL can store raw operator data, generation jobs, generated versions, validation results, review states, and publication states.

The LLM provider should be an adapter:

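// Provider-agnostic port: the application depends on this interface, not on a specific LLM vendor.
// ProfileFactPack, GenerationConfig, and GeneratedProfile are application-defined types.
interface ProfileGenerationClient {
    GeneratedProfile generate(ProfileFactPack factPack, GenerationConfig config);
}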

Initial implementation:

OpenRouterProfileGenerationClient

Possible future implementations:

DirectProviderClient
InternalFineTunedModelClient
SelfHostedModelClient

For every generation, I would store metadata:

{
  "provider": "openrouter",
  "requested_model": "<MODEL_ID>",
  "resolved_model": "<RESOLVED_MODEL_IF_AVAILABLE>",
  "prompt_version": "profile_prompt_v7",
  "schema_version": "operator_profile_schema_v3",
  "fact_pack_version": "fact_pack_v2",
  "temperature": 0.3,
  "max_tokens": 1200,
  "input_hash": "<INPUT_HASH>",
  "fact_pack_hash": "<FACT_PACK_HASH>",
  "output_hash": "<OUTPUT_HASH>"
}

Without this, it becomes difficult to debug quality changes later.
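
One way the `*_hash` fields above could be produced is sketched here, assuming SHA-256 over a canonical string form of the input (for example, key-sorted JSON). The class name `GenerationHashes` and the canonicalization step are illustrative assumptions, not part of OpenRouter or any other provider API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: hashes an already-canonicalized string so that
// identical inputs always map to the same input_hash / fact_pack_hash.
final class GenerationHashes {
    private GenerationHashes() {}

    static String sha256Hex(String canonicalText) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(canonicalText.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

The hash is only stable if the text is canonicalized first. Two JSON documents with the same fields in a different key order would otherwise hash differently.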


2. Start with the content contract

Before prompt engineering, I would define the exact output contract.

Since your UI is fixed, the model should not return arbitrary prose. It should return structured content for your fixed sections.

Example:

{
  "bio": "...",
  "services_offered": [
    {
      "name": "...",
      "description": "...",
      "source_fact_ids": ["skill_12", "category_3"]
    }
  ],
  "service_areas": [
    {
      "name": "Austin, TX",
      "source_fact_ids": ["location_primary"]
    }
  ],
  "faqs": [
    {
      "question": "...",
      "answer": "...",
      "source_fact_ids": ["skill_12", "rate_1"]
    }
  ],
  "seo": {
    "title": "...",
    "meta_description": "..."
  },
  "claims_used": [
    {
      "claim": "The operator provides appliance repair in Austin, TX.",
      "source_fact_ids": ["category_3", "location_primary"]
    }
  ],
  "unsupported_claims": [],
  "risk_flags": []
}

The important part is source_fact_ids.

The model should not only write text. It should say which input facts support the generated claim. That makes downstream validation much easier.

OpenRouter structured outputs can help enforce the response shape:

  • OpenRouter structured outputs

But structured output is not the same as factual output.

This can be valid JSON and still be business-invalid:

{
  "bio": "Austin-based certified appliance repair specialist with same-day service.",
  "claims_used": ["certified", "same-day service"],
  "unsupported_claims": []
}

If the operator did not provide certification or availability facts, that content should be rejected even if the JSON is valid.

So I would split validation into:

JSON/schema validation:
  checks shape

business validation:
  checks factuality, forbidden claims, duplicates, SEO risk, and publishability
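
A minimal sketch of one business check, assuming claims have already been parsed out of the model's JSON into a map of claim text to `source_fact_ids`. The class `ClaimGroundingValidator` and its method are hypothetical names. It covers only the grounding part of business validation, not duplicates or SEO risk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical business check: every claim must cite at least one fact id,
// and every cited id must exist in the fact pack.
final class ClaimGroundingValidator {
    private ClaimGroundingValidator() {}

    // claims: claim text -> source_fact_ids reported by the model.
    // Returns the claims that are not fully backed by allowedFactIds.
    static List<String> unsupportedClaims(Map<String, List<String>> claims,
                                          Set<String> allowedFactIds) {
        List<String> unsupported = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : claims.entrySet()) {
            List<String> cited = entry.getValue();
            boolean grounded = !cited.isEmpty() && allowedFactIds.containsAll(cited);
            if (!grounded) {
                unsupported.add(entry.getKey());
            }
        }
        return unsupported;
    }
}
```

Note that this only proves the cited ids exist. Whether the cited fact actually supports the wording of the claim still needs a separate check.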

3. Build a fact pack before generation

I would not send the raw operator record directly to the model.

Convert raw operator data into a fact pack first.

Example:

{
  "operator_id": "op_123",
  "allowed_facts": [
    {
      "id": "service_primary",
      "type": "service",
      "value": "appliance repair"
    },
    {
      "id": "location_primary",
      "type": "location",
      "value": "Austin, TX"
    },
    {
      "id": "experience_years",
      "type": "experience",
      "value": 7
    },
    {
      "id": "skill_1",
      "type": "skill",
      "value": "washer repair"
    }
  ],
  "forbidden_claims": [
    "licensed",
    "insured",
    "certified",
    "top-rated",
    "best",
    "guaranteed",
    "same-day service",
    "24/7 emergency service",
    "5-star reviews"
  ],
  "missing_fact_classes": [
    "insurance",
    "certifications",
    "reviews",
    "availability",
    "service_radius"
  ],
  "content_limits": {
    "max_bio_words": 140,
    "max_faq_count": 2,
    "allow_faq": true
  }
}

Missing data should become explicit constraints.

For example:

insurance = null

should become:

Do not claim insured.

And:

reviews_summary = null

should become:

Do not claim highly reviewed, 5-star, top-rated, or customer-loved.

The model should not decide what missing data means. The application should decide.
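
The null-to-constraint mapping could look roughly like this. It is a sketch under assumptions: the field names `insurance`, `certifications`, and `reviews_summary` follow the examples above, while `MissingFactPolicy` and its rule table are hypothetical and would need to match your real operator schema.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical mapping from a missing operator field to the phrases it forbids.
final class MissingFactPolicy {
    private static final Map<String, List<String>> FORBIDDEN_WHEN_MISSING = new LinkedHashMap<>();
    static {
        FORBIDDEN_WHEN_MISSING.put("insurance", List.of("insured"));
        FORBIDDEN_WHEN_MISSING.put("certifications", List.of("certified"));
        FORBIDDEN_WHEN_MISSING.put("reviews_summary",
                List.of("highly reviewed", "5-star", "top-rated", "customer-loved"));
    }

    private MissingFactPolicy() {}

    // A field that is absent or mapped to null counts as missing.
    static List<String> forbiddenClaims(Map<String, Object> operatorFields) {
        List<String> forbidden = new ArrayList<>();
        for (Map.Entry<String, List<String>> rule : FORBIDDEN_WHEN_MISSING.entrySet()) {
            if (operatorFields.get(rule.getKey()) == null) {
                forbidden.addAll(rule.getValue());
            }
        }
        return forbidden;
    }
}
```

The result would feed the `forbidden_claims` list of the fact pack, so the decision is made in application code before the model sees anything.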


4. Use a multi-step generation flow

I would avoid this:

input row → one prompt → final paragraph → publish

That is fragile at scale.

I would use a workflow:

1. Normalize operator input
2. Build fact pack
3. Decide content policy
4. Generate content plan
5. Validate content plan
6. Generate structured profile JSON
7. Validate schema
8. Validate factuality
9. Validate forbidden claims
10. Validate SEO/content quality
11. Check duplicate / near-duplicate risk
12. Repair or regenerate
13. Decide publishing state
14. Store content version + validation report

This is close to the workflow patterns described by Anthropic, especially prompt chaining and evaluator-optimizer:

  • Anthropic: Building Effective Agents

The model should not own the whole workflow.

The model can write the words. The application should decide what is allowed, what is invalid, what needs review, and what can be published.
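
The generate, validate, and repair part of that workflow can be sketched as a bounded loop. This is an illustration only: `GenerationWorkflow`, `Outcome`, and the string decisions are hypothetical, and the generator and validator are passed in as plain functions standing in for the LLM call and the real validators.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical orchestration: the application, not the model, decides the outcome.
final class GenerationWorkflow {
    static final class Outcome {
        final String decision;   // "accept" or "review_required"
        final int attempts;

        Outcome(String decision, int attempts) {
            this.decision = decision;
            this.attempts = attempts;
        }
    }

    private GenerationWorkflow() {}

    // generator: produces a draft (stand-in for the LLM call).
    // validator: returns the problems found; an empty list means the draft passed.
    static Outcome run(Supplier<String> generator,
                       Function<String, List<String>> validator,
                       int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String draft = generator.get();
            if (validator.apply(draft).isEmpty()) {
                return new Outcome("accept", attempt);
            }
        }
        // Repeated failures go to a human instead of being published.
        return new Outcome("review_required", maxAttempts);
    }
}
```

A real repair step would likely pass the validator's problems back into the next prompt. This sketch simply regenerates.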


5. Insert a content-plan step

Before final content generation, I would ask for a plan.

Example:

{
  "bio_plan": {
    "angle": "practical local appliance repair help",
    "facts_to_use": [
      "service_primary",
      "location_primary",
      "experience_years",
      "skill_1"
    ],
    "facts_to_avoid": [
      "insurance",
      "certifications",
      "reviews",
      "availability"
    ]
  },
  "faq_plan": [
    {
      "question_type": "service_scope",
      "source_fact_ids": ["service_primary", "skill_1"]
    }
  ],
  "skip_sections": [
    {
      "section": "certifications",
      "reason": "no certification facts were provided"
    }
  ]
}

Then validate the plan before generating final copy.

If the plan already includes:

certified technician
same-day service
top-rated
5-star reviews

and those facts are not in the fact pack, reject the plan before the final content is generated.
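
A plan check along those lines might start as a simple phrase scan. This sketch assumes the plan has been flattened to text and that `forbiddenClaims` comes from the fact pack. `ForbiddenClaimScanner` is a hypothetical name, and plain substring matching is a deliberate simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical plan check: a case-insensitive substring scan.
// Real matching would likely need word boundaries and synonym handling.
final class ForbiddenClaimScanner {
    private ForbiddenClaimScanner() {}

    // Returns the forbidden claims that appear in the text, in list order.
    static List<String> violations(String text, List<String> forbiddenClaims) {
        String haystack = text.toLowerCase(Locale.ROOT);
        List<String> found = new ArrayList<>();
        for (String claim : forbiddenClaims) {
            if (haystack.contains(claim.toLowerCase(Locale.ROOT))) {
                found.add(claim);
            }
        }
        return found;
    }
}
```

Substring matching over-triggers in some cases (for example, "uncertified" contains "certified"), which is usually the safer failure mode for a gate like this.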


6. Store validation reports

For every generated profile, I would store a validation report.

Example:

{
  "schema": {
    "status": "pass",
    "errors": []
  },
  "factuality": {
    "status": "fail",
    "unsupported_claims": [
      {
        "claim": "insured",
        "reason": "insurance was not present in the fact pack"
      }
    ]
  },
  "forbidden_claims": {
    "status": "pass",
    "violations": []
  },
  "seo_quality": {
    "status": "warn",
    "issues": [
      "FAQ answer is generic",
      "bio uses low-specificity wording"
    ]
  },
  "duplication": {
    "status": "pass",
    "nearest_profile_id": "op_987",
    "similarity": 0.78
  },
  "decision": "repair"
}

This report is useful for:

  • debugging failed generations
  • explaining why a profile went to review
  • improving prompts
  • comparing models
  • building future evals
  • creating future fine-tuning or preference data

Without validation reports, you only have “the model wrote something.” With validation reports, you have a system you can improve.


7. Separate generated, public-unverified, verified, and published

I would not use READY to mean “trusted.”

I would separate these states:

| State | Meaning |
| --- | --- |
| GENERATED_READY | Generated and passed automated checks |
| PUBLIC_UNVERIFIED | Publicly visible, but not manually/proof verified |
| VERIFIED | Important operator facts have been verified |
| REVIEW_REQUIRED | Should not be auto-published |
| PUBLISHED | Currently rendered on the live page |

The key distinction:

generated != verified

Your idea of generating quickly and adding a verified tag later is reasonable. I would just make that distinction explicit in the data model and UI.

A profile can be generated in 1–2 minutes and shown as public-unverified. It can become verified later after proof, human review, or platform verification.


8. Use risk-based review, not full human-in-the-loop

I would not review every generated profile before publication unless the category is sensitive or legally risky.

Full human-in-the-loop can be too slow for onboarding.

Instead:

Auto-publish as PUBLIC_UNVERIFIED if:
  - schema is valid
  - no unsupported claims
  - no forbidden claims
  - duplicate score is low
  - fact density is sufficient
  - no suspicious operator patterns
  - no high-risk service category

Send to review if:

REVIEW_REQUIRED if:
  - unsupported claims were detected
  - forbidden claims were detected
  - duplicate similarity is high
  - sparse input produced long output
  - repeated repair attempts failed
  - operator data looks suspicious
  - service category is high risk

This keeps onboarding fast while still protecting quality.
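
The two rule lists above can be sketched as a single decision function. Everything here is an assumption for illustration: the `Signals` fields, the class name `PublishDecision`, and especially the threshold values, which should be calibrated on your own data. The "sparse input produced long output" rule is left out to keep the sketch short.

```java
// Hypothetical decision rule combining the review signals listed above.
final class PublishDecision {
    enum State { PUBLIC_UNVERIFIED, REVIEW_REQUIRED }

    static final class Signals {
        boolean schemaValid;
        int unsupportedClaims;
        int forbiddenClaims;
        double duplicateSimilarity;   // 0.0 to 1.0, higher means more similar
        int factCount;
        int failedRepairAttempts;
        boolean suspiciousOperator;
        boolean highRiskCategory;
    }

    // Illustrative thresholds only.
    static final double MAX_SIMILARITY = 0.90;
    static final int MIN_FACTS = 3;
    static final int MAX_FAILED_REPAIRS = 2;

    private PublishDecision() {}

    static State decide(Signals s) {
        boolean safe = s.schemaValid
                && s.unsupportedClaims == 0
                && s.forbiddenClaims == 0
                && s.duplicateSimilarity < MAX_SIMILARITY
                && s.factCount >= MIN_FACTS
                && s.failedRepairAttempts < MAX_FAILED_REPAIRS
                && !s.suspiciousOperator
                && !s.highRiskCategory;
        // Anything that is not clearly safe goes to review rather than being published.
        return safe ? State.PUBLIC_UNVERIFIED : State.REVIEW_REQUIRED;
    }
}
```

Defaulting to REVIEW_REQUIRED means a new, unhandled signal fails closed instead of silently auto-publishing.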


9. Treat SEO quality as a policy, not a prompt phrase

I would avoid making the main instruction:

Write SEO-friendly content.

That can produce filler, keyword stuffing, and city/service boilerplate.

I would define the target as:

useful, fact-grounded, operator-specific, non-duplicative content

Relevant Google references:

  • Google Search: AI-generated content guidance
  • Google Search spam policies: scaled content abuse
  • Google Search: helpful, reliable, people-first content
  • Google Search update: March 2024

The risk is not “AI wrote it.” The risk is generating many low-value, near-duplicate, weakly grounded pages.

SEO/content quality gate:

- Does this profile contain enough operator-specific facts?
- Are service areas supported by input data?
- Are FAQs grounded in actual facts?
- Is the title/meta keyword-stuffed?
- Is this page too similar to other city/service pages?
- Is this sparse profile being inflated into a long page?
- Should this page be short, noindex, or review-required until more facts are collected?

Most important rule:

Sparse inputs should produce short profiles, not inflated pages.

If the operator only provides a city and one service, do not generate a long bio and five FAQs. That creates both hallucination risk and SEO risk.


10. Measure uniqueness instead of asking for it

I would not rely on this instruction:

Write a unique description.

I would measure uniqueness.

| Layer | Check |
| --- | --- |
| 1 | normalized text hash |
| 2 | repeated phrase / sentence pattern |
| 3 | n-gram overlap |
| 4 | embedding similarity |
| 5 | same-city + same-service comparison |
| 6 | operator-data duplicate detection |

Since you already use PostgreSQL, pgvector is a practical option for vector similarity search.

Example:

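-- <=> is pgvector's cosine distance operator: lower means more similar.
-- <QUERY_EMBEDDING>, <SERVICE>, and <CITY> are placeholders for bind parameters.
SELECT
    id,
    operator_id,
    service,
    city,
    embedding <=> <QUERY_EMBEDDING> AS cosine_distance
FROM operator_profile_versions
WHERE service = <SERVICE>
  AND city = <CITY>
ORDER BY embedding <=> <QUERY_EMBEDDING>
LIMIT 10;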

Possible policy:

if exact_hash_match:
    reject

if ngram_overlap > threshold:
    regenerate

if embedding_similarity > threshold and same_city_same_service:
    review_required

if operator_data_duplicate_score > threshold:
    block_or_manual_review

The thresholds should come from your own data.

Key idea:

Uniqueness should be a measured property, not a prompt instruction.
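
Two of the measurements above, n-gram overlap and embedding similarity, can be sketched in application code. This assumes word-level n-grams with Jaccard overlap, which is one common choice and not the only one. `DuplicateMetrics` is a hypothetical helper, and the embedding vectors would come from whatever embedding model you already use.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical uniqueness metrics: word n-gram Jaccard overlap and cosine similarity.
final class DuplicateMetrics {
    private DuplicateMetrics() {}

    static Set<String> wordNgrams(String text, int n) {
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= words.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return grams;
    }

    // Jaccard overlap: shared n-grams divided by all distinct n-grams (0.0 if none).
    static double ngramOverlap(String a, String b, int n) {
        Set<String> gramsA = wordNgrams(a, n);
        Set<String> gramsB = wordNgrams(b, n);
        Set<String> union = new HashSet<>(gramsA);
        union.addAll(gramsB);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(gramsA);
        shared.retainAll(gramsB);
        return (double) shared.size() / union.size();
    }

    static double cosineSimilarity(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, normX = 0.0, normY = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        if (normX == 0.0 || normY == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normX) * Math.sqrt(normY));
    }
}
```

If you query pgvector with `<=>`, remember it returns cosine distance, so similarity is `1 - distance` when comparing against a similarity threshold.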

11. Make async generation reliable

Your workflow has this shape:

1. Save operator record in Postgres
2. Push generation job to queue

That creates a classic dual-write problem.

The DB write can succeed while queue publish fails. Or queue publish can happen twice. Or the worker can receive the same job more than once.

I would use the transactional outbox pattern:

  • AWS: Transactional outbox pattern

Flow:

Spring Boot transaction:
  - save operator record
  - insert generation_job
  - insert outbox_event

Outbox publisher:
  - reads unpublished outbox rows
  - sends message to SQS or worker queue
  - marks outbox row as published

Worker:
  - consumes job
  - checks idempotency key
  - builds fact pack
  - generates content
  - validates content
  - writes content version + validation report

If you use SQS Standard queues, design for at-least-once delivery. AWS documents that messages may be delivered more than once and consumers should be idempotent:

  • Amazon SQS at-least-once delivery
  • Amazon SQS queue types

Job payload:

{
  "job_id": "<JOB_ID>",
  "operator_id": "<OPERATOR_ID>",
  "input_hash": "<INPUT_HASH>",
  "fact_pack_hash": "<FACT_PACK_HASH>",
  "prompt_version": "<PROMPT_VERSION>",
  "schema_version": "<SCHEMA_VERSION>",
  "attempt_number": 1,
  "idempotency_key": "<IDEMPOTENCY_KEY>"
}
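
The worker's idempotency check can be illustrated with an in-memory guard. This is a sketch only: `IdempotentJobHandler` is a hypothetical class, and in a real deployment the same idea would more likely be a unique constraint on `idempotency_key` in PostgreSQL so that it survives restarts and works across worker instances.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory idempotency guard for at-least-once delivery.
final class IdempotentJobHandler {
    private final Set<String> completedKeys = new HashSet<>();

    // Returns true if the work ran, false if this key was already completed.
    synchronized boolean handle(String idempotencyKey, Runnable work) {
        if (completedKeys.contains(idempotencyKey)) {
            return false; // duplicate delivery: skip without redoing the work
        }
        work.run(); // if this throws, the key is not recorded, so a retry can run
        completedKeys.add(idempotencyKey);
        return true;
    }
}
```

Recording the key only after the work succeeds means a crashed attempt can be retried, at the cost of possibly repeating partial work. The work itself should therefore also be safe to repeat.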

12. Build private evals before choosing the model

Public leaderboards are useful for discovery, but they do not measure your exact task.

I would create an offline eval set:

100-300 real or representative operator records

Include difficult cases:

- rich operator data
- sparse operator data
- same city + same service
- missing rate
- missing experience
- missing insurance
- missing certifications
- no reviews
- ambiguous service area
- bot-like duplicate registrations

Evaluate models and prompts on:

schema_pass_rate
unsupported_claim_rate
forbidden_claim_rate
required_fact_inclusion_rate
duplicate_risk_rate
sparse_profile_inflation_rate
FAQ_grounding_rate
repair_attempts_per_accepted_output
human_acceptance_rate
latency
accepted_output_cost

References:

  • OpenAI: Working with evals
  • OpenAI: Evaluation best practices
  • OpenAI Evals GitHub
  • Promptfoo

Do not choose based on five nice-looking examples.

Choose based on accepted-output cost:

accepted_output_cost =
  first_generation_cost
  + repair_generation_cost
  + validation_cost
  + duplicate_regeneration_cost
  + human_review_cost (if triggered)

A cheaper model may be more expensive in production if it causes more repairs and reviews.
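
A small sketch of that comparison, assuming the cost components are totals for one eval run and are divided by the number of accepted outputs to get a per-output figure. The per-output normalization and the name `AcceptedOutputCost` are my assumptions. The components themselves are the ones listed above.

```java
// Hypothetical cost roll-up for one model over one eval run.
final class AcceptedOutputCost {
    private AcceptedOutputCost() {}

    // All costs are run totals in the same currency unit.
    static double perAcceptedOutput(double firstGenerationCost,
                                    double repairGenerationCost,
                                    double validationCost,
                                    double duplicateRegenerationCost,
                                    double humanReviewCost,
                                    int acceptedOutputs) {
        if (acceptedOutputs <= 0) {
            throw new IllegalArgumentException("acceptedOutputs must be positive");
        }
        double total = firstGenerationCost + repairGenerationCost + validationCost
                + duplicateRegenerationCost + humanReviewCost;
        return total / acceptedOutputs;
    }
}
```

With made-up numbers, a model whose first generation costs 10.0 but needs little repair and review (20.0 total over 80 accepted outputs) comes out at 0.25 per accepted output, while a model whose first generation costs only 4.0 but triggers heavy repair and review (32.0 total over 64 accepted outputs) comes out at 0.50.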


13. Model shortlist I would test

I would still avoid choosing the model from public vibes alone.

But if I had to build an initial shortlist, I would test models that cover different tradeoffs:

| Candidate | Why test it |
| --- | --- |
| google/gemma-4-26B-A4B-it | First practical candidate; strong size/performance profile |
| google/gemma-4-31B-it | Gemma-family quality ceiling |
| Qwen/Qwen3.6-27B | Dense 27B challenger |
| Qwen/Qwen3.6-35B-A3B | Efficient MoE challenger |
| mistralai/Mistral-Small-4-119B-2603 | Heavier quality comparison |
| CohereLabs/command-a-plus-05-2026-w4a4 | Enterprise/business prose comparison |
| moonshotai/Kimi-K2-Instruct-0905 | Upper-bound comparison |
| meta-llama/Llama-3.3-70B-Instruct | Stable baseline |

Why Gemma 4 should be included

I would definitely include the Gemma 4 family, especially:

google/gemma-4-26B-A4B-it
google/gemma-4-31B-it

google/gemma-4-26B-A4B-it is interesting because it is a Mixture-of-Experts model. OpenRouter describes it as 25.2B total parameters with only 3.8B active per token, 256K context, structured output support, function calling, reasoning mode, and Apache 2.0 licensing:

  • OpenRouter: Gemma 4 26B A4B
  • Hugging Face: google/gemma-4-26B-A4B-it
  • Google Gemma 4 model card

For this task, I would treat it as the first practical candidate.

I would use:

Gemma 4 26B A4B:
  first model to try
  strong size/performance candidate
  good API-evaluation candidate

Gemma 4 31B:
  quality ceiling inside Gemma 4
  useful to check whether A4B loses anything important

Why Qwen3.6 should be included

I would also test:

Qwen/Qwen3.6-27B
Qwen/Qwen3.6-35B-A3B

Qwen/Qwen3.6-27B is a strong dense comparison point. Its model card says the artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, and similar runtimes:

  • Hugging Face: Qwen/Qwen3.6-27B
  • Qwen blog: Qwen3.6-27B

I would test it for:

- instruction following
- JSON/schema stability
- factual discipline
- natural business prose
- repair rate

Qwen/Qwen3.6-35B-A3B is also worth testing as an efficient MoE-style challenger:

  • Hugging Face: Qwen/Qwen3.6-35B-A3B

Why Mistral Small 4 and Command A+ might be useful

I would include Mistral Small 4 if budget and latency allow it:

  • Hugging Face: mistralai/Mistral-Small-4-119B-2603
  • OpenRouter: Mistral Small 4

I would include it not because every profile needs heavy reasoning, but to see whether a stronger model reduces:

- unsupported claims
- repair attempts
- generic filler
- duplicate-like phrasing
- awkward FAQ output

I would also test Command A+ if business/enterprise prose is important:

  • Hugging Face: CohereLabs/command-a-plus-05-2026-w4a4
  • Cohere: Introducing Command A+

Command A+ is interesting as an enterprise/business prose comparison, not necessarily as the first production choice.

How I would choose

I would not choose based on first-generation prose quality alone.

For each model, I would measure:

| Metric | Meaning |
| --- | --- |
| schema_pass_rate | Does it follow the JSON contract? |
| unsupported_claim_rate | Does it invent facts? |
| forbidden_claim_rate | Does it output banned claims? |
| duplicate_risk_rate | Does it produce near-duplicate text? |
| sparse_profile_inflation_rate | Does it inflate weak input? |
| repair_attempts_per_accepted_output | How often does it need fixing? |
| human_acceptance_rate | Do reviewers accept it? |
| accepted_output_cost | True cost after validation/repair/review |

My initial practical bet would be:

Start with:
  google/gemma-4-26B-A4B-it
  Qwen/Qwen3.6-27B
  mistralai/Mistral-Small-4-119B-2603

Then expand to:
  google/gemma-4-31B-it
  Qwen/Qwen3.6-35B-A3B
  CohereLabs/command-a-plus-05-2026-w4a4

If Gemma 4 26B A4B gives strong validation pass rates and low repair rates, I would favor it as the first production candidate because of its size/performance profile.

If Qwen3.6 follows constraints better, I would choose Qwen.

If Mistral Small 4 dramatically reduces unsupported claims and repair attempts, I would consider paying more for it.

The model decision should come after the pipeline exists, because the pipeline defines what “good” means.


14. Improve operator input UX

If the operator data is weak, the model has only two safe choices:

write short content
or ask for more data

The unsafe choice is:

inflate sparse data into a long profile

So I would improve the onboarding form.

Collect structured fields like:

- primary service
- secondary services
- city / service area
- years of experience
- license / certification
- insurance
- languages
- availability
- rate / price range
- specialties
- customer type
- examples of work
- short self-written note
- proof fields for verified claims

Then use a fact density score:

| Fact density | Content policy |
| --- | --- |
| high | full profile, services, FAQ, SEO title/meta |
| medium | shorter bio, limited FAQ |
| low | short profile only, ask for more facts, maybe public-unverified or noindex |

This may improve SEO quality more than changing the model.

The best way to make useful pages is to collect useful facts.
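
The fact density policy above could be encoded roughly like this. It is a sketch with invented numbers: the fact-count cut-offs and word/FAQ limits are placeholders to show the shape, and `FactDensityPolicy` and `ContentLimits` are hypothetical names.

```java
// Hypothetical mapping from fact count to content limits.
final class FactDensityPolicy {
    enum Density { HIGH, MEDIUM, LOW }

    static final class ContentLimits {
        final int maxBioWords;
        final int maxFaqCount;
        final boolean noindex;

        ContentLimits(int maxBioWords, int maxFaqCount, boolean noindex) {
            this.maxBioWords = maxBioWords;
            this.maxFaqCount = maxFaqCount;
            this.noindex = noindex;
        }
    }

    private FactDensityPolicy() {}

    // Illustrative cut-offs; calibrate them against real operator records.
    static Density classify(int factCount) {
        if (factCount >= 8) {
            return Density.HIGH;
        }
        if (factCount >= 4) {
            return Density.MEDIUM;
        }
        return Density.LOW;
    }

    static ContentLimits limitsFor(Density density) {
        switch (density) {
            case HIGH:
                return new ContentLimits(140, 4, false);
            case MEDIUM:
                return new ContentLimits(80, 2, false);
            default:
                // Sparse input: short profile, no FAQ, kept out of the index for now.
                return new ContentLimits(40, 0, true);
        }
    }
}
```

The limits would be written into the fact pack's `content_limits` before generation, so a sparse operator cannot end up with a long page regardless of what the model produces.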


15. Use content versioning

Do not overwrite generated content in place.

Possible tables:

operator
operator_profile
operator_profile_version
generation_job
generation_outbox
generation_validation_report
profile_embedding
manual_review_task
operator_edit

Each generated version should store:

operator_id
profile_version_id
generated_json
published_json
validation_report
source_fact_hash
prompt_version
schema_version
provider
model
generation_params
created_at
published_at
verified_at

This matters because:

  • the operator may edit AI content
  • the platform may verify claims later
  • a new model may regenerate content
  • reviewers may approve or reject changes
  • you need rollback
  • edits become useful future eval/fine-tuning data

16. Do not start with fine-tuning

Fine-tuning can help later, but I would not start there.

First build:

- content schema
- fact pack
- validators
- duplicate checks
- private evals
- validation reports
- review states

Only after that would I consider fine-tuning.

Later, you can use:

operator facts
  + generated output
  + validation report
  + operator edits
  + reviewer decisions

to create:

SFT data:
  fact pack → good structured profile JSON

Preference data:
  chosen good output vs rejected bad output

Verifier data:
  fact pack + generated profile → validation report

If you fine-tune, I would start with LoRA/QLoRA rather than full fine-tuning:

  • Hugging Face PEFT LoRA docs
  • Hugging Face PEFT quantization / QLoRA guide
  • TRL SFTTrainer
  • TRL DPOTrainer

But that is a later phase.


Practical build order

Phase 1: Offline prototype

1. Collect 100-300 representative operator records
2. Define content schema
3. Define fact pack schema
4. Define forbidden claims
5. Generate outputs with 2-4 models
6. Validate schema
7. Validate factuality
8. Check duplicate risk
9. Human-review 30-50 outputs
10. Tune prompt/schema/validators

Phase 2: MVP generation pipeline

1. Add generation_job table
2. Add content version table
3. Add validation report table
4. Add outbox table
5. Add worker
6. Add OpenRouter adapter
7. Add structured output
8. Add schema validation
9. Add basic fact/forbidden-claim checks
10. Add repair loop

Phase 3: SEO and duplicate safety

1. Add fact density scoring
2. Add sparse profile policy
3. Add n-gram duplicate checks
4. Add embeddings
5. Add pgvector similarity search
6. Add same-city/service duplicate policy
7. Add noindex/review-required rules for weak pages

Phase 4: Review and verification

1. Add PUBLIC_UNVERIFIED state
2. Add REVIEW_REQUIRED state
3. Add VERIFIED state
4. Add reviewer UI
5. Add operator edit UI
6. Store edits and review decisions

Phase 5: Model and tuning improvements

1. Run private evals regularly
2. Compare models by accepted-output cost
3. Add best-of-N generation if needed
4. Build verifier/reward model if useful
5. Consider LoRA/QLoRA or DPO after enough data exists

What I would avoid

I would avoid:

input row → one prompt → final paragraph → publish

I would avoid treating READY as trusted.

I would avoid writing long content for sparse operators.

I would avoid asking the model to make pages unique without measuring duplication.

I would avoid making “SEO-friendly” the main instruction.

I would avoid fine-tuning before you have evals and validation data.

I would avoid coupling business logic directly to one LLM provider.


Final summary

If I were building this from scratch, I would build a system that controls whether generated content is:

allowed
grounded
distinct
useful
publishable
reviewable
verifiable
versioned

The LLM is only the prose-generation component.

My first priorities would be:

1. content contract
2. fact pack
3. structured output
4. validation report
5. duplicate scoring
6. SEO/content quality policy
7. public-unverified vs verified states
8. private evals
9. reliable async jobs
10. operator input improvement
11. model comparison by accepted-output cost

The central rule:

The model can write the words, but the application should own the truth, consistency, publishing policy, and quality gates.
