
Asymmetric parallel inference using consumer RTX PC

Hugging Face Forums [Unofficial] June 15, 2026

Hmm… at first I thought this might be feasible on many systems, but after digging into it, I’m less sure: in quite a few environments, the CPU-vs-iGPU speed gap may not be that large.

I think the core idea is interesting. I would just separate a few layers that are mixed together in the proposal.

The part that sounds strongest to me is not “parallel inference” in the usual model-parallel sense. The main model is not being split across multiple GPUs. I would describe this more as a sidecar guardrail / out-of-band validator:

RTX / main GPU:
  main LLM generation

CPU:
  deterministic validation
  orchestration
  parsing
  watchdogs

iGPU or CPU sidecar:
  optional small guard model
  semantic checks
  short pass/fail verdicts

That framing makes the idea much easier to reason about.

1. The promising part

The promising part is:

use the main RTX GPU for the expensive model, and use some other local compute path to cheaply monitor the output.

That can make sense, especially if the Sentinel is not trying to be another full assistant.

A good Sentinel should probably be narrow and boring:

{
  "verdict": "pass",
  "risk": "low",
  "reason": null
}

or:

{
  "verdict": "fail",
  "risk": "medium",
  "reason": "instruction_drift",
  "action": "retry_with_stricter_format"
}

That is a much easier task than running a second general-purpose chatbot.
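To make "narrow and boring" concrete, here is a minimal sketch of how the host application might parse such a verdict strictly. The field names come from the examples above; the `Verdict` class, the `parse_verdict` helper, and the extra "high" risk level are my own assumptions, not part of the proposal.

```python
import json
from dataclasses import dataclass
from typing import Optional

ALLOWED_VERDICTS = {"pass", "fail"}
ALLOWED_RISKS = {"low", "medium", "high"}  # "high" is an assumed extra level
ALLOWED_FIELDS = {"verdict", "risk", "reason", "action"}


@dataclass(frozen=True)
class Verdict:
    verdict: str
    risk: str
    reason: Optional[str] = None
    action: Optional[str] = None


def parse_verdict(raw: str) -> Verdict:
    """Parse Sentinel output; raise ValueError on anything unexpected."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    unknown = set(data) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    verdict = data.get("verdict")
    if not isinstance(verdict, str) or verdict not in ALLOWED_VERDICTS:
        raise ValueError("verdict must be 'pass' or 'fail'")
    risk = data.get("risk")
    if not isinstance(risk, str) or risk not in ALLOWED_RISKS:
        raise ValueError("unknown risk level")
    for key in ("reason", "action"):
        value = data.get(key)
        if value is not None and not isinstance(value, str):
            raise ValueError(f"{key} must be a string or null")
    return Verdict(verdict, risk, data.get("reason"), data.get("action"))
```

Anything the parser rejects can be treated as a Sentinel failure in its own right, which keeps the small model from drifting into free-form answers.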

2. The hardware question is not just “does it have an iGPU?”

For LLM work, I would use local-LLM backend support as the practical boundary.

Useful references:

Ollama hardware support
llama.cpp SYCL backend

Ollama’s practical GPU paths are roughly:

Path	Meaning
NVIDIA CUDA	normal consumer dGPU path
AMD ROCm	good when the exact AMD GPU/APU is supported
Apple Metal	Apple Silicon path
Vulkan	broader Windows/Linux path, but benchmark carefully

For Intel iGPU, I would not think of old Intel HD/UHD graphics as the target. I would start around Iris Xe / 11th Gen Core, with Core Ultra built-in Arc or Lunar Lake Arc being much more plausible.

For AMD, I would not count the tiny 2-CU display iGPU in normal AM5 Ryzen chips as the interesting target. I would start the serious discussion around Radeon 780M / Radeon 890M / Ryzen AI / Ryzen AI Max class hardware.

Roughly:

Hardware class	My expectation for this idea
old Intel HD/UHD graphics	probably not worth targeting
Intel UHD 730/770-class desktop iGPU	maybe visible, likely weak
Intel Iris Xe 80/96EU	lower bound worth testing
Intel Core Ultra built-in Arc	plausible
Lunar Lake Arc 130V/140V	promising, but backend-sensitive
normal AM5 Ryzen 2-CU iGPU	probably not the target
AMD Radeon 760M	maybe
AMD Radeon 780M	plausible lower bound
AMD Radeon 890M / Ryzen AI 300	stronger candidate
Ryzen AI Max / Radeon 8050S/8060S	serious UMA local-LLM class
RTX 3090 / 4090 / etc.	main inference engine, not the sidecar

3. CPU baseline is the key test

This is the most important practical point.

For this specific Sentinel idea, I would not assume:

GPU = faster than CPU

The Sentinel workload is likely to be:

batch size 1
small model, maybe 1B–3B
short output
low latency
no large serving batch
mostly short verdicts

That is not necessarily where a weak GPU shines.

GPU speed in LLM inference mostly comes from:

Source of speed	Why it helps
high memory bandwidth	important for token generation / decode
optimized matrix kernels	important for matmul-heavy work
enough parallel work	important for keeping GPU compute busy
batching	improves throughput and weight reuse
mature backend	CUDA/ROCm/SYCL/Vulkan quality matters a lot

A useful split is prefill vs decode:

Phase	What happens	Performance character
prefill	read/process the prompt	more GPU-friendly, more compute-heavy
decode	generate one token at a time	often memory-bound, more sequential

References:

NVIDIA: Mastering LLM Techniques: Inference Optimization
HF/TNG: Prefill and Decode for Concurrent Requests
Redis: Prefill vs Decode
SARATHI paper

This matters because an iGPU is not a small dGPU. A dGPU often wins through dedicated high-bandwidth VRAM. An iGPU usually shares system RAM with the CPU, so it does not fully inherit the usual dGPU advantage.
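A back-of-envelope way to see why shared RAM matters: if decode is memory-bound, each generated token has to stream roughly the whole set of weights from memory once, so bandwidth divided by weight size gives a rough ceiling on tok/s. The sketch below is only that rule of thumb. The function name is mine, and the inputs are illustrative assumptions, not measurements of any specific machine.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough decode ceiling: one full pass over the weights per token."""
    if bandwidth_gb_s <= 0 or weights_gb <= 0:
        raise ValueError("inputs must be positive")
    return bandwidth_gb_s / weights_gb


# Illustrative inputs only (assumptions, not benchmarks).
# Dual-channel DDR5-5600 peaks at 5600 MT/s * 8 bytes * 2 channels = 89.6 GB/s.
SHARED_RAM_GB_S = 89.6
WEIGHTS_GB = 2.0  # hypothetical size of a quantized ~3B guard model

ceiling = decode_ceiling_tok_s(SHARED_RAM_GB_S, WEIGHTS_GB)
print(f"shared-RAM decode ceiling: about {ceiling:.1f} tok/s")
```

The point is that the CPU and the iGPU sit behind the same memory bus, so this ceiling applies to both. Real numbers land below it, and how far below depends on the backend.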

Also, CPU-only inference is not a weak baseline. llama.cpp/Ollama-style CPU inference uses quantized weights and SIMD-heavy kernels. For small guard models, CPU can be surprisingly competitive.

So I would benchmark CPU-only first.

4. Public iGPU evidence looks mixed

The public numbers I found are uneven, which is probably the right lesson.

Intel Core Ultra 5 125H

One public Core Ultra 5 125H benchmark found that the iGPU often beat CPU, but not by a huge margin. Some very small models were faster on CPU.

Source: Benchmarking AI models on a Core Ultra 5 125H iGPU

Model	CPU tok/s	iGPU tok/s	iGPU/CPU
llama3.1:8b	9.76	12.69	1.30x
qwen2.5:7b	10.26	13.06	1.27x
phi4:14b	5.27	7.11	1.35x
llama3.2:3b	20.63	23.20	1.12x
smollm2:1.7b	27.41	27.84	1.02x
smollm2:360m	57.56	35.13	0.61x
opencoder:1.5b	32.88	17.67	0.54x

So for Intel Core Ultra-class iGPU, I would not assume a dramatic win over CPU. It may still be useful, especially for freeing CPU or reducing power, but the CPU baseline is very real.

AMD Radeon 780M

There is at least one Ollama issue where Radeon 780M clearly beats CPU on the same machine.

Source: Ollama issue: Support for Radeon 780M

Hardware	CPU	Radeon 780M path	Ratio
Ryzen 7 PRO 7840U + Radeon 780M	6.23 tok/s	18.66 tok/s	about 3.0x

That is promising, but the issue is also backend/workaround related. I would read it as:

Radeon 780M-class hardware can cross the CPU baseline under the right backend/driver/memory conditions.

Not:

Every 780M setup will automatically do this.

Radeon 780M and 890M standalone numbers

LocalScore has standalone iGPU results that are useful as rough hints, though not always CPU comparisons.

Radeon 780M:

Source: LocalScore: Radeon 780M

Model	Prompt speed	Generation speed
Llama 3.2 1B Q4_K	690 tok/s	11.7 tok/s
Llama 3.2 3B Q4_K	288 tok/s	6.4 tok/s

Radeon 890M:

Source: LocalScore: Radeon 890M

Model	Prompt speed	Generation speed
Llama 3.2 1B Q4_K	551 tok/s	67.0 tok/s
Llama 3.1 8B Q4_K	99 tok/s	12.9 tok/s
Qwen2.5 14B Q4_K	51 tok/s	7.1 tok/s

These do not settle CPU-vs-iGPU by themselves, but they support the general pattern: prompt processing can look very fast, while generation speed is the number to check for short interactive validation.
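To turn a prompt-speed / generation-speed pair into something closer to verdict latency, a rough estimate is prompt tokens divided by prompt speed plus output tokens divided by generation speed. This is a sketch under assumptions: the 512-token buffer and 16-token verdict are workload sizes I picked for illustration, and the estimate ignores model load, scheduling, and any backend overhead.

```python
def estimate_verdict_latency_s(prompt_tokens, output_tokens, prompt_tok_s, gen_tok_s):
    """Return (prefill_s, decode_s, total_s) for one Sentinel call."""
    if prompt_tok_s <= 0 or gen_tok_s <= 0:
        raise ValueError("speeds must be positive")
    prefill_s = prompt_tokens / prompt_tok_s
    decode_s = output_tokens / gen_tok_s
    return prefill_s, decode_s, prefill_s + decode_s


# Speeds from the Radeon 780M / Llama 3.2 1B Q4_K row above.
# 512 prompt tokens and 16 output tokens are assumed workload sizes.
prefill_s, decode_s, total_s = estimate_verdict_latency_s(512, 16, 690.0, 11.7)
print(f"prefill {prefill_s:.2f}s + decode {decode_s:.2f}s = {total_s:.2f}s")
```

With those inputs the 16 generated tokens cost more than scanning the 512-token buffer. Which term dominates depends on how long the scanned buffer is relative to the verdict, so both speeds need to be checked against the real Sentinel prompt.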

Arc 140V / Lunar Lake

I found Arc 140V promising, but harder to summarize cleanly. Public results look more backend-sensitive. Some community results suggest backend choice can flip the iGPU from losing to the CPU to clearly beating it.

So I would avoid overclaiming here unless the exact runtime/backend is specified.

5. A practical interpretation table

For this sidecar design, I would interpret results like this:

iGPU result vs CPU-only	Interpretation
slower than CPU	probably not worth it except as an experiment
about equal	maybe useful if it frees CPU for orchestration/RAG/validation
1.2–1.5x faster	plausible for a sidecar Sentinel
2x+ faster	clearly useful
3x+ faster	strong result
unstable/backend-specific	experimental until reproducible

So I would not describe the design as “the iGPU makes it faster” until the CPU-only baseline is measured.

A more careful claim would be:

The iGPU is useful if it either beats CPU-only on the actual Sentinel workload, or gives similar latency while freeing CPU resources for the rest of the application.
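The interpretation table can be written down as a tiny helper so that benchmark results get labeled consistently. This is a sketch: the band edges at 0.95 and the folding of the unlisted 1.5x to 2x range into "plausible" are my assumptions, since the table leaves those gaps open.

```python
def speedup(cpu_tok_s: float, igpu_tok_s: float) -> float:
    """iGPU/CPU ratio for the same model, quant, and prompt."""
    if cpu_tok_s <= 0:
        raise ValueError("CPU baseline must be positive")
    return igpu_tok_s / cpu_tok_s


def interpret(ratio: float) -> str:
    """Map an iGPU/CPU ratio to the labels in the table above."""
    if ratio < 0.95:  # assumed tolerance for "about equal"
        return "slower than CPU"
    if ratio < 1.2:
        return "about equal"
    if ratio < 2.0:  # table lists 1.2-1.5x; 1.5-2x folded in as an assumption
        return "plausible sidecar"
    if ratio < 3.0:
        return "clearly useful"
    return "strong result"


# Rows from the Core Ultra 5 125H table above: (model, CPU tok/s, iGPU tok/s).
for name, cpu, igpu in [
    ("llama3.1:8b", 9.76, 12.69),
    ("smollm2:1.7b", 27.41, 27.84),
    ("smollm2:360m", 57.56, 35.13),
]:
    ratio = speedup(cpu, igpu)
    print(f"{name}: {ratio:.2f}x, {interpret(ratio)}")
```

A label like "about equal" still needs the second half of the claim checked separately, namely whether the iGPU path actually frees CPU time for the rest of the application.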

6. What should be deterministic code vs small model

I like the Sentinel idea more if it is layered.

I would not use a small LLM for everything.

Check	Best layer
valid JSON syntax	JSON parser
required fields / enum / type	JSON Schema / Pydantic / Zod
regex match	regex
repetition loop	token/string heuristic
timeout/stall	watchdog / heartbeat
malformed tool call	deterministic parser + schema
instruction drift	small judge/classifier
safety classification	guard model
prompt injection / jailbreak risk	small classifier / guard model
tool-call plausibility	rules + small semantic model
repair hint	optional small model or main model retry

For structured output, local runtimes already provide useful tools:

Ollama structured outputs
llama.cpp grammars / GBNF

Those can reduce malformed output, but I would still validate downstream. Grammar-constrained decoding is not a replacement for application-level validation.
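As an illustration of the deterministic layer, here is a stdlib-only sketch of two checks from the table: JSON syntax plus required fields, and a crude repetition-loop heuristic. The function names and the n-gram thresholds are hypothetical choices of mine. A real setup would more likely use JSON Schema, Pydantic, or Zod for the field checks.

```python
import json


def check_json_object(text, required):
    """Return a list of problems; an empty list means the check passed."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    return [f"missing field: {name}" for name in required if name not in data]


def has_repetition_loop(text, ngram=4, max_repeats=3):
    """Flag text where any word n-gram occurs more than max_repeats times."""
    words = text.split()
    counts = {}
    for i in range(len(words) - ngram + 1):
        key = tuple(words[i:i + ngram])
        counts[key] = counts.get(key, 0) + 1
        if counts[key] > max_repeats:
            return True
    return False


print(check_json_object('{"name": "search"}', ["name", "arguments"]))
print(has_repetition_loop("I am sorry but " * 6))
```

Neither check needs a model, and both run in well under the time it takes a small LLM to emit a single token, which is the argument for keeping them out of the Sentinel model.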

7. The guardrail part has existing parallels

The Sentinel model idea is not strange. It resembles existing guardrail/classifier designs.

Useful references:

Llama Guard paper
Llama Guard 3 1B model card
Llama Prompt Guard 2 86M
NeMo Guardrails paper
NeMo Guardrails GitHub

So the stronger framing is not:

A tiny LLM checks everything.

It is:

Deterministic validators do deterministic work, and a small semantic guard model handles fuzzy cases.

8. What would make the proposal convincing

I think this benchmark would make the idea much more concrete:

Benchmark	Why it matters
CPU-only Sentinel	required baseline
iGPU Sentinel	real sidecar speed
same model, same quant, same prompt	avoids fake comparisons
pp512 / pp1024	can it scan output buffers quickly?
tg16 / tg32 / tg64	can it emit short verdicts fast?
end-to-end verdict latency	the real user-facing number
concurrent RTX generation + Sentinel	checks interference
CPU load while iGPU runs	measures “freeing CPU” benefit
failure path latency	retry/repair may dominate
false positives / false negatives	guardrail quality matters
device selection stability	important in RTX + iGPU systems
fallback behavior	what happens if the GPU backend fails?

For this design, the most important number is not just tok/s. It is:

How long until the system can safely release, block, or repair the main model’s response?
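A minimal harness for that end-to-end number could look like the sketch below. It assumes the Sentinel is wrapped in some callable that takes the buffered text and returns once a verdict is available; `run_sentinel` and `measure_verdict_latency` are hypothetical names, and the p95 here is a simple sorted-index pick rather than an interpolated percentile.

```python
import statistics
import time
from typing import Callable, Sequence


def measure_verdict_latency(
    run_sentinel: Callable[[str], object],
    samples: Sequence[str],
    warmup: int = 1,
) -> dict:
    """Time run_sentinel on each sample and summarize wall-clock latency."""
    if not samples:
        raise ValueError("need at least one sample")
    for text in samples[:warmup]:
        run_sentinel(text)  # warm caches / model load; not timed
    timings = []
    for text in samples:
        start = time.perf_counter()
        run_sentinel(text)
        timings.append(time.perf_counter() - start)
    timings.sort()
    p95_index = min(len(timings) - 1, int(0.95 * len(timings)))
    return {
        "n": len(timings),
        "median_s": statistics.median(timings),
        "p95_s": timings[p95_index],
        "max_s": timings[-1],
    }


# Stand-in Sentinel for demonstration; replace with the real CPU or iGPU call.
print(measure_verdict_latency(lambda text: len(text) > 0, ["ok", "also ok", "fine"]))
```

Running the same harness once with the CPU-only Sentinel and once with the iGPU Sentinel, while the RTX is generating, covers several rows of the benchmark table in one go.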

9. A concrete version of the architecture

I would build the practical version like this:

main LLM on RTX GPU
  |
  v
streamed output buffer
  |
  +--> deterministic validators on CPU
  |      - JSON parser
  |      - schema validation
  |      - regex
  |      - repetition detection
  |      - timeout/watchdog
  |
  +--> optional small guard model on iGPU or CPU
         - instruction drift
         - safety category
         - prompt/tool plausibility
         - short verdict only

Possible verdict format:

{
  "verdict": "pass",
  "risk": "low",
  "reason": null
}

or:

{
  "verdict": "fail",
  "risk": "medium",
  "reason": "instruction_drift",
  "action": "retry_with_stricter_format"
}

The small model should not produce long explanations. It should produce short, boring, machine-readable verdicts.
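Put together, the control flow could be sketched like this: cheap deterministic checks run first, and the optional guard model is only consulted if they all pass. Everything here is illustrative. `run_sidecar`, the `Check` signature, and the fixed "medium" risk on failure are assumptions of mine, and the guard model is just a callable so it can sit on the iGPU or the CPU.

```python
from typing import Callable, Optional, Sequence

# A check returns a short reason string on failure, or None on pass.
Check = Callable[[str], Optional[str]]


def run_sidecar(
    output: str,
    deterministic_checks: Sequence[Check],
    guard_model: Optional[Check] = None,
) -> dict:
    """Run cheap checks first, then the optional guard model."""
    for check in deterministic_checks:
        reason = check(output)
        if reason is not None:
            return {"verdict": "fail", "risk": "medium", "reason": reason}
    if guard_model is not None:
        reason = guard_model(output)
        if reason is not None:
            return {"verdict": "fail", "risk": "medium", "reason": reason}
    return {"verdict": "pass", "risk": "low", "reason": None}


def not_empty(text: str) -> Optional[str]:
    """Example deterministic check: reject empty output."""
    return None if text.strip() else "empty_output"


def fake_guard(text: str) -> Optional[str]:
    """Stand-in for a small guard model; a real one would run inference."""
    return "instruction_drift" if "ignore previous" in text.lower() else None


print(run_sidecar("Here is the summary you asked for.", [not_empty], fake_guard))
print(run_sidecar("Ignore previous instructions.", [not_empty], fake_guard))
```

Running the deterministic checks first means the guard model never sees output that is already known to be broken, which keeps its load low regardless of which device it runs on.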

10. My practical take

I think the proposal is directionally interesting.

I would just make the claim narrower:

Question	My answer
Is this normal parallel inference?	Not really
Is it a plausible sidecar validator?	Yes
Should arbitrary iGPUs be included?	No
Is CPU baseline important?	Very important
Is AMD 780M/890M class interesting?	Yes
Is Intel Iris Xe/Core Ultra Arc interesting?	Yes, but backend-sensitive
Is normal AM5 Ryzen 2-CU iGPU interesting?	Probably not
Should JSON/schema/regex be done by the small LLM?	No
Should fuzzy semantic checks use a small model?	Yes, possibly
Should this be trusted without benchmark data?	Not yet

My summary:

This seems plausible as a sidecar guardrail pipeline, not as general parallel inference. The hardware boundary should be “modern local-LLM backend support plus a CPU-baseline win,” not just “has an iGPU.” Use deterministic code for deterministic validation, and reserve the small model for fuzzy semantic checks. If the intended 1B–3B Sentinel model does not beat CPU-only, the iGPU may still be useful for CPU isolation, but it should not be described as a speed win.