Asymmetric parallel inference using consumer RTX PC
Hmm… at first I thought this might be feasible on many systems, but after digging into it, I’m less sure: in quite a few environments, the CPU-vs-iGPU speed gap may not be that large.
I think the core idea is interesting. I would just separate a few layers that are mixed together in the proposal.
The part that sounds strongest to me is not “parallel inference” in the usual model-parallel sense. The main model is not being split across multiple GPUs. I would describe this more as a sidecar guardrail / out-of-band validator:
    RTX / main GPU:
      main LLM generation

    CPU:
      deterministic validation
      orchestration
      parsing
      watchdogs

    iGPU or CPU sidecar:
      optional small guard model
      semantic checks
      short pass/fail verdicts
That framing makes the idea much easier to reason about.
1. The promising part
The promising part is:
use the main RTX GPU for the expensive model, and use some other local compute path to cheaply monitor the output.
That can make sense, especially if the Sentinel is not trying to be another full assistant.
A good Sentinel should probably be narrow and boring:
    {
      "verdict": "pass",
      "risk": "low",
      "reason": null
    }
or:
    {
      "verdict": "fail",
      "risk": "medium",
      "reason": "instruction_drift",
      "action": "retry_with_stricter_format"
    }
That is a much easier task than running a second general-purpose chatbot.
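To make "narrow and boring" concrete, here is a minimal sketch of how the orchestrator might parse such a verdict. The `Verdict` class, the `parse_verdict` helper, the `"high"` risk level, and the fail-closed default are my own assumptions for illustration; the examples above only show `pass`/`fail` and `low`/`medium`.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Assumed enums: only "pass"/"fail" and "low"/"medium" appear in the examples;
# "high" is added here purely for illustration.
VERDICTS = {"pass", "fail"}
RISKS = {"low", "medium", "high"}


@dataclass(frozen=True)
class Verdict:
    verdict: str
    risk: str
    reason: Optional[str] = None
    action: Optional[str] = None


# Hypothetical fail-closed default, used whenever the Sentinel output is unusable.
UNPARSEABLE = Verdict("fail", "high", "unparseable_verdict", "retry")


def parse_verdict(raw: str) -> Verdict:
    """Parse the Sentinel's JSON verdict; anything malformed fails closed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return UNPARSEABLE
    if not isinstance(data, dict):
        return UNPARSEABLE
    verdict, risk = data.get("verdict"), data.get("risk")
    if not (isinstance(verdict, str) and verdict in VERDICTS):
        return UNPARSEABLE
    if not (isinstance(risk, str) and risk in RISKS):
        return UNPARSEABLE
    reason, action = data.get("reason"), data.get("action")
    if reason is not None and not isinstance(reason, str):
        return UNPARSEABLE
    if action is not None and not isinstance(action, str):
        return UNPARSEABLE
    return Verdict(verdict, risk, reason, action)
```

The design choice here is that a chatty or broken Sentinel reply is treated as a failure rather than silently ignored, so the small model cannot pass output through by accident.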
2. The hardware question is not just “does it have an iGPU?”
For LLM work, I would use local-LLM backend support as the practical boundary.
Useful references:
- Ollama hardware support
- llama.cpp SYCL backend
Ollama’s practical GPU paths are roughly:
| Path | Meaning |
|---|---|
| NVIDIA CUDA | normal consumer dGPU path |
| AMD ROCm | good when the exact AMD GPU/APU is supported |
| Apple Metal | Apple Silicon path |
| Vulkan | broader Windows/Linux path, but benchmark carefully |
For Intel iGPU, I would not think of old Intel HD/UHD graphics as the target. I would start around Iris Xe / 11th Gen Core, with Core Ultra built-in Arc or Lunar Lake Arc being much more plausible.
For AMD, I would not count the tiny 2-CU display iGPU in normal AM5 Ryzen chips as the interesting target. I would start the serious discussion around Radeon 780M / Radeon 890M / Ryzen AI / Ryzen AI Max class hardware.
Roughly:
| Hardware class | My expectation for this idea |
|---|---|
| old Intel HD/UHD graphics | probably not worth targeting |
| Intel UHD 730/770-class desktop iGPU | maybe visible, likely weak |
| Intel Iris Xe 80/96EU | lower bound worth testing |
| Intel Core Ultra built-in Arc | plausible |
| Lunar Lake Arc 130V/140V | promising, but backend-sensitive |
| normal AM5 Ryzen 2-CU iGPU | probably not the target |
| AMD Radeon 760M | maybe |
| AMD Radeon 780M | plausible lower bound |
| AMD Radeon 890M / Ryzen AI 300 | stronger candidate |
| Ryzen AI Max / Radeon 8050S/8060S | serious UMA local-LLM class |
| RTX 3090 / 4090 / etc. | main inference engine, not the sidecar |
3. CPU baseline is the key test
This is the most important practical point.
For this specific Sentinel idea, I would not assume:
GPU = faster than CPU
The Sentinel workload is likely to be:
- batch size 1
- small model, maybe 1B–3B
- short output
- low latency
- no large serving batch
- mostly short verdicts
That is not necessarily where a weak GPU shines.
GPU speed in LLM inference mostly comes from:
| Source of speed | Why it helps |
|---|---|
| high memory bandwidth | important for token generation / decode |
| optimized matrix kernels | important for matmul-heavy work |
| enough parallel work | important for keeping GPU compute busy |
| batching | improves throughput and weight reuse |
| mature backend | CUDA/ROCm/SYCL/Vulkan quality matters a lot |
A useful split is prefill vs decode:
| Phase | What happens | Performance character |
|---|---|---|
| prefill | read/process the prompt | more GPU-friendly, more compute-heavy |
| decode | generate one token at a time | often memory-bound, more sequential |
References:
- NVIDIA: Mastering LLM Techniques: Inference Optimization
- HF/TNG: Prefill and Decode for Concurrent Requests
- Redis: Prefill vs Decode
- SARATHI paper
This matters because an iGPU is not a small dGPU. A dGPU often wins through dedicated high-bandwidth VRAM. An iGPU usually shares system RAM with the CPU, so it does not fully inherit the usual dGPU advantage.
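A back-of-the-envelope way to see this: if decode is memory-bound, every generated token has to stream roughly the whole weight file once, so bandwidth divided by weight size gives a rough ceiling on tok/s. The sketch below uses theoretical peak bandwidths (dual-channel DDR5-5600 and an RTX 4090) and an assumed 2.0 GB weight file, roughly a 3B model at 4-bit quantization. These inputs are my illustrative assumptions, not measurements, and real speeds land below the ceiling.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough upper bound on decode speed if each token streams all weights once."""
    if bandwidth_gb_s <= 0 or weights_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_s / weights_gb


# Theoretical peaks: 5600 MT/s * 8 bytes * 2 channels, and the RTX 4090 spec figure.
SHARED_DDR5_GB_S = 5600e6 * 8 * 2 / 1e9  # 89.6 GB/s, shared by CPU and iGPU
RTX_4090_GB_S = 1008.0
WEIGHTS_GB = 2.0  # assumed size of a ~3B model at 4-bit quantization

print(f"CPU or iGPU on shared DDR5: <= {decode_ceiling_tok_s(SHARED_DDR5_GB_S, WEIGHTS_GB):.1f} tok/s")
print(f"RTX 4090 VRAM:              <= {decode_ceiling_tok_s(RTX_4090_GB_S, WEIGHTS_GB):.1f} tok/s")
```

The point is that the CPU and the iGPU sit under the same ceiling, because they read from the same RAM. Under this simple model, any iGPU advantage in decode has to come from getting closer to that ceiling, not from raising it.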
Also, CPU-only inference is not a weak baseline. llama.cpp/Ollama-style CPU inference uses quantized weights and SIMD-heavy kernels. For small guard models, CPU can be surprisingly competitive.
So I would benchmark CPU-only first.
4. Public iGPU evidence looks mixed
The public numbers I found are uneven, which is probably the right lesson.
Intel Core Ultra 5 125H
One public Core Ultra 5 125H benchmark found that the iGPU often beat CPU, but not by a huge margin. Some very small models were faster on CPU.
Source: Benchmarking AI models on a Core Ultra 5 125H iGPU
| Model | CPU tok/s | iGPU tok/s | iGPU/CPU |
|---|---|---|---|
| llama3.1:8b | 9.76 | 12.69 | 1.30x |
| qwen2.5:7b | 10.26 | 13.06 | 1.27x |
| phi4:14b | 5.27 | 7.11 | 1.35x |
| llama3.2:3b | 20.63 | 23.20 | 1.12x |
| smollm2:1.7b | 27.41 | 27.84 | 1.02x |
| smollm2:360m | 57.56 | 35.13 | 0.61x |
| opencoder:1.5b | 32.88 | 17.67 | 0.54x |
So for Intel Core Ultra-class iGPU, I would not assume a dramatic win over CPU. It may still be useful, especially for freeing CPU or reducing power, but the CPU baseline is very real.
AMD Radeon 780M
There is at least one Ollama issue where Radeon 780M clearly beats CPU on the same machine.
Source: Ollama issue: Support for Radeon 780M
| Hardware | CPU | Radeon 780M path | Ratio |
|---|---|---|---|
| Ryzen 7 PRO 7840U + Radeon 780M | 6.23 tok/s | 18.66 tok/s | about 3.0x |
That is promising, but the issue is also backend/workaround related. I would read it as:
Radeon 780M-class hardware can cross the CPU baseline under the right backend/driver/memory conditions.
Not:
Every 780M setup will automatically do this.
Radeon 780M and 890M standalone numbers
LocalScore has standalone iGPU results that are useful as rough hints, though not always CPU comparisons.
Radeon 780M:
Source: LocalScore: Radeon 780M
| Model | Prompt speed | Generation speed |
|---|---|---|
| Llama 3.2 1B Q4_K | 690 tok/s | 11.7 tok/s |
| Llama 3.2 3B Q4_K | 288 tok/s | 6.4 tok/s |
Radeon 890M:
Source: LocalScore: Radeon 890M
| Model | Prompt speed | Generation speed |
|---|---|---|
| Llama 3.2 1B Q4_K | 551 tok/s | 67.0 tok/s |
| Llama 3.1 8B Q4_K | 99 tok/s | 12.9 tok/s |
| Qwen2.5 14B Q4_K | 51 tok/s | 7.1 tok/s |
These do not settle CPU-vs-iGPU by themselves, but they support the general pattern: prompt processing can look very fast, while generation speed is the number to check for short interactive validation.
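To show why generation speed dominates for short verdicts, here is a rough first-order latency estimate using the two Llama 3.2 1B rows above. The workload (a 512-token buffer to scan and a 16-token verdict) is my assumption, and the model ignores load time, scheduling, and any contention with the main GPU job.

```python
def verdict_latency_s(prompt_tokens: int, output_tokens: int,
                      prompt_tok_s: float, gen_tok_s: float) -> tuple:
    """Return (prefill_s, decode_s) for one Sentinel check, first-order only."""
    if prompt_tok_s <= 0 or gen_tok_s <= 0:
        raise ValueError("speeds must be positive")
    return prompt_tokens / prompt_tok_s, output_tokens / gen_tok_s


# (prompt tok/s, generation tok/s) from the LocalScore Llama 3.2 1B Q4_K rows above.
LOCALSCORE_1B = {
    "Radeon 780M": (690.0, 11.7),
    "Radeon 890M": (551.0, 67.0),
}

for name, (pp, tg) in LOCALSCORE_1B.items():
    prefill, decode = verdict_latency_s(512, 16, pp, tg)
    print(f"{name}: prefill {prefill:.2f}s + decode {decode:.2f}s = {prefill + decode:.2f}s")
```

With these inputs the 780M row comes out at about 2.1 s per verdict and the 890M row at about 1.2 s, even though the 780M has the faster prompt speed. On the 780M row, the 16 output tokens cost more time than the 512 prompt tokens.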
Arc 140V / Lunar Lake
I found Arc 140V promising, but harder to summarize cleanly. Public results look more backend-sensitive. Some community results suggest backend choice can flip the outcome from the iGPU losing to the CPU to the iGPU clearly beating it.
So I would avoid overclaiming here unless the exact runtime/backend is specified.
5. A practical interpretation table
For this sidecar design, I would interpret results like this:
| iGPU result vs CPU-only | Interpretation |
|---|---|
| slower than CPU | probably not worth it except as an experiment |
| about equal | maybe useful if it frees CPU for orchestration/RAG/validation |
| 1.2–1.5x faster | plausible for a sidecar Sentinel |
| 2x+ faster | clearly useful |
| 3x+ faster | strong result |
| unstable/backend-specific | experimental until reproducible |
So I would not describe the design as “the iGPU makes it faster” until the CPU-only baseline is measured.
A more careful claim would be:
The iGPU is useful if it either beats CPU-only on the actual Sentinel workload, or gives similar latency while freeing CPU resources for the rest of the application.
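The interpretation table can be written down as a small helper so benchmark results get labeled consistently. The table leaves some gaps, so the cut-offs below are my assumptions: "about equal" is taken as 0.95x to 1.2x, and the 1.5x to 2x range, which the table does not cover, is folded into "plausible".

```python
def interpret_speedup(igpu_tok_s: float, cpu_tok_s: float) -> str:
    """Map an iGPU/CPU generation-speed ratio to the interpretation table."""
    if igpu_tok_s <= 0 or cpu_tok_s <= 0:
        raise ValueError("speeds must be positive")
    ratio = igpu_tok_s / cpu_tok_s
    if ratio < 0.95:   # assumed tolerance for "slower than CPU"
        return "slower than CPU: probably not worth it"
    if ratio < 1.2:
        return "about equal: maybe useful if it frees CPU"
    if ratio < 2.0:    # table says 1.2-1.5x; 1.5-2x folded in as an assumption
        return "plausible for a sidecar Sentinel"
    if ratio < 3.0:
        return "clearly useful"
    return "strong result"


# Rows taken from the public numbers quoted earlier in this post.
print(interpret_speedup(12.69, 9.76))   # llama3.1:8b on Core Ultra 5 125H
print(interpret_speedup(35.13, 57.56))  # smollm2:360m on Core Ultra 5 125H
print(interpret_speedup(18.66, 6.23))   # Ryzen 7 PRO 7840U + Radeon 780M
```

Note that the 780M row is 18.66 / 6.23, just under 3.0x, so with a hard 3.0 cut-off it lands in "clearly useful" rather than "strong result". That is a reminder not to read too much into the exact bucket boundaries.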
6. What should be deterministic code vs small model
I like the Sentinel idea more if it is layered.
I would not use a small LLM for everything.
| Check | Best layer |
|---|---|
| valid JSON syntax | JSON parser |
| required fields / enum / type | JSON Schema / Pydantic / Zod |
| regex match | regex |
| repetition loop | token/string heuristic |
| timeout/stall | watchdog / heartbeat |
| malformed tool call | deterministic parser + schema |
| instruction drift | small judge/classifier |
| safety classification | guard model |
| prompt injection / jailbreak risk | small classifier / guard model |
| tool-call plausibility | rules + small semantic model |
| repair hint | optional small model or main model retry |
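As an example of the "token/string heuristic" row, a repetition loop can often be caught without any model at all. This is a minimal sketch that counts repeated word n-grams; the function name and the default thresholds (`n=4`, `max_repeats=3`) are arbitrary assumptions that would need tuning on real outputs.

```python
from collections import Counter


def has_repetition_loop(text: str, n: int = 4, max_repeats: int = 3) -> bool:
    """Return True if any word n-gram occurs more than max_repeats times."""
    words = text.split()
    if len(words) < n:
        return False
    # Count every sliding window of n consecutive words.
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(grams.values()) > max_repeats


looping = "I am sorry I cannot " * 6
normal = "The quick brown fox jumps over the lazy dog near the river bank"
print(has_repetition_loop(looping), has_repetition_loop(normal))
```

A check like this runs in microseconds on the CPU, which is the main argument for keeping it out of the small model.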
For structured output, local runtimes already provide useful tools:
- Ollama structured outputs
- llama.cpp grammars / GBNF
Those can reduce malformed output, but I would still validate downstream. Grammar-constrained decoding is not a replacement for application-level validation.
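Here is a sketch of what "still validate downstream" could look like for a tool call. The tool names, the `arguments`/`path` fields, and the path rule are all hypothetical; the point is only that syntactically valid JSON can still violate application rules that a grammar does not know about.

```python
import json

ALLOWED_TOOLS = {"search", "read_file"}  # hypothetical tool names


def validate_tool_call(raw: str) -> list:
    """Return a list of problems; an empty list means the call passed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["top-level value must be an object"]

    problems = []
    tool = call.get("tool")
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        problems.append("unknown or missing tool")

    args = call.get("arguments")
    if not isinstance(args, dict):
        problems.append("arguments must be an object")
    elif tool == "read_file":
        path = args.get("path")
        # Application-level rule: only relative paths with no parent traversal.
        if (not isinstance(path, str) or path.startswith("/")
                or ".." in path.split("/")):
            problems.append("path must be a relative path without '..'")
    return problems


print(validate_tool_call('{"tool": "read_file", "arguments": {"path": "notes/a.txt"}}'))
print(validate_tool_call('{"tool": "read_file", "arguments": {"path": "../etc/passwd"}}'))
```

The second call is perfectly well-formed JSON and would pass a grammar that only constrains shape, but it still fails the application check.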
7. The guardrail part has existing parallels
The Sentinel model idea is not strange. It resembles existing guardrail/classifier designs.
Useful references:
- Llama Guard paper
- Llama Guard 3 1B model card
- Llama Prompt Guard 2 86M
- NeMo Guardrails paper
- NeMo Guardrails GitHub
So the stronger framing is not:
A tiny LLM checks everything.
It is:
Deterministic validators do deterministic work, and a small semantic guard model handles fuzzy cases.
8. What would make the proposal convincing
I think this benchmark would make the idea much more concrete:
| Benchmark | Why it matters |
|---|---|
| CPU-only Sentinel | required baseline |
| iGPU Sentinel | real sidecar speed |
| same model, same quant, same prompt | avoids fake comparisons |
| pp512 / pp1024 | can it scan output buffers quickly? |
| tg16 / tg32 / tg64 | can it emit short verdicts fast? |
| end-to-end verdict latency | the real user-facing number |
| concurrent RTX generation + Sentinel | checks interference |
| CPU load while iGPU runs | measures “freeing CPU” benefit |
| failure path latency | retry/repair may dominate |
| false positives / false negatives | guardrail quality matters |
| device selection stability | important in RTX + iGPU systems |
| fallback behavior | what happens if the GPU backend fails? |
For this design, the most important number is not just tok/s. It is:
How long until the system can safely release, block, or repair the main model’s response?
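For the end-to-end number, I would time the whole check call rather than read tok/s off a benchmark tool. Below is a minimal timing harness, assuming the Sentinel is wrapped in some callable `check(text)`. That callable, the run counts, and the nearest-rank p95 are my choices for the sketch; the same harness can be pointed at a CPU-only and an iGPU configuration.

```python
import statistics
import time
from typing import Callable


def measure_latency(check: Callable[[str], object], sample: str,
                    runs: int = 20, warmup: int = 2) -> dict:
    """Time repeated calls to check(sample) and report p50/p95/max in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    for _ in range(warmup):  # warm caches / lazy model loading before timing
        check(sample)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        check(sample)
        timings.append(time.perf_counter() - start)
    timings.sort()
    # Nearest-rank 95th percentile using integer math: ceil(0.95 * n) - 1.
    p95_index = (95 * len(timings) + 99) // 100 - 1
    return {
        "p50_s": statistics.median(timings),
        "p95_s": timings[p95_index],
        "max_s": timings[-1],
    }


# Usage with a stand-in check; a real one would call the local runtime.
print(measure_latency(lambda text: len(text.split()), "example model output"))
```

Reporting p95 and max alongside the median matters here because a sidecar that is usually fast but occasionally stalls still blocks the release decision.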
9. A concrete version of the architecture
I would build the practical version like this:
    main LLM on RTX GPU
            |
            v
    streamed output buffer
            |
            +--> deterministic validators on CPU
            |      - JSON parser
            |      - schema validation
            |      - regex
            |      - repetition detection
            |      - timeout/watchdog
            |
            +--> optional small guard model on iGPU or CPU
                   - instruction drift
                   - safety category
                   - prompt/tool plausibility
                   - short verdict only
Possible verdict format:
    {
      "verdict": "pass",
      "risk": "low",
      "reason": null
    }
or:
    {
      "verdict": "fail",
      "risk": "medium",
      "reason": "instruction_drift",
      "action": "retry_with_stricter_format"
    }
The small model should not produce long explanations. It should produce short, boring, machine-readable verdicts.
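Put together, the control flow could look roughly like the sketch below: cheap deterministic checks run first and short-circuit, and the guard model is only consulted if they all pass. The `run_sentinel` function, the check signature, and the stubbed guard model are hypothetical; a real guard would call whatever local runtime hosts the small model, and the `"medium"` risk for deterministic failures is just a placeholder.

```python
import json
from typing import Callable, List, Optional

Check = Callable[[str], Optional[str]]  # returns a failure reason, or None if OK


def check_json(output: str) -> Optional[str]:
    """Deterministic check: the output must be parseable JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return "invalid_json"
    return None


def run_sentinel(output: str, checks: List[Check],
                 guard_model: Optional[Callable[[str], dict]] = None) -> dict:
    """Run deterministic checks first, then the optional guard model."""
    for check in checks:
        reason = check(output)
        if reason is not None:
            # Fail fast: no need to spend guard-model time on broken output.
            return {"verdict": "fail", "risk": "medium", "reason": reason,
                    "action": "retry_with_stricter_format"}
    if guard_model is not None:
        return guard_model(output)
    return {"verdict": "pass", "risk": "low", "reason": None}


def stub_guard(output: str) -> dict:
    """Stand-in for the small guard model on the iGPU or CPU."""
    return {"verdict": "pass", "risk": "low", "reason": None}


print(run_sentinel('{"answer": 42}', [check_json], stub_guard))
print(run_sentinel('{"answer": ', [check_json], stub_guard))
```

Ordering the layers this way also keeps the benchmark honest: the guard model's latency only counts for outputs that the deterministic layer could not already reject.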
10. My practical take
I think the proposal is directionally interesting.
I would just make the claim narrower:
| Question | My answer |
|---|---|
| Is this normal parallel inference? | Not really |
| Is it a plausible sidecar validator? | Yes |
| Should arbitrary iGPUs be included? | No |
| Is CPU baseline important? | Very important |
| Is AMD 780M/890M class interesting? | Yes |
| Is Intel Iris Xe/Core Ultra Arc interesting? | Yes, but backend-sensitive |
| Is normal AM5 Ryzen 2-CU iGPU interesting? | Probably not |
| Should JSON/schema/regex be done by the small LLM? | No |
| Should fuzzy semantic checks use a small model? | Yes, possibly |
| Should this be trusted without benchmark data? | Not yet |
My summary:
This seems plausible as a sidecar guardrail pipeline, not as general parallel inference. The hardware boundary should be “modern local-LLM backend support plus a CPU-baseline win,” not just “has an iGPU.” Use deterministic code for deterministic validation, and reserve the small model for fuzzy semantic checks. If the intended 1B–3B Sentinel model does not run faster on the iGPU than on the CPU alone, the iGPU may still be useful for CPU isolation, but it should not be described as a speed win.