{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreidhc2o6pk2skrzroizxjj7mavijwfqvxopjg4ocslqapeoqtvtwmq",
"uri": "at://did:plc:i2ne3m5q6oq4jcnvn4k55skm/app.bsky.feed.post/3mo6ltobov6s2"
},
"coverImage": {
"$type": "blob",
"ref": {
"$link": "bafkreie2usbvokwkpwi6msnsvjiq5wallnrtb7cb6bpux4vwttq7bzi3wm"
},
"mimeType": "image/jpeg",
"size": 40462
},
"description": "Today it's LLMs. Yesterday it was CDNs. Yesteryears long gone, Big Iron lore.",
"path": "/performance-engineering-evaluative-metrics/",
"publishedAt": "2026-06-13T15:43:01.000Z",
"site": "https://prose.winterschon.com",
"textContent": "## LLM Inference\n\nSector | Common metrics | Less-known / specialist metrics\n---|---|---\nGP-GPU Service Infrastructure | TTFT, ITL/TPOT, E2E request latency, TPS, RPS/QPS, Queue-Fill, KV/C metrics | request_queue_time, request_prefill_time, request_decode_time, inflight time, SM occupancy, HBM/DRAM memory throughput\nAPI Service Infrastructure | p50/p95/p99 latency, streaming TTFB/TTFT, RPS/QPS, error rate, 429 rate, active downstream connections, backend latency | x-envoy-upstream-service-time, upstream_rq_time, TR/Tw/Tc/Tr/Ta timer split in HAProxy, retry rate, queue-slot saturation, circuit-breaker usage\nNetwork Hardware + Protocol Infra | RTT (round-trip time), one-way latency, jitter/PDV (packet delay variation), PPS/BPS, packet loss, retransmits | ECN mark rate, PFC (priority flow control) pause events, microburst depth, CQE (completion queue entry) compression state, SymbolErrors, LinkRecoveries, PTP offset/path delay\nPrompt Caching, Compute + Re-Compute | cache hit rate, cached_tokens, prompt_tokens, generation_tokens, TTFT reduction, cache read/write counts | partial-hit ratio, cache-miss ratio, prefix-cache block hit rate, predicted KV hit rate, per-request routing overhead\nPrompt Caching, Storage Infra | cache hit ratio, eviction rate, cache latency, TTL, GPU/CPU cache usage | CPU-vs-GPU cache-hit split, priority-based eviction effect, TTL refresh behavior, KV-block fragmentation / block reuse\nPrompt Caching, API + Load-Balancers | per-route/model hit rate, active and pending requests, backend latency, request rate, retries, error rate | cache-aware routing predicted KV hit rate, routing overhead, queue time before backend selection, circuit-breaker saturation, connection-slot pressure\n\n### Acronyms\n\n * TTFT == Time to First Token\n * ITL == Inter-Token Latency\n * TPOT == Time per Output Token\n * E2E == end-to-end\n * TPS == tokens/sec\n * RPS == requests/sec\n * QPS == queries/sec\n * KV/C == Key-Value Cache\n\n\n\n## LLM 
Dataset Training\n\nOperations | Common metrics | Less-known / specialist metrics\n---|---|---\nPre-loading datasets | ingest throughput (docs/s, rows/s, bytes/s, tokens/s), tokenization throughput, DataLoader throughput, dedup ratio | fuzzy-duplicate rate, semantic-duplicate rate, contamination rate/AUC, consumer lag, shuffle spill bytes\nPre-training + MoE | tokens/s, step time, MFU (model FLOPs utilization), training loss/perplexity, all-to-all/dispatch time, GPU memory | auxiliary load-balancing loss, router z-loss, expert capacity factor, token-drop rate, per-layer expert imbalance\nPost-training + MoE | step time, samples/s or tokens/s, train/val loss, reward/preference accuracy, KL divergence, MFU | chosen vs rejected reward margin, chosen/rejected rewards, router aux/z loss during alignment, expert saturation under small-batch tuning\n\n## HPC + HFT Analytics Infrastructure\n\nSector | Common metrics | Less-known / specialist metrics\n---|---|---\nVirtual Machine Clusters | p50/p99 service latency, jitter, CPU ready time, throughput, packet loss | CPU co-stop, NUMA locality (numa_hit/numa_miss/local_node/other_node), vCPU scheduling contention, latency-sensitivity effectiveness\nBare-Metal Systems | wire-to-wire latency, p99/p999 jitter, PPS/Mpps, cycles, instructions, LLC-load-misses, branch-misses, RX/TX drops | NUMA locality, interrupt coalescing, CQE compression state, packet pacing, driver extended stats / ring stress\nContent Delivery Networking | cache hit ratio, TTFB, origin offload, request rate, error rate, egress bandwidth | shield-layer hit ratio, origin_ttfb, child/parent cache status, regional cache-performance variance\nLow-Latency Trade-Execution Networks | one-way latency, RTT, jitter, packet loss, order latency, feed latency | microburst depth, queue/buffer pressure, ECN mark rate, PFC pause events, path asymmetry, PTP offset\nDark Fiber Regional Network Infra | one-way latency, RTT, availability, BER, pre-FEC BER, post-FEC BER | 
OSNR/ESNR, Q-factor, CD (chromatic dispersion), PMD (polarization mode dispersion), FEC degrade indicators\nQuantitative Research + Machine Learning | backtest wall-clock runtime, feed latency, order latency, feature-serving latency, feature-ingestion throughput | training-serving skew, feature inflight vs write-to-store success metrics, feature health/correctness monitoring\nData Analytics + Multivariate Analysis | job duration, stage/task duration, throughput, end-to-end delay, records-consumed-rate, consumer lag, shuffle read/write | spill bytes, skew via task-duration discrepancy or shuffle-read imbalance, straggler share, input-pipeline prefetch effectiveness\n\n## SLA + SLO Reliability, Telemetry, Alerting\n\n * Predefined latency/error/throughput SLIs and error budgets require burn-rate alerting\n * Prometheus and Alertmanager define scrape and notification timing controls\n * OpenTelemetry defines histograms and exemplars\n * Apdex is a standard user-satisfaction score\n * Elastic APM measures application performance traces\n\nSector | Common metrics | Less-known / specialist metrics\n---|---|---\nSLA + SLO Monitoring, Telemetry, Alerting Infra | availability SLI, latency SLI, error rate, throughput, Apdex, error-budget burn rate | multi-window multi-burn-rate alerts, scrape_duration / scrape_timeout ratio, group_wait / group_interval / repeat_interval, histogram bucket design, exemplars\n\n## Shared Analytics of Interest\n\nMetric | Signal | Typical domains\n---|---|---\nQueue time | Separates saturation from raw compute/network slowness | LLM serving, API gateways, load balancers, HFT\nPrefill vs decode split | Distinguishes prompt-processing bottlenecks from token-generation bottlenecks | LLM GPU serving\nPrefix/KV cache hit rate | Direct proxy for avoidable recompute and TTFT improvement | LLM serving, agentic systems\nAuxiliary MoE loss and router z-loss | Early warning for expert imbalance and routing instability | MoE training\nCPU ready / co-stop / 
NUMA miss | Often the real cause of inconsistent latency in virtualized clusters | VM-based HFT / HPC\nMicroburst depth and ECN/PFC behavior | Reveals congestion that average bandwidth hides | Low-latency Ethernet / RoCE fabrics\nOSNR / CD / PMD / pre-FEC BER | Core optical-health indicators long before full link failure | Dark fiber / coherent optics\nSpill bytes and consumer lag | Early warning for data-path backpressure and skew | Big-data pipelines\nBurn rate and exemplars | Better operational signal than raw alert count or average latency | SLO / observability stacks",
"title": "Performance Engineering - Evaluative Metrics",
"updatedAt": "2026-06-13T15:43:01.855Z"
}