External Publication

Show & Tell: TAF Agent v0.7 — 14 browser-only diagnostics for transformer LLMs (anti-bullshit pack)

Hugging Face Forums [Unofficial] May 7, 2026

# Show & Tell: TAF Agent v0.7 — 14 browser-only diagnostics for transformer LLMs (anti-bullshit pack)

TL;DR — A free, no-signup, browser-only tool that calls bullshit on common LLM-eval lies: misleading max_position_embeddings, silent chat-template halving in lm-eval-harness, hidden Chatbot Arena CIs, MMLU contamination priors, model-specific quant cliffs, and the NIAH-vs-reasoning gap.

Live : TAF Agent - a Hugging Face Space by karlexmarin

Source : GitHub - karlesmarin/tafagent: Transformer LLM diagnostic in your browser. Free, unlimited, auditable. · GitHub

Paper : [Marin 2026 — Predicting How Transformers Attend]( Predicting How Transformers Attend Analytic Power-Law Theory, Phase Transitions, and Practical Compression Tools )

-–

## What it is

I built a single static HTML+JS page that ships 14 diagnostic modes for transformer LLMs. The premise is simple: a lot of the things the community routinely complains about — leaderboard contamination, model-card lies, framework drift, quantization cliffs — are diagnosable from metadata alone (config.json, tokenizer_config.json, published vote counts), without spinning up a GPU or running inference.

Everything runs in your browser. Your inputs never leave the tab. There is no server, no signup, no telemetry. The Python tools that some modes use run via Pyodide ; the math is deterministic.

It’s available in EN / ES / FR / ZH (685 i18n keys, parity-checked).

-–

## What’s new in v0.7 — the anti-bullshit pack

After surveying public HF Forum threads, GitHub issues, arxiv papers, and Reddit posts, I picked 10 community pain points and shipped browser-only solutions for 8 of them. (The remaining 2 — VRAM-formal-bound and pre-fine-tune forgetting forecast — are the v0.8 roadmap.)

### Unmask — does max_position_embeddings lie?

Paste an HF model id. The tool reads config.json and tells you whether the declared context is honest, inflated (SWA window restricts effective range), severely inflated (Mistral-7B-v0.1 declares 32k but attends ~4-8k), or YaRN-extended (factor + original-pe).

Pre-flight verdicts on real public models:

mistralai/Mistral-7B-Instruct-v0.3 → HONEST 32k (v0.3 dropped SWA; v0.1 was the SWA-confused release)
microsoft/Phi-3-mini-4k-instruct → INFLATED (sliding-window=2047, hidden in config)
deepseek-ai/DeepSeek-V2.5 → YARN-EXTENDED (factor=40×, 4k → 163k)

### Chat-template Sniffer

Detects which template family (Llama-3 / ChatML / Mistral / Gemma / Phi-3 / DeepSeek / Alpaca / custom / none) by reading tokenizer_config.json and gives you the exact CLI flag for lm-evaluation-harness, vLLM serve, and transformers. This solves [lm-eval-harness issue #1841]( Inconsistent evaluation results with Chat Template · Issue #1841 · EleutherAI/lm-evaluation-harness · GitHub ) — the one where forgetting –apply_chat_template silently halves multi-turn accuracy.

To my knowledge, no other public tool diffs the apply path and gives per-framework commands.

### Arena-Elo CI Reconstructor

[The Leaderboard Illusion]( [2504.20879] The Leaderboard Illusion ) (Apr 2025) diagnosed Chatbot Arena gaming and pointed out that public CIs are stripped. Paste a CSV of pairwise votes (model_a, model_b, winner) and the tool runs Bradley-Terry MLE + 200-iteration bootstrap and tells you which model pairs are statistically tied (CIs overlap). Has a “Load sample” button with synthetic 6-model data so you can see it work without hunting raw battle logs.

### Contamination Prior

Built-in DB of 20 popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, MMLU-Pro, GPQA, AIME 2024, BBH, MUSR, RULER, etc.). Enter your model’s training cutoff date — get a Bayesian prior of contamination per benchmark based on time gap, corpus inclusion, and known leak history. Llama-3.1 (cutoff 2023-12) on MMLU returns ~97% prior. Same model on AIME 2024 returns ~5%.

This complements GPU-bound detectors like CoDeC and Min-K% — it’s the pre-flight risk score , not a post-hoc detection.

### Quant-regime Classifier

Predicts γ-shift and ΔPPL for 10 quantization schemes (FP8, int8, GGUF Q8_0/Q5_K_M/Q4_K_M/Q3_K_M/Q2_K, AWQ, GPTQ, NF4) on a per-model basis. Architecture-aware: small d_head + aggressive GQA increases sensitivity; calibrated schemes (AWQ) absorb shift better than uncalibrated (NF4).

Pre-flight on mistralai/Mistral-7B-Instruct-v0.3:

AWQ → mild (γ-shift +0.023, ΔPPL ~0.01)
NF4 → CLIFF (γ-shift +0.081, ΔPPL ~0.06)

Recommends a switch when it detects a cliff.

### Cross-framework Drift Bound

Same model, different scores on different setups. Paste both with (framework, dtype, batch, chat-template applied?). Tool predicts the maximum drift admissible from numerical noise (additive: dtype-pair penalty + framework kernel diff + batch-ratio + 0.3-pt non-determinism floor). If observed gap exceeds it → real bug, usually chat-template mismatch (most common) or KV-cache layout. References: [arxiv 2506.09501]( [2506.09501] Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference ) on FP32-on-the-fly reproducibility.

### NIAH → Reasoning Gap Predictor

[RULER paper]( [2404.06654] RULER: What's the Real Context Size of Your Long-Context Language Models? ) showed long-context models often pass needle-retrieval but fail multi-hop reasoning at the same context. The HELMET work confirmed synthetic NIAH doesn’t predict downstream. This mode predicts both pass rates from architecture (γ_Padé + d_horizon + arch pressure), reports the gap, and finds your model’s “safe reasoning context” where reasoning stays ≥ 65%.

Pre-flight on meta-llama/Llama-3.1-8B-Instruct (claimed 128k, RoPE θ=500k, GQA 8/32):

@ 8k: NIAH 100% · Reason 94% → ROBUST
@ 64k: NIAH 100% · Reason 94% → ROBUST
@ 128k: NIAH 98% · Reason 92% → ROBUST
Curve drops past T_train (4× extrapolation = ~30% NIAH penalty)

-–

## Formal verification

There’s a companion repo at GitHub - karlesmarin/lean-taf: Lean 4 + Mathlib formalization of TAF algebraic identities + Cv Hagedorn erratum (Marin 2026) · GitHub with 37 theorems machine-proven in Lean 4 + Mathlib4 (1973 build jobs). Identities like β·χ = −1 (Anti-Ising closure), D-SAGE-1 quadratic, Padé z-substitution. Each badge in the TAF Card links to the source line. Includes one substantive finding — a factor-2 inconsistency in the paper’s own V/β formula tables (formally proved in V_derivative_ne_RG_beta).

Anyone can clone + lake build to re-verify in ~5 seconds after Mathlib cache fetch.

-–

## Honest limits

It predicts; it doesn’t measure. Verdicts are heuristics calibrated against published RULER / Grootendorst / arxiv data. For ground truth you still need a GPU.
Some modes use sample data (Arena CI’s bundled 6-model fixture) because raw Arena battle logs aren’t always public.
Quantization predictor is calibrated to publicly-reported PPL drops; novel architectures may sit outside the band.
Contamination prior is a Bayesian prior, not a detector — pair with CoDeC/Min-K%/PaCoST when you have GPU access.

I would rather call out limitations honestly than oversell. If the tool is wrong about your model, please tell me — refutations are taken as seriously as confirmations.

-–

## Why I built this

A lot of v0.7 came from one observation: there’s a paper trail of community frustration about each of these issues, but the existing solutions (RULER, CoDeC, Min-K%, LayerCast, HELMET) are all GPU-bound research artifacts , not tools you reach for at 11 PM when you’re trying to decide whether to buy compute for a model. A browser-only “predict before you spend” layer felt missing.

That said — TAF Agent doesn’t replace any of those tools. It’s the pre-flight check before you bring out the heavy artillery.

-–

## How you can help

Falsify a verdict. Run the tool, then run RULER / lm-eval / your downstream task. If we disagree with reality on a specific model, [open an issue]( Issues · karlesmarin/tafagent · GitHub ) with the model id + your numbers — that’s gold for calibration.
Suggest a benchmark for the contamination DB. If a benchmark you care about isn’t in the 20 we cover, add it.
Translate. EN/ES/FR/ZH covered; PRs welcome for more.

Built by one independent researcher with no funding, no team, and no GPUs beyond a single consumer card. The work itself belongs to the commons that made it possible.

-– Carles Marin

-–

If you find a real bug, email me or open an issue — I treat refutations as gifts.

Discussion in the ATmosphere