{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreib3oy6yw2dxj5ilqvbd2oynqg4vh3h6iipudj32wl3k7dwluphzg4",
    "uri": "at://did:plc:svkyjirwpd7ts4qgnzoqfcc2/app.bsky.feed.post/3mnhumfzckab2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreibwbof5pzx6bc4qiz6katmh463uxuf3n5ednxhkf3xdz2w2h4ceri"
    },
    "mimeType": "image/png",
    "size": 62667
  },
  "description": "An AI-powered WordPress sidebar that scores your content for SEO and generative engine discoverability. Built on wp-ai-client, it collects 14 signals from each post, sends a single prompt to Claude Haiku, and returns scores, metadata suggestions, and actionable flags for about a penny per analysis.",
  "path": "/2026/03/02/scoring-the-scorer-benchmarking-an-ai-content-evaluator/",
  "publishedAt": "2026-03-02T15:22:33.000Z",
  "site": "at://did:plc:svkyjirwpd7ts4qgnzoqfcc2/site.standard.publication/3mhpwfentz6lr",
  "tags": [
    "AI",
    "Core",
    "Plugin"
  ],
  "textContent": "I’ve been following what Automattic’s AI team has been building, and a feature called wp-ai-client is shipping in WordPress 7.0 core (due 9th April) feels like a genuine milestone. It means any plugin can make an AI call with a few lines of code, no API key management, no provider lock-in. I wanted to see what that looks like in practice, so I built a sidebar plugin that scores content for SEO and generative engine discoverability. Standing on wp-ai-client The reason this plugin exists at all is WordPress 7.0’s wp-ai-client (currently in beta), a unified AI infrastructure that will ship in core. Rich Tabor’s write-up explains the thinking: the same way WordPress fixed database access with $wpdb and HTTP requests with wp_http, wp-ai-client gives every plugin a single way to call any AI model without managing API keys, HTTP transports, or provider differences. For dgwltd-evaluate (the initial plugin name), this meant I could write wp_ai_client_prompt($prompt)->as_json_response($schema)->generate_text_result() and not care whether the model behind it is Anthropic, Google, or OpenAI. The credential management is centralised – one API key configured once, shared across every AI plugin on the site. The practical impact: the entire AI integration in dgwltd-evaluate is about 15 lines of code. Collect signals, build a prompt string, define a JSON schema, call the API, validate the response. Everything else – the sidebar UI, the caching, the Apply buttons – is standard WordPress plugin code. wp-ai-client turned what would have been a complex API integration into a single method chain. The problem with vibes-based validation My WordPress plugin uses AI to score content on two metrics: SEO (0–100) and GEO (0–100). GEO – Generative Engine Optimisation – measures how well AI tools can understand, summarise, and cite your content. It’s the new dimension that traditional SEO tools don’t touch. The plugin sits in the block editor sidebar. 
Press Analyse, and Claude Haiku reads your post’s signals – title length, heading structure, internal links, image alt text, taxonomy terms, meta description – then returns scores, metadata suggestions, and flags. About a penny per analysis. The problem: when you build a scoring rubric and hand it to an AI, how do you know the scores are right? You can spot-check a few posts and think “yeah, 72 feels about right for that one,” but that’s vibes-based validation. It doesn’t scale, it doesn’t catch drift, and it definitely doesn’t tell you whether your rubric is penalising the right things by the right amount. The synthetic benchmark approach The answer is controlled test data – posts where you know what score each one should get because you designed the signals deliberately. I built 18 synthetic posts across four groups: Quality tiers (5 posts): A “perfect” post with every signal lit up: 800 words, proper H2→H3 hierarchy, internal links, images with alt text, meta description, taxonomy terms, cited sources, a TL;DR. Then a “good” post with minor gaps, an “average” post that’s mediocre, a “weak” post with vague content and zero signals, and a “terrible” post that’s literally “test post please ignore.” SEO dimension isolation (6 posts): Each is a copy of the “perfect” post with exactly one SEO dimension removed. One has no meta description. One has no internal links. One has no images. One skips heading levels (H2→H4, no H3). One has a 103-character title. One has no taxonomy terms assigned. If the AI is correctly weighting each dimension, each should score roughly 10 points below the perfect baseline. GEO dimension isolation (4 posts): Same idea but for generative engine signals. One post is well-written but cites zero sources. One uses only vague references (“the tool”, “the framework”) instead of named entities. One is a wall of text with zero headings or lists. One is packed with statistics, named sources, and pithy quotable conclusions. 
Edge cases (3 posts): A 2000-word post, a post with four images all missing alt attributes, and a post with a slug longer than 60 characters. All the content is realistic web development prose – not lorem ipsum. The internal links use relative URLs (/about/, /services/) because the plugin’s link counter treats relative URLs as internal. The images point to placeholder paths because the signal extraction is regex-based and doesn’t verify file existence. These are implementation details that matter when you’re designing test data. What the signals actually look like The plugin’s Analyser class collects 14 signals from each post: title, content (plain text), slug, excerpt, featured image alt, meta description, word count, image count, images without alt, internal link count, heading structure, schema plugin, taxonomies, and available terms. Heading structure is extracted as a string, “H2, H3, H3, H2, H3”, so the AI can assess hierarchy. Internal links are counted by regex: any <a href=\"...\"> where the URL either has no host (relative) or matches the site’s domain. Images without alt are specifically images missing the alt attribute entirely. This is all fed into a prompt alongside the scoring rubric, and the AI returns structured JSON with scores, metadata, taxonomy suggestions, and flags. The first run was humbling – but for the wrong reason I expected my “perfect” post to score 85–95 on SEO. It scored 72. The relative ordering looked right: quality tiers ranked perfectly from top to bottom. But the absolute SEO scores were systematically ~13 points below expectations. My first instinct was a rubric calibration problem, the AI scoring too harshly. Then I looked closer. The meta description was empty. The excerpt was empty. The taxonomy terms were missing. The “perfect” post wasn’t actually perfect. The benchmark script had silently failed to set those signals. 
This is a core lesson about benchmarking: your test harness can corrupt your test data in ways that look like model problems. The actual results After the fix: Post | SEO | GEO | What I expected; Perfect | 78 | 82 | SEO 85–95, GEO 80–90; Good | 68 | 72 | SEO 65–75, GEO 55–65; Average | 42 | 38 | SEO 35–45, GEO 25–35; Weak | 22 | 18 | SEO 10–20, GEO 10–20; Terrible | 8 | 5 | SEO 0–10, GEO 0–10. Quality tiers rank correctly. SEO is still slightly below my upper expectations, but GEO is well-calibrated, and the relative penalties are consistent. The SEO dimension isolation results: What’s missing | SEO | Expected drop from 78; No meta description | 72 | ~6pts ✓; No internal links | 68 | ~10pts ✓; No images | 72 | ~6pts; Bad heading hierarchy | 72 | ~6pts ✓; Long title (103 chars) | 62 | ~16pts (harsh); No taxonomy | 72 | ~6pts. The AI penalises each missing dimension. Long titles get hit harder than expected. The rubric may be double-penalising for both the score dimension and flagging it. The GEO wall-of-text surprise The biggest surprise was benchmark-geo-no-structure, a deliberate wall of text with zero headings, zero lists, just two massive paragraphs. I expected GEO 35–60. It scored 78. The AI valued the content’s named entities (Steven Champeon, Jeremy Keith, W3C), cited claims (HTTP Archive statistics), and intellectual depth more heavily than its formatting structure. In hindsight, this might be correct for real GEO signal – a brilliantly written wall of text is probably more useful to an AI than a perfectly structured post with no substance. But if the plugin is supposed to reward and encourage good structure, the rubric needs to make that explicit. Tuning the rubric in one sentence I added a single guidance sentence to PromptTemplate::scoring_rubric(): “GEO structure guidance: content with no headings or lists can score at most 5/15 for structure. 
Well-structured content uses H2/H3 hierarchy, bullet/numbered lists, and clear section breaks that allow AI systems to extract discrete facts.” Result: Post | GEO before | GEO after; geo-no-structure | 78 | 58; perfect | 82 | 82; geo-high-quotability | 87 | 87. The wall of text was penalised. Well-structured posts were unaffected. The rubric changed; the model logic followed. Then I ran a stability test, the same post analysed 5×, and got zero variance. SEO 78, GEO 82, Overall 80 on every run. Anthropic’s constrained decoding on the JSON schema, plus Haiku’s deterministic tendencies, means rubric changes produce real signal with no noise. After the rubric update: 18/18 benchmarks PASS. Total cost per full run: ~$0.18. What I learned about testing AI scoring systems The biggest takeaway is to check your scaffolding before blaming the rubric. I spent time thinking the AI was scoring too harshly when the real problem was a shell script printing to stdout. I also stopped worrying about absolute calibration. If “perfect” scores 78 instead of 90 but the quality tiers rank correctly from top to bottom, the rubric is working. The thing to panic about is rankings inverting, not the ceiling being lower than you’d like. The dimension isolation technique turned out to be the most useful part of the whole exercise. Taking a “perfect” post and removing exactly one signal tells you precisely how much weight the AI gives each dimension. It’s how I caught the long-title over-penalty and the wall-of-text under-penalty. And the stability results were a genuine surprise. I ran the same post five times and got identical scores every time. Zero variance. Most AI outputs have some randomness, so you can never be sure if a score changed because of your rubric edit or just model noise. With zero variance, every change is causal – edit the rubric, re-run, and the difference is entirely down to what you changed. At $0.18 per full run, that makes iteration very cheap. 
The honest caveat is that a score is only as good as the assumptions behind it. SEO scoring has decades of reverse-engineering behind it, and even then nobody really knows how Google weighs things internally. GEO is much newer than that. Tools like Known Agents and Semrush’s AI Visibility feature are starting to surface how LLMs actually consume web content, but we’re still guessing at a lot of it. A well-structured post with cited sources probably helps an AI summarise you accurately – but “probably” is doing heavy lifting in that sentence. The benchmarks validate that the rubric is internally consistent. Whether those scores translate to better real-world outcomes is a different question, and one nobody can fully answer yet. Not everything needs AI The benchmark process also clarified which checks shouldn’t involve AI at all. If a post has no excerpt, that’s a fact, the AI doesn’t need to “decide” it. If images are missing alt attributes, a regex can count them cheaper and more reliably than a language model. So I pulled five conditions out into deterministic flags, checks that always fire when the condition is true, independent of any AI call: word count below 300; no excerpt set; no meta description; images missing alt attributes; no taxonomy terms assigned. These run before the AI flags and get prepended to the results. The AI still adds contextual flags (“your meta description should mention WCAG 2.1 AA”), but the binary yes/no checks are now guaranteed consistent. It’s a small architectural decision, but it removes the last category of scoring variance from the system. A UI shell around a single AI call Stepping back from the benchmarking, the thing I keep coming back to is how little code is actually here. The entire plugin is a UI shell around a single AI call. Collect signals, build a prompt, send it, display the result. That’s it. WordPress enables so many things now with packages and core functionality, a lot of this is now just plumbing. 
11 PHP classes, 6 React components, one CSS file. The Analyser collects signals and calls the AI. The PromptTemplate assembles the rubric and schema. The RestController routes three endpoints. The sidebar renders scores, suggestions, and flags. There’s no background processing, no database tables, no cron jobs, no admin dashboard. Just a button that says Analyse. This architecture feels like a pattern worth repeating. An accessibility auditor, a readability scorer, a content freshness checker – they’re all the same shape: collect signals from a WordPress entity, send a structured prompt, get JSON back, display it in the sidebar with Apply buttons. The parts that change are the signals you collect, the rubric you write, and the schema you define. Everything else – the cache, the REST layer, the sidebar state machine, the dual-write Apply pattern – is identical.",
  "title": "Scoring the Scorer: Benchmarking an AI Content Evaluator",
  "updatedAt": "2026-04-24T09:36:15.000Z"
}