I built a snark detector and pointed it at myself
Justin Stanley
June 22, 2026
> Difficile est saturam non scribere. — Juvenal
> ("It is difficult not to write satire.")
I keep a social account that's pure relief valve — dry, sarcastic — so I pointed
some tooling at myself: can a local model measure my sarcasm over time? Not
vibes — a number I can argue with.
This isn't my first sarcasm detector. I've built them since grad school, back when
"language model" meant feature engineering and a lot of praying over a corpus.
Sarcasm was the hard problem — the words betray the meaning, and the old tools
only read words. So this wasn't about the snark; it was a test of how much easier
that's gotten, and where it hasn't.
The metric: words vs. meaning
Off-the-shelf sentiment analysis reads the literal words. Point a classic lexicon
scorer at "Thanks Biden." and it cheerfully calls it positive. Run it across a few
thousand of my posts and I come out sunny and well-adjusted, which anyone who's
read them knows is wrong.
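To make the failure concrete, here is a minimal sketch of a lexicon scorer. The word list and the `LiteralScore` name are made up for illustration and don't come from any particular library; real lexicons are far larger and use graded weights.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// lexicon is a tiny, made-up word list for illustration only; real
// lexicons carry thousands of entries with graded weights.
var lexicon = map[string]float64{
	"thanks": 1, "great": 1, "love": 1,
	"awful": -1, "hate": -1, "broken": -1,
}

// LiteralScore averages the weights of the words it recognizes and
// returns 0 when it recognizes none. It never looks at context, so
// irony is invisible to it.
func LiteralScore(text string) float64 {
	// Lowercase, then split on anything that is not a letter.
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	var sum float64
	var hits int
	for _, w := range words {
		if weight, ok := lexicon[w]; ok {
			sum += weight
			hits++
		}
	}
	if hits == 0 {
		return 0
	}
	return sum / float64(hits)
}

func main() {
	fmt.Printf("%+.1f\n", LiteralScore("Thanks Biden.")) // prints +1.0
}
```

The only word it recognizes is "thanks", so the post scores as fully positive no matter who is being thanked or why.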
The signal I cared about lives in the gap between what the words say and what I
mean. So I had the model score both:
- vlit — face value, sarcasm ignored
- v — intended emotion, sarcasm flipped
The difference, vlit − v, is the Snark Index: how much sunnier I read than I
mean. That one number turned out to be the whole project.
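The arithmetic is small enough to sketch. The `Score` type, its field names, and the sample values below are assumptions for illustration, as is the idea that both valences share one signed scale; the post doesn't state the range.

```go
package main

import "fmt"

// Score holds the two valences the model returns for one post.
// Field names are hypothetical; both values are assumed to be on
// the same signed scale.
type Score struct {
	VLit float64 // face value, sarcasm ignored
	V    float64 // intended emotion, sarcasm flipped
}

// SnarkIndex is vlit - v: positive when the words read sunnier
// than the intent behind them.
func (s Score) SnarkIndex() float64 { return s.VLit - s.V }

// MeanSnarkIndex averages the index over a set of posts.
// ok is false when there is nothing to average.
func MeanSnarkIndex(scores []Score) (mean float64, ok bool) {
	if len(scores) == 0 {
		return 0, false
	}
	var sum float64
	for _, s := range scores {
		sum += s.SnarkIndex()
	}
	return sum / float64(len(scores)), true
}

func main() {
	// Made-up values: one sarcastic post, one sincere one.
	scores := []Score{
		{VLit: 0.5, V: -0.5},  // reads cheerful, means the opposite
		{VLit: 0.25, V: 0.25}, // reads as it means
	}
	mean, _ := MeanSnarkIndex(scores)
	fmt.Printf("%+.2f\n", mean) // prints +0.50
}
```

A sincere post contributes zero, so the average is driven entirely by the posts where words and intent disagree.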
{{< figure src="/img/snark-index.png" alt="The Snark Index charted over time" caption="The Snark Index over time — how much sunnier the words read than I mean them." >}}
The build
It all runs on my lab network. The Mac is the lab machine — it has the GPU, so it
runs the model (a local qwen3 32B; no API bills, nothing leaves the house). The
NAS is the storage and compute layer — Postgres for the data, Grafana for the
charts. The split fell out of that: scoring on the Mac, storage and charts on the NAS.
The scorer connects to Postgres across the lab network and writes directly —
Postgres handles the concurrent access, so there's no relay in between. It's all
Go, two small binaries — one that pulls new posts and rolls up daily stats, one
that scores — plus a Grafana dashboard. Parameterized by handle, runs nightly
without me.
{{< figure src="/img/snark-dashboard.png" alt="The Grafana dashboard" caption="The dashboard: Snark Index, intended-vs-literal valence, sarcasm rate, and topic." >}}
What it found
Across about 3,700 posts and three years: roughly one in four is sarcastic,
and the Snark Index lands solidly positive, around +0.3. I read sunnier than
I mean — my actual intent runs a touch negative (deadpan does that), while the
words on their own scan cheerier. The model gives me more credit for warmth than
my delivery earns. Make of that what you will; I haven't decided what I make of it.
The part that's still hard
The "much easier" story has a catch. Getting from detecting sarcasm
to scoring what I meant took three tries, and the smartest one lost.
Try one flagged the irony correctly — it knew "thank Big Brother for raising
the chocolate ration" was sarcastic — and then logged it as cheerful anyway.
Right flag, wrong feeling. I only caught it because the "positive" pile was
suspiciously full of sarcasm.
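That smell test can be turned into a check. This is a sketch under assumptions: the `Post` type, its fields, and the sample values are hypothetical, and a high share is only a hint, since a sincere post can still wear a sarcastic format.

```go
package main

import "fmt"

// Post is a hypothetical scored post.
type Post struct {
	Text      string
	Sarcastic bool    // the model's irony flag
	V         float64 // the model's intended valence
}

// SarcasticShareOfPositive reports what fraction of the posts scored
// as positive (V > 0) were also flagged sarcastic. ok is false when
// no post scored positive.
func SarcasticShareOfPositive(posts []Post) (share float64, ok bool) {
	var positive, flagged int
	for _, p := range posts {
		if p.V <= 0 {
			continue
		}
		positive++
		if p.Sarcastic {
			flagged++
		}
	}
	if positive == 0 {
		return 0, false
	}
	return float64(flagged) / float64(positive), true
}

func main() {
	// Scores here are invented to reproduce the try-one failure.
	posts := []Post{
		{"thank Big Brother for raising the chocolate ration", true, 0.5},
		{"a made-up sincere post", false, 0.5},
	}
	if share, ok := SarcasticShareOfPositive(posts); ok {
		fmt.Printf("%.0f%% of the positive pile is flagged sarcastic\n", share*100)
	}
}
```

A share well above the overall sarcasm rate suggests the flag and the score are disagreeing somewhere upstream.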
Try two asked for two numbers, face value and intent, and that surfaced the
genuinely hard case: irony with a target. "Thanks Biden" isn't a sentiment, it's a
format. Sometimes it's blame. Sometimes it's a sincere thank-you wearing the
blame-meme as a costume — me actually crediting the tax credit that paid down the
solar panels, while mocking the people who say it straight. Same three words,
opposite meaning, and the only tell is what it's replying to.
Try three was me being clever: I taught the prompt about that nuance, expecting
sharper results. It got worse. Spelling out "sometimes sarcasm is sincere" made
the model gun-shy — its sarcasm detection collapsed from a quarter of posts to
about seven percent, and it started calling things like "truly, a stable genius.
that was sarcasm." positive. It read the words "that was sarcasm" and shrugged.
The blunt version — flag aggressively, then flip — beat the nuanced one outright.
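The blunt rule fits in a few lines. This sketch assumes "flip" means negating the literal valence exactly, which the post doesn't spell out; `IntendedValence` is a hypothetical name.

```go
package main

import "fmt"

// IntendedValence applies the blunt rule: trust the sarcasm flag and,
// when it is set, negate the literal valence. Exact negation is an
// assumption about what "flip" means here.
func IntendedValence(vlit float64, sarcastic bool) float64 {
	if sarcastic {
		return -vlit
	}
	return vlit
}

func main() {
	vlit := 0.5 // made-up literal score for a cheerful-sounding post
	v := IntendedValence(vlit, true)
	fmt.Printf("v=%+.1f snark=%+.1f\n", v, vlit-v) // prints v=-0.5 snark=+1.0
}
```

The rule has no way to represent the sincere "Thanks Biden", so it gets that case wrong by construction. It still came out ahead because it never talks the model out of flagging.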
Easier, not solved
That's the real finding, and it's a better one than any number on the dashboard.
The intelligence got cheap: a decade ago, getting a model to spot irony at all
was a thesis; today it's a prompt and a local GPU, and nearly all my effort went
into boring solved things — moving data, a schema, a dashboard. But the last mile
didn't move. Detecting that something is ironic is easy now. Reading who the
irony is aimed at — that's still judgment, still context, still the most human
thing in the pile. And it's humbling that you can make a model dumber by
explaining the hard part to it.
So the dashboard has opinions about me — mostly neutral, occasionally warm,
reliably grim about the news, and sunnier on the surface than underneath. Whether
it's right is its own argument, and one I haven't settled. But that it can
venture a read at all, sarcasm and all, is the thing. A decade ago that would have
been the whole project. Today it's the part that comes cheap — and the hard part
is exactly where it always was.