{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreifac2m5h2upnjemnygh6r7fs5gfea2fpipkgf7fddze5bw6hr4c2y",
"uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mp2vh42xeoc2"
},
"coverImage": {
"$type": "blob",
"ref": {
"$link": "bafkreifjvjuokvkzg2c3gw3kqqgrm4biyrxnwxgjs4sqtar3gvrvsxwjhq"
},
"mimeType": "image/webp",
"size": 44594
},
"path": "/aaroncarlisle94/i-built-a-00005-screenshot-cropper-that-saves-ai-agents-95-on-vision-llm-costs-2c41",
"publishedAt": "2026-06-24T21:20:23.000Z",
"site": "https://dev.to",
"tags": [
"ai",
"webdev",
"agents",
"blockchain",
"https://x402-vision-cropper.onrender.com/llms.txt"
],
"textContent": "If you're building AI agents that work with browser screenshots, you already know the pain.\n\nYou take a full 1920×1080 screenshot, pass it to GPT-4o or Claude, and watch your token bill climb — while the model downscales the image anyway and blurs the exact text you needed it to read.\n\nThere's a better way.\n\n## The problem\n\nVision LLMs are expensive for two reasons when you feed them full screenshots:\n\n 1. **Token cost** — a full screenshot can cost 10–20x more tokens than a small crop\n 2. **Accuracy loss** — models internally downscale large images, blurring fine text, labels, and UI elements\n\n\n\nBut your agent already knows _where_ to look. Browser automation tools like Playwright and Puppeteer give you `getBoundingClientRect()` — the exact pixel coordinates of any element on screen.\n\nSo why are you sending the whole screenshot?\n\n## The solution\n\nI built a stateless pay-per-use API that takes a screenshot and pixel coordinates, and returns just the cropped element as a lossless PNG — ready to pass directly to your vision LLM.\n\n\n\n POST /crop\n {\n \"image\": \"<base64 screenshot>\",\n \"x\": 120,\n \"y\": 45,\n \"width\": 640,\n \"height\": 80\n }\n\n\nReturns:\n\n\n\n {\n \"success\": true,\n \"data\": {\n \"base64\": \"iVBORw0KGgo...\",\n \"mime\": \"image/png\",\n \"width\": 640,\n \"height\": 80,\n \"bytes\": 4821\n }\n }\n\n\nA 4KB crop instead of a 2MB screenshot. Same information. 95% fewer tokens.\n\n## How payment works\n\nHere's where it gets interesting. The API uses the **x402 payment protocol** — HTTP's long-dormant 402 Payment Required status code, finally put to use.\n\nThere are no API keys. No accounts. No subscriptions. The agent pays $0.0005 USDC per crop on Base L2 automatically.\n\nThe flow:\n\n\n\n 1. Agent POSTs to /crop (no payment header)\n ← 402 with payment instructions in headers\n\n 2. Agent transfers 0.0005 USDC to recipient wallet on Base\n (near-zero gas, ~2 second settlement)\n\n 3. 
Agent POSTs again with x-payment-tx-hash header\n ← 200 with cropped PNG\n\n\nThe entire exchange is two HTTP requests and one on-chain transfer. No human intervention. No billing dashboard. The money lands directly in the operator's wallet on-chain.\n\n## Real agent integration\n\nHere's what using it looks like in a Playwright agent:\n\n\n\n import { chromium } from 'playwright';\n import OpenAI from 'openai';\n\n const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment\n const CROP_URL = 'https://x402-vision-cropper.onrender.com/crop';\n\n const browser = await chromium.launch();\n const page = await browser.newPage();\n await page.goto('https://example.com/dashboard');\n\n // Take the screenshot in memory (page.screenshot() resolves to a Buffer)\n const imageB64 = (await page.screenshot()).toString('base64');\n\n // Get element coordinates\n const rect = await page.$eval('.price-display', el => el.getBoundingClientRect().toJSON());\n\n // Build the request body once; it is sent unchanged on both attempts\n const body = JSON.stringify({\n image: imageB64,\n x: Math.floor(rect.x),\n y: Math.floor(rect.y),\n width: Math.floor(rect.width),\n height: Math.floor(rect.height),\n });\n\n // Probe the API for payment instructions\n const probe = await fetch(CROP_URL, {\n method: 'POST',\n headers: { 'Content-Type': 'application/json' },\n body,\n });\n\n // → 402 response with payment details in headers\n if (probe.status !== 402) throw new Error(`Expected 402, got ${probe.status}`);\n const recipient = probe.headers.get('x-payment-recipient');\n const amount = probe.headers.get('x-payment-price-usdc');\n\n // Pay on Base L2 (e.g. with viem); sendUsdc is a placeholder for your wallet logic\n const txHash = await sendUsdc({ recipient, amount });\n\n // Resubmit with payment proof\n const result = await fetch(CROP_URL, {\n method: 'POST',\n headers: {\n 'Content-Type': 'application/json',\n 'x-payment-tx-hash': txHash,\n },\n body,\n });\n\n const { data } = await result.json();\n await browser.close();\n\n // Pass the tiny crop to your vision LLM instead of the full screenshot\n const 
response = await openai.chat.completions.create({\n model: 'gpt-4o',\n messages: [{\n role: 'user',\n content: [\n { type: 'image_url', image_url: { url: `data:${data.mime};base64,${data.base64}` } },\n { type: 'text', text: 'What is the price shown?' }\n ]\n }]\n });\n\n\n## The architecture\n\nThe server is intentionally minimal:\n\n * **Fastify** on Node.js — low memory footprint\n * **Sharp** for image processing — in-RAM only, no disk writes\n * **Zero persistent storage** — every request is stateless, data exists only for the duration of the request\n * Runs on a **512MB single-CPU container** on Render\n\n\n\nThe entire codebase is about 400 lines across 7 files. No database. No session state. No auth layer beyond the payment itself.\n\n## Try it\n\nThe API is live now:\n\n\n\n # Check it's running\n curl https://x402-vision-cropper.onrender.com/health\n\n # Trigger the payment challenge\n curl -X POST https://x402-vision-cropper.onrender.com/crop \\\n -H \"Content-Type: application/json\" \\\n -d '{\"image\":\"'\"$(python3 -c \"print('A'*200)\")\"'\",\"x\":0,\"y\":0,\"width\":10,\"height\":10}'\n\n\nMachine-readable docs for agents: https://x402-vision-cropper.onrender.com/llms.txt\n\n## What I learned building this\n\n**x402 is genuinely exciting but very early.** The protocol works cleanly — payment instructions in headers, proof in the retry, settlement on-chain. But the agent ecosystem is still catching up. Most frameworks don't have native wallet support yet.\n\n**Stateless by design is underrated.** No database means no stored data to breach, a far smaller GDPR surface, no backup strategy, no connection pooling. Every request lives and dies in RAM. For a high-throughput API that processes sensitive screenshot data, this is the right architecture.\n\n**The unit economics make sense at scale.** At $0.0005 per crop, the service costs a rounding error compared to what it saves on vision tokens. 
The challenge isn't pricing — it's volume.\n\nIf you're building browser agents or anything that feeds screenshots to vision models, give it a try. And if you're building in the x402 / agentic payments space I'd love to hear what you're working on.",
"title": "I built a $0.0005 screenshot cropper that saves AI agents 95% on vision LLM costs"
}