Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifac2m5h2upnjemnygh6r7fs5gfea2fpipkgf7fddze5bw6hr4c2y",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mp2vh42xeoc2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreifjvjuokvkzg2c3gw3kqqgrm4biyrxnwxgjs4sqtar3gvrvsxwjhq"
    },
    "mimeType": "image/webp",
    "size": 44594
  },
  "path": "/aaroncarlisle94/i-built-a-00005-screenshot-cropper-that-saves-ai-agents-95-on-vision-llm-costs-2c41",
  "publishedAt": "2026-06-24T21:20:23.000Z",
  "site": "https://dev.to",
  "tags": [
    "ai",
    "webdev",
    "agents",
    "blockchain",
    "https://x402-vision-cropper.onrender.com/llms.txt"
  ],
  "textContent": "If you're building AI agents that work with browser screenshots, you already know the pain.\n\nYou take a full 1920×1080 screenshot, pass it to GPT-4o or Claude, and watch your token bill climb — while the model downscales the image anyway and blurs the exact text you needed it to read.\n\nThere's a better way.\n\n##  The problem\n\nVision LLMs are expensive for two reasons when you feed them full screenshots:\n\n  1. **Token cost** — a full screenshot can cost 10–20x more tokens than a small crop\n  2. **Accuracy loss** — models internally downscale large images, blurring fine text, labels, and UI elements\n\n\n\nBut your agent already knows _where_ to look. Browser automation tools like Playwright and Puppeteer give you `getBoundingClientRect()` — the exact pixel coordinates of any element on screen.\n\nSo why are you sending the whole screenshot?\n\n##  The solution\n\nI built a stateless pay-per-use API that takes a screenshot and pixel coordinates, and returns just the cropped element as a lossless PNG — ready to pass directly to your vision LLM.\n\n\n\n    POST /crop\n    {\n      \"image\":  \"<base64 screenshot>\",\n      \"x\":      120,\n      \"y\":      45,\n      \"width\":  640,\n      \"height\": 80\n    }\n\n\nReturns:\n\n\n\n    {\n      \"success\": true,\n      \"data\": {\n        \"base64\": \"iVBORw0KGgo...\",\n        \"mime\":   \"image/png\",\n        \"width\":  640,\n        \"height\": 80,\n        \"bytes\":  4821\n      }\n    }\n\n\nA 4KB crop instead of a 2MB screenshot. Same information. 95% fewer tokens.\n\n##  How payment works\n\nHere's where it gets interesting. The API uses the **x402 payment protocol** — HTTP's long-dormant 402 Payment Required status code, finally put to use.\n\nThere are no API keys. No accounts. No subscriptions. The agent pays $0.0005 USDC per crop on Base L2 automatically.\n\nThe flow:\n\n\n\n    1. 
Agent POSTs to /crop (no payment header)\n       ← 402 with payment instructions in headers\n\n    2. Agent transfers 0.0005 USDC to recipient wallet on Base\n       (near-zero gas, ~2 second settlement)\n\n    3. Agent POSTs again with x-payment-tx-hash header\n       ← 200 with cropped PNG\n\n\nThe entire exchange happens inside the HTTP request cycle. No human intervention. No billing dashboard. The money lands directly in the operator's wallet on-chain.\n\n##  Real agent integration\n\nHere's what using it looks like in a Playwright agent:\n\n\n\n    import { chromium } from 'playwright';\n    import { readFileSync } from 'fs';\n\n    const browser = await chromium.launch();\n    const page    = await browser.newPage();\n    await page.goto('https://example.com/dashboard');\n\n    // Take screenshot\n    await page.screenshot({ path: 'screen.png' });\n    const imageB64 = readFileSync('screen.png').toString('base64');\n\n    // Get element coordinates\n    const rect = await page.$eval('.price-display', el => el.getBoundingClientRect().toJSON());\n\n    // Probe the API for payment instructions\n    const probe = await fetch('https://x402-vision-cropper.onrender.com/crop', {\n      method:  'POST',\n      headers: { 'Content-Type': 'application/json' },\n      body:    JSON.stringify({\n        image:  imageB64,\n        x:      Math.floor(rect.x),\n        y:      Math.floor(rect.y),\n        width:  Math.floor(rect.width),\n        height: Math.floor(rect.height),\n      }),\n    });\n\n    // → 402 response with payment details in headers\n    const recipient = probe.headers.get('x-payment-recipient');\n    const amount    = probe.headers.get('x-payment-price-usdc');\n\n    // Pay on Base L2 using viem\n    const txHash = await sendUsdc({ recipient, amount }); // your wallet logic here\n\n    // Resubmit with payment proof\n    const result = await fetch('https://x402-vision-cropper.onrender.com/crop', {\n      method:  'POST',\n      headers: {\n        
'Content-Type':       'application/json',\n        'x-payment-tx-hash':  txHash,\n      },\n      body: JSON.stringify({\n        image:  imageB64,\n        x:      Math.floor(rect.x),\n        y:      Math.floor(rect.y),\n        width:  Math.floor(rect.width),\n        height: Math.floor(rect.height),\n      }),\n    });\n\n    const { data } = await result.json();\n\n    // Pass the tiny crop to your vision LLM instead of the full screenshot\n    const response = await openai.chat.completions.create({\n      model: 'gpt-4o',\n      messages: [{\n        role: 'user',\n        content: [\n          { type: 'image_url', image_url: { url: `data:${data.mime};base64,${data.base64}` } },\n          { type: 'text', text: 'What is the price shown?' }\n        ]\n      }]\n    });\n\n\n##  The architecture\n\nThe server is intentionally minimal:\n\n  * **Fastify** on Node.js — low memory footprint\n  * **Sharp** for image processing — in-RAM only, no disk writes\n  * **Zero persistent storage** — every request is stateless, data exists only for the duration of the request\n  * Runs on a **512MB single-CPU container** on Render\n\n\n\nThe entire codebase is about 400 lines across 7 files. No database. No session state. No auth layer beyond the payment itself.\n\n##  Try it\n\nThe API is live now:\n\n\n\n    # Check it's running\n    curl https://x402-vision-cropper.onrender.com/health\n\n    # Trigger the payment challenge\n    curl -X POST https://x402-vision-cropper.onrender.com/crop \\\n      -H \"Content-Type: application/json\" \\\n      -d '{\"image\":\"'\"$(python3 -c \"print('A'*200)\")\"'\",\"x\":0,\"y\":0,\"width\":10,\"height\":10}'\n\n\nMachine-readable docs for agents: https://x402-vision-cropper.onrender.com/llms.txt\n\n##  What I learned building this\n\n**x402 is genuinely exciting but very early.** The protocol works cleanly — payment instructions in headers, proof in the retry, settlement on-chain. But the agent ecosystem is still catching up. 
Most frameworks don't have native wallet support yet.\n\n**Stateless by design is underrated.** No database means no stored screenshots to leak, a much smaller GDPR surface, no backup strategy, and no connection pooling. Every request lives and dies in RAM. For an API that processes sensitive screenshot data, this is the right architecture.\n\n**The unit economics make sense at scale.** At $0.0005 per crop, the fee is small next to the vision tokens a crop can save. The challenge isn't pricing; it's volume.\n\nIf you're building browser agents or anything that feeds screenshots to vision models, give it a try. And if you're building in the x402 / agentic payments space, I'd love to hear what you're working on.",
  "title": "I built a $0.0005 screenshot cropper that saves AI agents 95% on vision LLM costs"
}
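
The probe, pay, retry flow described in `textContent` can be condensed into two small helpers. This is a minimal sketch, not the author's implementation: `toCropBox` and `cropWithPayment` are hypothetical names, the header names are the ones quoted in the article, and `sendUsdc` stands in for the wallet logic the article leaves to the reader. It also assumes the screenshot and `getBoundingClientRect()` share the same pixel scale (a device scale factor of 1) and that the screenshot covers the viewport only.

```javascript
// Convert a DOMRect-like box into integer pixel coordinates clamped to the
// screenshot, so the crop request never asks for pixels outside the image.
function toCropBox(rect, imageWidth, imageHeight) {
  const x = Math.max(0, Math.floor(rect.x));
  const y = Math.max(0, Math.floor(rect.y));
  const right = Math.min(imageWidth, Math.ceil(rect.x + rect.width));
  const bottom = Math.min(imageHeight, Math.ceil(rect.y + rect.height));
  if (right <= x || bottom <= y) {
    throw new RangeError('element lies outside the screenshot');
  }
  return { x, y, width: right - x, height: bottom - y };
}

// POST once without payment; on a 402, pay and retry with the tx hash.
// `sendUsdc` is a caller-supplied function (assumption) that resolves to a
// transaction hash. `fetchImpl` is injectable so the flow can be tested offline.
async function cropWithPayment(url, body, sendUsdc, fetchImpl = globalThis.fetch) {
  const post = (extraHeaders = {}) =>
    fetchImpl(url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', ...extraHeaders },
      body: JSON.stringify(body),
    });

  let res = await post();
  if (res.status === 402) {
    // Payment instructions arrive in the response headers, per the article.
    const recipient = res.headers.get('x-payment-recipient');
    const amount = res.headers.get('x-payment-price-usdc');
    const txHash = await sendUsdc({ recipient, amount });
    res = await post({ 'x-payment-tx-hash': txHash });
  }
  if (!res.ok) {
    throw new Error(`crop failed with HTTP ${res.status}`);
  }
  const { data } = await res.json();
  return data;
}
```

In the Playwright example, the request body would then be `{ image: imageB64, ...toCropBox(rect, 1920, 1080) }`, and a single `cropWithPayment(...)` call replaces the two hand-written `fetch` calls. Rounding the right and bottom edges outward keeps fractional-pixel text from being clipped, at the cost of at most one extra pixel per side.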