{
  "$type": "site.standard.document",
  "content": "---\ntitle: \"Ask Ben (powered by Gemma 4)\"\ndescription: \"An in-browser chat widget running Google's Gemma 4 E2B via WebGPU, primed\n  with all the content from this site so you can ask it questions about me.\"\ntags: [ai, web]\n---\n\nimport GemmaChat from \"@/components/svelte/GemmaChat.svelte\";\n\nGoogle [just released the Gemma 4 model family](https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/)---their\nmost capable open models yet, and the smallest variants are small enough to run\nentirely in your browser. The E2B model has 2.3 billion effective parameters\n(5.1B total), where the \"E\" stands for \"efficient\"---it uses a technique called\nPer-Layer Embeddings that reduces compute at inference time while keeping the\nmodel's full representational capacity.[^ple]\n\n[^ple]:\n    Each decoder layer gets its own small embedding table rather than sharing\n    a single large one. This means the total parameter count is higher than\n    the effective count, but the model only activates the embeddings it needs\n    for each layer. Clever trick.\n\nSo naturally I had to see if I could get it running on this site. The widget\nbelow loads the [LiteRT web build](https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm)\nof Gemma 4 E2B via Google's\n[MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/web_js)\nand WebGPU, then primes it with content from this site---bio, CV, research\ninterests, and blog post titles---as its system prompt. You can ask it questions\nabout me and my work, and it'll do its best to answer from that context.\n\n:::warning\n\nSome caveats, because they matter. This is a 2.3B parameter language model, not\nactually me. It will hallucinate, get things wrong, and may confidently\ntell you I have opinions I don't hold or have done things I haven't done. 
Treat\nit as a fun experiment in on-device inference, not as a reliable source of\ninformation about me.\n\nIt requires a desktop browser with WebGPU support (Chrome, Edge, Safari 17+,\nor Firefox 141+), and the initial model download is around 2 GB---though it's cached in\nyour browser after the first visit. Speed depends on your GPU---on a decent\ndiscrete GPU it should be fairly responsive, but on integrated graphics expect\nit to be slower. This is a 2.3B-parameter model doing inference in your browser, not a\ncloud API.\n\n:::\n\n<GemmaChat client:only=\"svelte\" />\n",
  "createdAt": "2026-05-13T23:14:37.386Z",
  "description": "An in-browser chat widget running Google's Gemma 4 E2B via WebGPU, primed with all the content from this site so you can ask it questions about me.",
  "path": "/blog/2026/04/03/ask-ben-powered-by-gemma-4",
  "publishedAt": "2026-04-03T00:00:00.000Z",
  "site": "at://did:plc:tevykrhi4kibtsipzci76d76/site.standard.publication/self",
  "tags": [
    "ai",
    "web"
  ],
  "textContent": "An in-browser chat widget running Google's Gemma 4 E2B via WebGPU, primed with all the content from this site so you can ask it questions about me.",
  "title": "Ask Ben (powered by Gemma 4)"
}