How to Build a Private Offline Voice Assistant with Gemma 4 12B: A Complete Local Setup Guide

DEV Community [Unofficial] June 17, 2026

A developer’s guide to running Google’s 11.95B-parameter multimodal model with local STT/TTS on a 16 GB laptop under Apache 2.0.

TL;DR: Download Gemma 4 12B (~6.7 GB at 4-bit) into a local runtime such as Google AI Edge Gallery, pair it with a local STT/TTS stack, and expose a local endpoint. The 11.95B-parameter model fits on a 16 GB laptop, runs offline under Apache 2.0, and keeps all voice data on-device.

Check Hardware Constraints and the 30-Second Audio Limit

Before downloading the model, verify your machine has at least 16 GB of RAM and plan your voice pipeline around the model’s strict 30-second audio ceiling. At 4-bit quantization, Gemma 4 12B’s 11.95 billion parameters compress to roughly 6.7 GB. After loading the weights, the remaining ~9 GB must cover the operating system, the inference framework overhead, and any local audio capture or STT services. If you are running other local models or Home Assistant addons concurrently, budget even more conservatively.

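Check available memory before launching the stack (on Linux):

free -h

Read the available column, not free. Aim for as close to 14 GB available at idle as a 16 GB machine allows, which in practice means closing nearly everything else. The thinner your margin above the ~6.7 GB of weights, the more likely inference is to swap.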
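The same check can be scripted so the assistant refuses to start when memory is tight. The sketch below parses /proc/meminfo-style text (Linux only) and compares the available figure against the ~6.7 GB of weights plus an overhead allowance. The 2.0 overhead value and the function names are assumptions for illustration, not measurements, and GB and GiB are treated as interchangeable at this precision.

```python
def available_gib(meminfo_text: str) -> float:
    """Return MemAvailable from /proc/meminfo-style text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB
            return kib / (1024 ** 2)
    raise ValueError("MemAvailable not found")


def fits(meminfo_text: str, weights_gib: float = 6.7,
         overhead_gib: float = 2.0) -> bool:
    """True if available memory covers the weights plus an assumed overhead.

    overhead_gib is a hypothetical allowance for the runtime, context cache,
    and STT/TTS services; tune it for your own stack.
    """
    return available_gib(meminfo_text) >= weights_gib + overhead_gib


# Usage on Linux:
#   with open("/proc/meminfo") as f:
#       print(fits(f.read()))
```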

The model enforces a hard 30-second audio limit. Exceeding it will cause inference to fail or truncate, so your client must enforce a maximum recording duration. A common approach is to chunk incoming streams or fall back to text input for complex multi-part commands. Split existing recordings at the boundary with ffmpeg:

ffmpeg -i input.wav -f segment -segment_time 30 -c copy chunk_%03d.wav

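This produces WAV segments of about 30 seconds each. Because -c copy cuts on packet boundaries, a segment may overshoot slightly, so set -segment_time a little below 30 if you need a strict margin. Fixed-boundary cuts can also split a word in half. Feed each chunk separately, or switch to a text fallback when a user’s utterance exceeds one segment.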
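If you would rather do the split inside the client than shell out to ffmpeg, the standard-library wave module is enough for uncompressed PCM files. This is a minimal sketch under that assumption; split_wav and its parameters are illustrative names, and it cuts by frame count, so each chunk is at most max_seconds long.

```python
import wave


def split_wav(path: str, out_prefix: str, max_seconds: float = 30.0) -> list[str]:
    """Split a PCM WAV file into chunks no longer than max_seconds each."""
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(max_seconds * src.getframerate())
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of input
            out_path = f"{out_prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)    # frame count in the header is fixed up on close
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths
```

Calling split_wav("input.wav", "chunk") writes chunk_000.wav, chunk_001.wav, and so on, and returns the list of paths in order.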

Install a Local Inference Runtime

To run Gemma 4 12B without external API calls, install a local inference runtime first. The Google AI Edge Gallery is one supported deployment option for both phones and laptops, and releases are delivered as standard OS-specific packages: a Windows .exe installer, a macOS zip bundle, and a Linux package.

Because this runtime serves as the execution backend for your voice pipeline, completing the installation before downloading model weights avoids path and permission errors during setup.

On macOS, download the zip archive, extract it, and drag the resulting application into your system Applications folder. Standard user permissions are sufficient for most local inference workloads when the app resides in the Applications directory. If you prefer the command line, a common approach is to locate the downloaded bundle and move it in one step:

ARCHIVE="your-download.zip"  # placeholder: replace with the actual archive name
cd ~/Downloads && unzip "$ARCHIVE" && mv ./*.app /Applications/

On Windows, launch the downloaded .exe installer and proceed through the setup prompts until the wizard finishes. A per-user install is usually adequate, with administrator elevation required only if you explicitly choose a system-wide program directory. You can also start the installer from PowerShell once it is saved to your Downloads folder. The snippet launches the most recently modified .exe in that folder, so make sure that file really is the installer; the wizard still runs interactively:

$exe = Get-ChildItem "$env:USERPROFILE\Downloads" -Filter *.exe | Sort-Object LastWriteTime -Descending | Select-Object -First 1
Start-Process -FilePath $exe.FullName -Wait

Linux users should install the provided package using the distribution’s native package manager; because formats vary by release, refer to the supplied readme for the exact dpkg, rpm, or AppImage command. After installation completes on any platform, open the runtime and verify that the local inference engine is active before pulling the Gemma 4 12B weights. Keeping this layer fully offline ensures voice data never leaves the device.

Load Gemma 4 12B at 4-Bit Quantization

Loading Gemma 4 12B at 4-bit quantization reduces its memory footprint to roughly 6.7 GB, letting the entire model stay resident in RAM on a 16 GB laptop. Select the 4-bit option in your local inference UI or configuration file immediately after importing the model weights.

At 11.95 billion parameters, the full-precision weights would exceed typical consumer memory limits, but 4-bit compression brings private, on-device deployment within reach. In tools like Google AI Edge Gallery, select the 4-bit quantization profile during the model-import step. Because Gemma 4 is encoder-free and processes audio in a single pass, keeping the entire model in RAM is especially critical: any disk access during inference multiplies latency for multimodal inputs.

After initialization, verify that the process is fully resident in physical memory and not swapping before you attach speech-to-text or text-to-speech services; even occasional paging undermines the low latency that conversational voice interaction needs. On Linux, check that VmSwap reads 0 kB for every matching process (adjust the gemma pattern to whatever your runtime's process is actually called):

for pid in $(pgrep -f gemma); do grep -H VmSwap "/proc/$pid/status"; done

On macOS, monitor memory pressure while the model loads:

memory_pressure && vm_stat 1

If you see swap growth or pressure warnings, reduce the context window or close other applications until the process stabilizes entirely in RAM. A fully resident model avoids the round-trip disk delay that would otherwise make real-time assistant responses unusable. Treat this verification as a mandatory gate: only after confirming stable, swap-free residency should you layer on the speech pipeline.
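To turn that gate into something a startup script can enforce on Linux, you can parse the same VmSwap field programmatically. The sketch below works on the text of /proc/<pid>/status; the function names are illustrative, and finding the right PID for your runtime is left to you.

```python
def swap_kib(status_text: str) -> int:
    """Return the VmSwap value (in kB) from /proc/<pid>/status-style text."""
    for line in status_text.splitlines():
        if line.startswith("VmSwap:"):
            return int(line.split()[1])
    raise ValueError("no VmSwap line found")


def is_swap_free(status_text: str) -> bool:
    """True when the process has nothing paged out to swap."""
    return swap_kib(status_text) == 0


# Usage on Linux, with pid set to your inference process:
#   with open(f"/proc/{pid}/status") as f:
#       if not is_swap_free(f.read()):
#           raise SystemExit("model is swapping; free memory before attaching STT/TTS")
```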

Wire Up Local STT and TTS Components

Current local assistant frameworks still require separate speech and model components, so you need to bridge a local STT engine and a local TTS service to Gemma 4 12B. The STT transcript feeds into Gemma’s text context, and the generated reply routes to the TTS service, even though the model can natively ingest audio in a single pass. Until front-end conversation agents expose that native audio path, a text pipeline is the practical architecture. Because Gemma only ever sees text in this design, its hard 30-second audio limit does not apply to the model call itself. Splitting the pipeline this way also lets you upgrade either speech component independently of the quantized model.

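For the STT layer, a local ONNX/Parakeet model can deliver sub-second transcription latency on short utterances. Load the ONNX graph and run inference on the captured waveform; the exact input layout and the decoding step depend on how your model was exported:

import numpy as np, onnxruntime as ort
# Assumes a single-graph export that accepts a float32 waveform; many exports
# split encoder and decoder or expect feature input, so check yours.
session = ort.InferenceSession("parakeet.onnx")
waveform = np.zeros((1, 16000), dtype=np.float32)  # placeholder: 1 s of 16 kHz mono audio
inputs = {session.get_inputs()[0].name: waveform}
raw = session.run(None, inputs)[0]  # token IDs or logits, depending on the export
# text = <decode raw with the tokenizer/vocabulary shipped with your model>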

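Pass the resulting transcript to your local Gemma endpoint. A common pattern is to POST the text to a local inference server and read back the complete response in one piece. The example uses Ollama's /api/generate route with streaming turned off; the gemma4:12b tag stands for whatever name your runtime registered the model under:

import requests
r = requests.post("http://localhost:11434/api/generate",
    json={"model": "gemma4:12b", "prompt": text, "stream": False},
    timeout=120)
r.raise_for_status()  # surface HTTP errors instead of failing on a missing key
reply = r.json()["response"]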
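If you do want streamed output, for example to start speaking before the full reply is ready, Ollama's /api/generate returns newline-delimited JSON objects, each carrying a "response" fragment and a "done" flag. The helper below is a small sketch of how those fragments could be reassembled; join_stream is an illustrative name, and it assumes that chunk format.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[bytes]) -> str:
    """Concatenate the "response" fragments from newline-delimited JSON chunks."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # final chunk reached; ignore anything after it
    return "".join(parts)


# Usage with requests (assumed endpoint and model tag, as in the text):
#   with requests.post(url, json={"model": "gemma4:12b", "prompt": text,
#                                 "stream": True}, stream=True) as r:
#       reply = join_stream(r.iter_lines())
```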

Finally, send the reply to a local TTS service. A typical setup posts the reply text to an on-device Piper or similar HTTP endpoint and writes the returned audio to your speaker queue. The port and /tts path depend on your TTS server, so match them to its configuration:

audio = requests.post("http://localhost:5000/tts",
    json={"text": reply}, timeout=60)
audio.raise_for_status()
# playback(audio.content)  # audio.content holds the synthesized audio bytes

This keeps the full loop offline: the STT model runs locally, Gemma runs locally, and the TTS service runs locally.
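One way to keep the three stages swappable is to pass them in as plain callables, so the turn logic never knows which STT, model, or TTS backend sits behind each one. The sketch below is an assumed structure, not a framework API; run_turn and its parameter names are illustrative.

```python
from typing import Callable


def run_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    generate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one STT -> model -> TTS turn and return synthesized audio bytes."""
    transcript = transcribe(audio).strip()
    if not transcript:
        return b""  # nothing recognized; skip the model call entirely
    reply = generate(transcript)
    return synthesize(reply)
```

In a real assistant, transcribe, generate, and synthesize would wrap the three local HTTP calls described above; in tests they can be simple stand-in functions.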

Build the Voice Command Loop

Build the voice command loop by capturing microphone audio, sending it to a local STT service, forwarding the resulting transcript to your local Gemma 4 inference endpoint, and passing the generated reply to a local TTS engine for immediate playback. Cap each recording at 30 seconds. In this text pipeline the audio goes to the STT service rather than to Gemma, so the model call itself does not require the cap, but it bounds transcription latency and keeps recordings compatible with the model’s single-pass audio window if you later send audio to Gemma directly, where anything longer triggers truncation or rejection.

A common approach is to record raw PCM audio with sounddevice, flush it to a 16 kHz mono WAV file, and POST it to a local Whisper-compatible STT server listening on port 9000. Once the STT returns the transcript, construct a concise prompt formatted as a direct action command in the style of the Voice Edit pattern, with phrasing like “Restructure these notes into an executive summary” or “Translate this into Hindi”, and POST that payload to your local Gemma 4 inference endpoint running under Ollama or llama.cpp on localhost. Avoid conversational preamble; a short, explicit instruction scoped to a single action keeps the prompt small and the reply focused.

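import sounddevice as sd, requests, wave
RATE, MAX_SECONDS = 16000, 30  # 16 kHz mono, capped at the 30-second limit
# Records for the full MAX_SECONDS; call sd.stop() to end early if you add silence detection.
frames = sd.rec(int(MAX_SECONDS * RATE), samplerate=RATE,
                channels=1, dtype='int16')
sd.wait()  # block until the recording buffer is full
with wave.open("cmd.wav", "wb") as f:
    f.setnchannels(1); f.setsampwidth(2); f.setframerate(RATE)  # 2 bytes = 16-bit PCM
    f.writeframes(frames.tobytes())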

Submit the recorded file to the STT layer and retrieve the text:

curl -X POST http://localhost:9000/v1/audio/transcriptions \
  -F file="@cmd.wav" -F model="whisper-base"

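Take the text field from the STT server's JSON response as transcript, then forward it to the local Gemma 4 API as an explicit agentic instruction. The request uses Ollama's /api/generate route; llama.cpp's server exposes different routes, so adjust the URL and payload if you run that instead:

r = requests.post("http://localhost:11434/api/generate",
    json={"model": "gemma4:12b",
          "prompt": f"Restructure into an executive summary: {transcript}",
          "stream": False})  # without this, Ollama streams and r.json() fails
reply = r.json()["response"]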
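Transcripts from an STT engine often arrive with stray whitespace or line breaks, so it can help to normalize them before building the action-style prompt. The helper below is a minimal sketch; build_prompt and the 2000-character cap are assumptions chosen for illustration, not values from any Gemma documentation.

```python
def build_prompt(action: str, transcript: str, max_chars: int = 2000) -> str:
    """Combine a single explicit action with a cleaned-up transcript."""
    action = action.strip().rstrip(":.")          # avoid a doubled colon
    body = " ".join(transcript.split())[:max_chars]  # collapse whitespace, cap length
    if not action or not body:
        raise ValueError("both an action and a non-empty transcript are required")
    return f"{action}: {body}"
```

The returned string can be passed directly as the prompt value in the request to the local endpoint.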

Finally, push the model’s text output to your local TTS service—such as Piper or Coqui—and play the synthesized audio stream through your default sound device. Keep the loop strictly sequential: record, transcribe, infer, speak, then return to listening so only one audio stream is active at any moment and the pipeline stays synchronized.

Keep the Stack Offline and Private

Keep the stack offline by binding the inference client to a local-only endpoint and blocking the process from reaching external networks at the system firewall. Because inference runs entirely on-device, no audio or text needs to leave your machine, and all sensitive multimodal data remains on the hardware that owns it. Gemma 4 12B ships under Apache 2.0 with no commercial restrictions, so no API key or hosted account is involved either.

At the client layer, disable remote base URLs and any automatic fallback to hosted APIs. A common approach is to initialize the SDK against a local inference server so every request stays on the loopback interface and never attempts an external resolver:

from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:11434/v1",
    api_key="not-needed-local"
)
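A cheap guard against accidental misconfiguration is to check, at startup, that every endpoint the assistant is configured to call resolves to the loopback interface. The sketch below uses only the standard library; is_loopback_url is an illustrative name, and it deliberately treats any hostname other than localhost or a literal loopback address as non-local rather than performing a DNS lookup.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_url(url: str) -> bool:
    """True if the URL's host is localhost or a literal loopback address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # other hostnames would need DNS; treat them as non-local
```

Running this over the inference, STT, and TTS URLs before the first request lets the assistant fail fast if a remote base URL slips into its configuration.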

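For a hard offline guarantee on Linux, add a firewall rule that denies the voice assistant process any outbound route except loopback. The rule assumes the assistant runs under a dedicated user named assistant. The loopback exclusion (! -o lo) matters: without it, the rule would also block the assistant's own requests to localhost. Add the equivalent ip6tables rule if IPv6 is enabled:

sudo iptables -A OUTPUT ! -o lo -m owner --uid-owner assistant -j DROP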

The 11.95 billion parameter weights compress to roughly 6.7 GB at 4-bit quantization, so the full speech-to-reply pipeline executes in local RAM without cloud encoders or API dependencies. The 30-second recording cap also keeps each request small and predictable. After starting the assistant, verify isolation by capturing packets on your external interfaces during a voice query: any traffic from the assistant that leaves the loopback adapter means the stack is not truly offline.

FAQ

Can I send raw audio directly into Gemma 4 12B and skip STT?

The encoder-free architecture reads audio in a single pass, but most local assistant platforms and conversation agents still require separate speech components today. Until those frameworks natively stream raw audio to the model, you should keep a local STT layer in the pipeline.

Will this run on a laptop with only 16 GB of shared RAM?

Yes. Google DeepMind specifies that Gemma 4 12B runs on a 16 GB laptop. At 4-bit quantization the weights occupy roughly 6.7 GB, leaving headroom for the OS and your STT/TTS services if you manage memory carefully.

What happens if my voice command is longer than 30 seconds?

Gemma 4 12B has a hard 30-second audio limit. A common approach is to chunk long utterances or switch to a text-based prompt once you exceed that boundary.

Do I need an internet connection after the initial setup?

No. Once you have downloaded the quantized weights and installed the local runtime, the entire voice assistant operates offline. The model is open-weight under Apache 2.0, so no API keys or cloud endpoints are required.

Is there any legal risk using this for a commercial product?

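The weights ship under Apache 2.0, which permits commercial use and removes most legal friction for on-device deployments. You still need to meet the license's own conditions, such as retaining the license text and attribution notices when you redistribute.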

References for further reading

Sources consulted while researching this guide, included so you can verify the details and go deeper. Listing them is not a claim that every line was independently fact-checked.

  • Self-Hosting Gemma 4 12B: Local Deployment Guide
  • I Built a Local AI Agent with Gemma 4 — Runs Fully Offline
  • Gemma 4 for Offline HA Assistant

I packaged the setup above into a ready-to-use kit, Gemma 4 12B Local Multimodal Build Kit (13 Items), for anyone who'd rather copy-paste than wire it from scratch: https://unfairhq.gumroad.com/l/nkylsz
