Building Local: My 2026 Headless AI Server Journey
Project Summary: Multi-Server Cluster Optimization
Timeline: April 24, 2026 (approx. 3 hours of tuning)
1. Infrastructure & UI Upgrade
Update: Migrated the Lenovo ThinkServer to Open WebUI v2.0 (v0.9.2).
Result: Enabled the “Thinking Mode” UI and moved to a high-concurrency async database, allowing the server to manage massive file libraries (2TB SSD) without lag.
2. Model Deployment: Gemma 4 26B
Model: Deployed Gemma 4 26B-A4B-it (Q4_K_M).
Advantage: Used Mixture-of-Experts (MoE) logic to gain 26B-level reasoning while only activating 3.8B parameters per token—giving you high-end intelligence at mid-range speeds.
3. Hardware Tuning: “The 97% Sweet Spot”
GPU: AMD Radeon 7800 XT (16GB VRAM).
Config: Manually tuned to 27 layers with a 16,000 context window.
Result: Achieved 97% VRAM utilization. This maximizes the GPU’s capacity, leaving just enough room for context growth while spilling only the final 3 layers to system RAM.
4. Benchmarks
| Metric | Performance |
|---|---|
| Generation Speed | 25.13 tokens/s (Instantaneous feel) |
| Prompt Processing | 283.03 tokens/s (Fast large-file reading) |
| Stability | 100% stable at 97% load (Headless mode) |
Discussion in the ATmosphere