{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreibnaodzcvuca5hm5jvwitnyfrxaf6ekgzs6veyuzbaw7vpyq36i5m",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkbchtfjq6f2"
},
"path": "/t/building-local-my-2026-headless-ai-server-journey/175243#post_7",
"publishedAt": "2026-04-24T19:28:12.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "### **Project Summary: Multi-Server Cluster Optimization**\n\n**Timeline:** April 24, 2026 (approx. 3 hours of tuning)\n\n* * *\n\n#### **1. Infrastructure & UI Upgrade**\n\n * **Update:** Migrated the **Lenovo ThinkServer** to **Open WebUI v2.0 (v0.9.2)**.\n\n * **Result:** Enabled the “Thinking Mode” UI and moved to a high-concurrency async database, allowing the server to manage massive file libraries (2TB SSD) without lag.\n\n\n\n\n#### **2. Model Deployment: Gemma 4 26B**\n\n * **Model:** Deployed **Gemma 4 26B-A4B-it (Q4_K_M)**.\n\n * **Advantage:** Used Mixture-of-Experts (MoE) logic to gain 26B-level reasoning while only activating 3.8B parameters per token—giving you high-end intelligence at mid-range speeds.\n\n\n\n\n#### **3. Hardware Tuning: “The 97% Sweet Spot”**\n\n * **GPU:** **AMD Radeon 7800 XT (16GB VRAM)**.\n\n * **Config:** Manually tuned to **27 layers** with a **16,000 context window**.\n\n * **Result:** Achieved **97% VRAM utilization**. This maximizes the GPU’s capacity, leaving just enough room for context growth while spilling only the final 3 layers to system RAM.\n\n\n\n\n* * *\n\n#### **4. Benchmarks**\n\nMetric | Performance\n---|---\n**Generation Speed** | **25.13 tokens/s** (Instantaneous feel)\n**Prompt Processing** | **283.03 tokens/s** (Fast large-file reading)\n**Stability** | **100% stable** at 97% load (Headless mode)\n\n#",
"title": "Building Local: My 2026 Headless AI Server Journey"
}