Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibnaodzcvuca5hm5jvwitnyfrxaf6ekgzs6veyuzbaw7vpyq36i5m",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkbchtfjq6f2"
  },
  "path": "/t/building-local-my-2026-headless-ai-server-journey/175243#post_7",
  "publishedAt": "2026-04-24T19:28:12.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "### **Project Summary: Multi-Server Cluster Optimization**\n\n**Timeline:** April 24, 2026 (approx. 3 hours of tuning)\n\n* * *\n\n#### **1. Infrastructure & UI Upgrade**\n\n  * **Update:** Migrated the **Lenovo ThinkServer** to **Open WebUI v2.0 (v0.9.2)**.\n\n  * **Result:** Enabled the “Thinking Mode” UI and moved to a high-concurrency async database, allowing the server to manage massive file libraries (2TB SSD) without lag.\n\n\n\n\n#### **2. Model Deployment: Gemma 4 26B**\n\n  * **Model:** Deployed **Gemma 4 26B-A4B-it (Q4_K_M)**.\n\n  * **Advantage:** Used Mixture-of-Experts (MoE) logic to gain 26B-level reasoning while only activating 3.8B parameters per token—giving you high-end intelligence at mid-range speeds.\n\n\n\n\n#### **3. Hardware Tuning: “The 97% Sweet Spot”**\n\n  * **GPU:** **AMD Radeon 7800 XT (16GB VRAM)**.\n\n  * **Config:** Manually tuned to **27 layers** with a **16,000 context window**.\n\n  * **Result:** Achieved **97% VRAM utilization**. This maximizes the GPU’s capacity, leaving just enough room for context growth while spilling only the final 3 layers to system RAM.\n\n\n\n\n* * *\n\n#### **4. Benchmarks**\n\nMetric | Performance\n---|---\n**Generation Speed** | **25.13 tokens/s** (Instantaneous feel)\n**Prompt Processing** | **283.03 tokens/s** (Fast large-file reading)\n**Stability** | **100% stable** at 97% load (Headless mode)\n\n#",
  "title": "Building Local: My 2026 Headless AI Server Journey"
}