External Publication
Visit Post

Building Local: My 2026 Headless AI Server Journey

Hugging Face Forums [Unofficial] April 24, 2026
Source

Project Summary: Multi-Server Cluster Optimization

Timeline: April 24, 2026 (approx. 3 hours of tuning)


1. Infrastructure & UI Upgrade

  • Update: Migrated the Lenovo ThinkServer to Open WebUI v2.0 (v0.9.2).

  • Result: Enabled the “Thinking Mode” UI and moved to a high-concurrency async database, allowing the server to manage massive file libraries (2TB SSD) without lag.

2. Model Deployment: Gemma 4 26B

  • Model: Deployed Gemma 4 26B-A4B-it (Q4_K_M).

  • Advantage: Used Mixture-of-Experts (MoE) logic to gain 26B-level reasoning while only activating 3.8B parameters per token—giving you high-end intelligence at mid-range speeds.

3. Hardware Tuning: “The 97% Sweet Spot”

  • GPU: AMD Radeon 7800 XT (16GB VRAM).

  • Config: Manually tuned to 27 layers with a 16,000 context window.

  • Result: Achieved 97% VRAM utilization. This maximizes the GPU’s capacity, leaving just enough room for context growth while spilling only the final 3 layers to system RAM.


4. Benchmarks

Metric Performance
Generation Speed 25.13 tokens/s (Instantaneous feel)
Prompt Processing 283.03 tokens/s (Fast large-file reading)
Stability 100% stable at 97% load (Headless mode)

Discussion in the ATmosphere

Loading comments...