External Publication

Building Local: My 2026 Headless AI Server Journey

Hugging Face Forums [Unofficial] April 24, 2026

Source

Project Summary: Multi-Server Cluster Optimization

Timeline: April 24, 2026 (approx. 3 hours of tuning)

1. Infrastructure & UI Upgrade

Update: Migrated the Lenovo ThinkServer to Open WebUI v2.0 (v0.9.2).
Result: Enabled the “Thinking Mode” UI and moved to a high-concurrency async database, allowing the server to manage massive file libraries (2TB SSD) without lag.

2. Model Deployment: Gemma 4 26B

Model: Deployed Gemma 4 26B-A4B-it (Q4_K_M).
Advantage: Used Mixture-of-Experts (MoE) logic to gain 26B-level reasoning while only activating 3.8B parameters per token—giving you high-end intelligence at mid-range speeds.

3. Hardware Tuning: “The 97% Sweet Spot”

GPU: AMD Radeon 7800 XT (16GB VRAM).
Config: Manually tuned to 27 layers with a 16,000 context window.
Result: Achieved 97% VRAM utilization. This maximizes the GPU’s capacity, leaving just enough room for context growth while spilling only the final 3 layers to system RAM.

4. Benchmarks

Metric	Performance
Generation Speed	25.13 tokens/s (Instantaneous feel)
Prompt Processing	283.03 tokens/s (Fast large-file reading)
Stability	100% stable at 97% load (Headless mode)

Discussion in the ATmosphere

Loading comments...