Inquiry About Datasets for AI-Driven Cloud Load Balancing and Auto-Scaling of Instances

Hugging Face Forums [Unofficial] March 4, 2026

sohamk28:

I’m currently building a Smart Load Balancer with Auto-Scaling Instances and exploring ways to optimize cloud performance using AI-based techniques.

I’m looking for a dataset that contains:

Server or VM utilization data (CPU, memory, network usage)

Task or request distribution logs

Auto-scaling or workload patterns over time

Any real or simulated cloud performance metrics

I’d really appreciate it if anyone could suggest:

Publicly available cloud workload datasets

Google, Alibaba, or Azure cluster traces

Or any datasets that can help in modeling or testing AI-based load balancing algorithms

Thanks in advance for your help and suggestions.

— Soham Kale

Hi Soham, you can cover this in two ways: use public traces for realism, and synthetic traces for controlled stress testing.

Public datasets worth checking:

Google cluster traces (Borg) for job/task scheduling and resource usage patterns
Alibaba cluster trace for container workloads and utilization over time
Azure traces (the Azure Public Dataset includes VM and Functions workload traces) and other public workload datasets released alongside academic benchmarking papers
Also search the Hub for terms such as “cluster trace”, “workload trace”, “autoscaling trace”, “request trace”, “datacenter telemetry”, and “Kubernetes trace”

If you cannot find a dataset with all signals in one place, a common approach is to fuse:

a request arrival trace (per service), plus
a resource utilization trace (per node or pod),

and then derive autoscaling events by replaying the fused trace through a policy simulation.
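
The fusion step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the field names (`ts`, `rps`, `cpu_cores`), the helpers `fuse_traces` and `simulate_hpa`, and the idea that every replica offers exactly one CPU core are all hypothetical. The scaling rule mirrors the formula documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current * currentMetric / targetMetric), with a tolerance band around the target.

```python
import math

def fuse_traces(requests, utilization):
    """Inner-join a request trace and a utilization trace on their timestamps."""
    util_by_ts = {row["ts"]: row["cpu_cores"] for row in utilization}
    return [
        {"ts": r["ts"], "rps": r["rps"], "cpu_cores": util_by_ts[r["ts"]]}
        for r in requests
        if r["ts"] in util_by_ts
    ]

def simulate_hpa(fused, target_util=0.5, start=2, min_replicas=1,
                 max_replicas=10, tolerance=0.1):
    """Replay fused rows through an HPA-style rule and return the scaling events.

    Assumption (hypothetical): `cpu_cores` is the total CPU demand and every
    replica offers exactly one core, so per-replica utilization is
    cpu_cores / replicas.
    """
    replicas = start
    events = []
    for row in fused:
        util = row["cpu_cores"] / replicas
        ratio = util / target_util
        if abs(ratio - 1.0) <= tolerance:
            continue  # close enough to the target: no scaling action
        desired = math.ceil(replicas * ratio)
        desired = max(min_replicas, min(max_replicas, desired))
        if desired != replicas:
            events.append({"ts": row["ts"], "from": replicas, "to": desired})
            replicas = desired
    return events

# Toy traces at one-minute resolution (values are made up for illustration).
requests = [
    {"ts": 0, "rps": 100}, {"ts": 60, "rps": 300},
    {"ts": 120, "rps": 120}, {"ts": 180, "rps": 90},
]
utilization = [
    {"ts": 0, "cpu_cores": 1.0}, {"ts": 60, "cpu_cores": 3.0},
    {"ts": 120, "cpu_cores": 1.1},  # ts=180 is missing, so that row is dropped
]

fused = fuse_traces(requests, utilization)
events = simulate_hpa(fused)
print(events)
```

With real traces, the join usually needs resampling to a common time step first, since request logs and node telemetry are rarely sampled at the same instants.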

How I can help you directly:

Provide a ready-to-use synthetic dataset generator that produces time series for CPU, memory, network, request rate, latency, and error rate, plus autoscaling actions under different policies (HPA-style, predictive, RL-style)
Include bursty traffic, diurnal seasonality, noisy telemetry, failures, and multi-service interference
Output formats that plug into training easily, such as Parquet plus a Gym-style environment spec for RL, or a supervised dataset for predicting scale-up and scale-down actions
Add evaluation scripts for cost, latency, SLO violations, and stability metrics, so you can compare heuristics against learned policies
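
To give a flavor of what such a generator and evaluation script could look like, here is a minimal sketch. It covers only the request rate, with diurnal seasonality, random bursts, and multiplicative noise, and it scores a replica schedule on cost, SLO violations, and stability. Every name and default here (`generate_trace`, `evaluate`, `capacity_rps`, `slo_util`, one-minute steps) is an illustrative assumption rather than an existing tool.

```python
import math
import random

def generate_trace(steps=1440, base_rps=200.0, amplitude=0.5,
                   burst_prob=0.01, burst_scale=3.0, noise=0.05, seed=0):
    """Synthetic request-rate trace with one-minute steps (1440 per day)."""
    rng = random.Random(seed)  # seeded so the trace is reproducible
    trace = []
    for t in range(steps):
        diurnal = 1.0 + amplitude * math.sin(2 * math.pi * t / 1440)
        burst = burst_scale if rng.random() < burst_prob else 1.0
        jitter = max(0.0, rng.gauss(1.0, noise))  # noisy telemetry, never negative
        trace.append({"t": t, "rps": base_rps * diurnal * burst * jitter})
    return trace

def evaluate(trace, replicas, capacity_rps=100.0, slo_util=0.8):
    """Score a replica schedule (one replica count per trace step).

    Assumption (hypothetical): each replica serves `capacity_rps` requests per
    second, and a step violates the SLO when utilization exceeds `slo_util`.
    """
    violations = sum(
        1 for row, n in zip(trace, replicas)
        if row["rps"] / (n * capacity_rps) > slo_util
    )
    changes = sum(1 for a, b in zip(replicas, replicas[1:]) if a != b)
    return {
        "replica_steps": sum(replicas),  # cost proxy: replica-minutes used
        "slo_violations": violations,    # steps above the utilization SLO
        "scale_changes": changes,        # stability: how often the count moved
    }

# Compare two fixed-size baselines on the same synthetic day.
trace = generate_trace(seed=42)
for n in (3, 5):
    print(n, evaluate(trace, [n] * len(trace)))
```

A learned or HPA-style policy can be scored the same way by passing its per-step replica counts in place of the fixed schedules.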
