Inquiry About Dataset for AI-Driven Cloud Load Balancing and Auto-Scaling of Instances
sohamk28:
I’m currently building a Smart Load Balancer with Auto-Scaling Instances and exploring ways to optimize cloud performance using AI-based techniques.
I’m looking for a dataset that contains:
- Server or VM utilization data (CPU, memory, network usage)
- Task or request distribution logs
- Auto-scaling or workload patterns over time
- Any real or simulated cloud performance metrics
I’d really appreciate it if anyone could suggest:
- Publicly available cloud workload datasets
- Google, Alibaba, or Azure cluster traces
- Or any datasets that can help in modeling or testing AI-based load balancing algorithms
Thanks in advance for your help and suggestions.
— Soham Kale
Hi Soham, you can cover this in two ways: use public traces for realism, and synthetic traces for controlled stress testing.
Public datasets worth checking:
- Google cluster traces (Borg) for job/task scheduling and resource usage patterns
- Alibaba cluster trace for container workloads and utilization over time
- Azure traces and other public workload datasets from academic benchmarking papers
Also search the Hub for terms such as “cluster trace”, “workload trace”, “autoscaling trace”, “request trace”, “datacenter telemetry”, and “Kubernetes trace”.
If you cannot find a dataset with all signals in one place, a common approach is to fuse:
- a request arrival trace (per service), plus
- a resource utilization trace (per node or pod),
and then derive autoscaling events by simulating a scaling policy over the fused data.
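To make the fusion idea concrete, here is a minimal sketch, not a definitive implementation. It joins a request trace and a CPU trace on timestamp, then replays an HPA-style rule (desired replicas = ceil(current replicas × current utilization / target utilization)) to produce scaling events. The trace values, the `fuse` and `simulate_hpa` helpers, and parameters such as `target_util` and `cores_per_replica` are all assumptions made up for illustration.

```python
import math

# Hypothetical per-minute traces; every number here is invented for illustration.
request_trace = [(0, 40), (1, 70), (2, 130), (3, 260), (4, 220), (5, 80), (6, 60)]  # (minute, requests/sec)
cpu_trace = [(0, 0.4), (1, 0.7), (2, 1.3), (3, 2.6), (4, 2.2), (5, 0.8)]  # (minute, total CPU cores used)

def fuse(requests, cpu):
    """Inner-join the two traces on timestamp; minutes missing from either side are dropped."""
    cpu_by_t = dict(cpu)
    return [{"t": t, "rps": rps, "cpu_cores": cpu_by_t[t]}
            for t, rps in requests if t in cpu_by_t]

def simulate_hpa(rows, target_util=0.5, cores_per_replica=1.0,
                 min_replicas=1, max_replicas=10):
    """Replay an HPA-style rule over the fused rows and return (minute, old, new) events."""
    replicas = min_replicas
    events = []
    for row in rows:
        # Per-replica utilization if the observed demand is spread over the current replicas.
        util = row["cpu_cores"] / (replicas * cores_per_replica)
        desired = math.ceil(replicas * util / target_util)
        desired = max(min_replicas, min(max_replicas, desired))
        if desired != replicas:
            events.append((row["t"], replicas, desired))
            replicas = desired
    return events

fused = fuse(request_trace, cpu_trace)
events = simulate_hpa(fused)
for t, old, new in events:
    print(f"minute {t}: scale {old} -> {new}")
```

The derived events can then serve as labels for a supervised model or as a baseline to compare a learned policy against. Real traces will need resampling to a common interval before the join.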
How I can help you directly:
- Provide a ready-to-use synthetic dataset generator that produces time series for CPU, memory, network, request rate, latency, and error rate, plus autoscaling actions under different policies (HPA-style, predictive, RL-style)
- Include bursty traffic, diurnal seasonality, noisy telemetry, failures, and multi-service interference
- Output formats that plug into training easily, such as Parquet plus a Gym-style environment spec for RL, or a supervised dataset for predicting scale-up and scale-down actions
- Add evaluation scripts for cost, latency, SLO violations, and stability metrics, so you can compare heuristics against learned policies
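As a rough idea of the shape such a generator and evaluation script could take, here is a small standard-library sketch. It is an illustration under stated assumptions, not the full generator described above: it only models request rate (diurnal sine wave, Gaussian noise, random bursts), treats "demand exceeds capacity" as a stand-in for an SLO violation, and all names and defaults (`generate_request_trace`, `reactive_policy`, `evaluate`, `rps_per_replica`) are hypothetical.

```python
import math
import random

def generate_request_trace(minutes=1440, base_rps=100.0, diurnal_amp=60.0,
                           burst_prob=0.01, burst_scale=3.0, noise_sd=5.0, seed=0):
    """Synthetic requests/sec, one value per minute: diurnal cycle + noise + rare bursts."""
    rng = random.Random(seed)
    trace = []
    for t in range(minutes):
        rps = base_rps + diurnal_amp * math.sin(2 * math.pi * t / 1440)
        rps += rng.gauss(0.0, noise_sd)   # noisy telemetry
        if rng.random() < burst_prob:     # occasional traffic burst
            rps *= burst_scale
        trace.append(max(0.0, rps))
    return trace

def reactive_policy(rps, replicas, rps_per_replica=50.0, target_util=0.6):
    """HPA-like heuristic: size the fleet so the last observed load sits at target utilization."""
    return max(1, math.ceil(rps / (rps_per_replica * target_util)))

def evaluate(trace, policy, rps_per_replica=50.0, start_replicas=1):
    """Replay a policy over the trace and report cost, SLO-violation rate, and stability."""
    replicas = start_replicas
    replica_minutes = violations = changes = 0
    for rps in trace:
        replica_minutes += replicas              # cost proxy
        if rps > replicas * rps_per_replica:     # assumed SLO proxy: overload
            violations += 1
        new_replicas = policy(rps, replicas)     # decision takes effect next minute
        if new_replicas != replicas:
            changes += 1                         # stability proxy: fewer changes is calmer
        replicas = new_replicas
    return {"replica_minutes": replica_minutes,
            "slo_violation_rate": violations / len(trace),
            "scale_changes": changes}

trace = generate_request_trace(seed=1)
print("reactive:", evaluate(trace, reactive_policy))
print("fixed 4 :", evaluate(trace, lambda rps, n: 4, start_replicas=4))
```

Any policy with the same `(rps, replicas) -> replicas` signature, including a learned one, can be passed to `evaluate`, so heuristics and models are scored on the same trace.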