Reducing Azure Kubernetes Service Costs: A Real-World Node Pool Downsizing Experience

Recently, I had the opportunity to significantly reduce our Azure Kubernetes Service (AKS) costs by downsizing our node VMs without disrupting our applications. In this post, I’ll share the approach, challenges, and lessons learned.

The Starting Point

Our AKS cluster was running with powerful VMs that provided plenty of resources for our analytics processing workloads. However, after checking our resource utilization with standard Kubernetes monitoring commands such as kubectl top nodes and kubectl top pods, I discovered we were significantly overprovisioned.

Monitoring revealed that our actual resource utilization was just a fraction of what we had allocated. This presented a clear opportunity for cost optimization by moving to smaller VM sizes, potentially cutting our compute costs by approximately 50%.
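As a rough illustration, these are the kinds of commands I used to compare allocated resources against actual usage (the node name is a placeholder; your output will obviously differ):

# Per-node CPU and memory usage
kubectl top nodes

# Per-pod usage; --containers gives a per-container breakdown
kubectl top pods --all-namespaces --containers

# Compare against what pods have actually requested on a given node
kubectl describe node <node-name> | grep -A 10 "Allocated resources"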

Understanding VM Size Selection

In Azure, the choice of VM size directly impacts both performance and cost. For example, moving from a Standard_D8ds_v5 (8 vCPUs, 32GB RAM) to a Standard_D4ds_v5 (4 vCPUs, 16GB RAM) can reduce costs by approximately 50% while still providing sufficient resources for many workloads.
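If you want to compare candidate sizes before committing, the Azure CLI can list the specs available in your region. This is just one way to do it, and the region below is an example:

# List VM sizes available in a region, with vCPU and memory columns
az vm list-sizes --location westeurope --output table

# Narrow the output down to the Ddsv5 family
az vm list-sizes --location westeurope --output table | grep "ds_v5"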

When selecting VM sizes for Kubernetes nodes, it’s crucial to consider:

  • Application requirements: CPU, memory, and disk I/O needs for your specific workloads
  • Burstable workloads: Whether your applications have consistent usage or occasional spikes
  • Node density: How many pods will run on each node and their combined resource requirements
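For the node density question in particular, it helps to look at what each node can actually allocate after system reservations. A quick sketch using standard Node API fields:

# Show allocatable CPU and memory per node
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory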

The Challenge

Changing the VM size of an AKS node pool isn't as simple as flipping a setting. It requires creating new node pools and carefully migrating workloads while ensuring:

  1. Zero downtime for user-facing services
  2. Preservation of stateful application data
  3. Sufficient resources for workload spikes

The Safe Migration Process

Instead of trying to resize existing node pools (which would cause downtime), I followed a safer approach:

1. Create a new node pool with smaller VMs

az aks nodepool add \
  --resource-group my-resource-group \
  --cluster-name my-cluster \
  --name userpool2 \
  --node-vm-size Standard_D4ds_v5 \
  --node-count 2 \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 5

This command creates a new node pool with Standard_D4ds_v5 VMs, which offer 4 vCPUs and 16GB of RAM. Enabling the cluster autoscaler lets the node count adjust dynamically between 2 and 5 nodes based on workload demand, providing flexibility for traffic spikes while keeping costs down during quiet periods.
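Once the pool is created, you can confirm its VM size, node count, and autoscaler settings (resource group and cluster names match the example above):

# Verify the new pool before moving any workloads onto it
az aks nodepool list \
  --resource-group my-resource-group \
  --cluster-name my-cluster \
  --output table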

2. Right-size resource requests and limits

This was crucial for ensuring pods could be scheduled effectively on the smaller nodes. I adjusted resource specifications in our deployment manifests to better match actual utilization while still providing headroom for spikes:

resources:
  requests:
    memory: "2Gi"   # Down from 8Gi
    cpu: "1000m"    # Down from 4000m
  limits:
    memory: "4Gi"   # Down from 10Gi
    cpu: "3000m"    # Down from 6000m

Understanding resource allocation in Kubernetes is critical for cost optimization:

  • Resource requests: The guaranteed minimum resources allocated to a pod. Kubernetes uses this for scheduling decisions.
  • Resource limits: The maximum resources a pod can consume. Exceeding CPU limits throttles the pod, while exceeding memory limits can cause termination.

For our video processing workloads, I observed that actual memory usage was around 2GB during normal operations, with occasional spikes to 3-3.5GB during intensive processing. Setting requests at 2GB and limits at 4GB provided necessary headroom without over-allocation.

Similar analysis for CPU showed that our pods typically used 100-200m CPU, with peaks up to 1.5-2 cores during video encoding. Setting requests at 1 core and limits at 3 cores allowed for performance spikes while maintaining efficient resource allocation.
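Rolling out the new values can be done by editing the manifests (as we did) or, for a quick experiment, directly with kubectl. The deployment name below is a placeholder:

# Apply the right-sized requests and limits to an existing deployment
kubectl set resources deployment/my-video-processor \
  --requests=cpu=1000m,memory=2Gi \
  --limits=cpu=3000m,memory=4Gi

Note that changing the pod template this way triggers a rolling restart, just as editing the manifest would.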

3. Target deployments to the new node pool

We added nodeSelector to our deployment manifests to direct workloads to the new node pool:

spec:
  template:
    spec:
      nodeSelector:
        agentpool: userpool2

This Kubernetes feature allows precise control over pod placement, enabling a gradual migration strategy. By updating one deployment at a time, I could verify each service was working correctly on the new nodes before proceeding with the next.
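After updating each deployment, a quick check confirms its pods have rescheduled onto the new pool before moving on. The deployment name and label selector here are placeholders for whatever your manifests use:

# Wait for the rollout to complete
kubectl rollout status deployment/my-video-processor

# Confirm the pods landed on userpool2 nodes
kubectl get pods -l app=my-video-processor -o wide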

4. Handle stateful workloads with care

One challenge when downsizing node pools is handling any workloads that maintain state. In production environments, it’s generally better to use managed services outside the cluster for persistent data (like managed databases, cache services, or storage accounts) rather than running stateful applications inside Kubernetes with PVCs.

However, if you do have stateful workloads in your cluster, pay special attention during migration:

# For stateful workloads, consider backing up data first
# Then carefully coordinate the transition between nodes
kubectl get pv,pvc -A # Identify any persistent volumes in use

By properly planning the migration of these components and potentially scheduling maintenance windows for critical stateful services, you can ensure data integrity throughout the transition process.
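If you do need to move a stateful pod yourself, one common approach (a sketch, not something every workload can tolerate) is to cordon the old nodes so nothing new schedules there, then drain them during a maintenance window so pods are evicted gracefully and rescheduled onto the new pool. The node name is a placeholder:

# Prevent new pods from scheduling onto an old node
kubectl cordon <old-node-name>

# Evict pods gracefully; DaemonSet pods are skipped, emptyDir data is discarded
kubectl drain <old-node-name> --ignore-daemonsets --delete-emptydir-data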

Results and Validation

After the migration, we validated our workloads on the new, smaller nodes:

# All pods now running on new nodes
$ kubectl get pods -o wide

$ kubectl top nodes
# Output shows healthy utilization on new nodes

$ kubectl top pods
# Output shows all services running with appropriate resource consumption

Even with our most demanding video processing workloads, the new nodes handled the load effectively, with plenty of headroom.

Cost Savings and Lessons Learned

This migration resulted in a 50% reduction in our AKS compute costs for this workload, which translates to significant annual savings.

Key takeaways from this experience:

  1. Right-sizing is an ongoing process: Initial resource allocations are often conservative. Continuously monitor and adjust as you gather real usage data.

  2. Safe migrations require planning: Creating new node pools and gradually migrating workloads is safer than trying to resize existing pools.

  3. Stateful workloads need special attention: Persistent volumes add complexity to node migrations. Always have backups and a rollback plan.

  4. Kubernetes is flexible: The ability to precisely control pod placement with nodeSelectors made it possible to test and validate the migration incrementally.

  5. Autoscaling complements right-sizing: After downsizing, we maintained our horizontal pod autoscaler and enabled cluster autoscaling to handle any unexpected load spikes.
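A couple of commands I keep handy for sanity-checking that both layers of autoscaling are still in place after a change like this (names match the earlier examples; the JMESPath query is a sketch based on the agent pool properties I'd expect):

# Horizontal pod autoscalers and their current targets
kubectl get hpa --all-namespaces

# Cluster autoscaler bounds on the new pool
az aks nodepool show \
  --resource-group my-resource-group \
  --cluster-name my-cluster \
  --name userpool2 \
  --query "{min:minCount, max:maxCount, current:count}" \
  --output table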