How we reduced Amazon Elastic Kubernetes Service (EKS) costs by 25%
While working for a European telecom provider, we hosted their customer-facing application on an Amazon EKS cluster. The application was stable, and we had already secured the benefits of a fully managed container platform with Amazon EKS, but we were not sure whether the application was cost-optimized or whether there were opportunities to improve on this front. I was tasked with identifying and implementing cost-saving strategies while maintaining the same level of security and performance.
With Amazon EKS there is a fixed cost for the control plane ($0.10 per hour for each Kubernetes cluster at the time of writing); everything else is charged on a pay-as-you-go basis. Because the control plane fee is charged per cluster, we can use a single Amazon EKS cluster to run multiple applications by taking advantage of Kubernetes namespaces and AWS IAM security policies.
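To make the fixed-cost argument concrete, here is a minimal arithmetic sketch. The $0.10 hourly rate is the one quoted above; the 730-hour month, the four-application scenario, and the function name `control_plane_cost` are assumptions made purely for illustration.

```python
from decimal import Decimal

# Control plane rate quoted in the article (per cluster, per hour).
CONTROL_PLANE_RATE = Decimal("0.10")
# Assumption: an average month has 730 hours (8,760 hours / 12).
HOURS_PER_MONTH = 730


def control_plane_cost(clusters: int) -> Decimal:
    """Monthly fixed control plane cost for the given number of clusters."""
    return CONTROL_PLANE_RATE * HOURS_PER_MONTH * clusters


# Hypothetical scenario: four applications, each in its own cluster,
# versus one shared cluster separated by namespaces.
separate = control_plane_cost(4)
shared = control_plane_cost(1)
print(f"separate clusters: ${separate}/month")
print(f"shared cluster:    ${shared}/month")
print(f"fixed-cost saving: ${separate - shared}/month")
```

The saving here covers only the control plane fee; worker node costs depend on the workloads and are not modeled.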
As a starting point, you need a way to monitor resource usage, because Amazon EC2 worker node instances that are not fully utilized are the first candidates for optimization.
· Scale-Up: Optimize Amazon EC2 instance hours by matching the number of nodes in the cluster to demand. Node scaling itself is automated and driven by the elastic demand on the cluster.
· Right-Sizing the Pods: In an Amazon EKS environment we pay for Amazon EC2 run hours, but that is essentially the cost of hosting the pods. So we needed a way to understand not only the right size for the Amazon EC2 instances but also the right size for the pods. Pod size is an important lever for controlling Amazon EKS costs, and monitoring the CPU and memory utilization metrics collected by Prometheus is recommended. Also set resource constraints, which ensure that no single workload in the Amazon EKS cluster consumes too many resources. You can specify the minimum resources a container needs, known as requests, and the maximum it may use, known as limits; a container cannot use more than the limit you set. It is important to remember that resources are declared individually for each container in the pod, not per pod. The total resources required by a pod are the sum of the resources required by all its containers. We optimized the pod resources by allocating the right CPU and memory per container.
· Scale Down: Optimize pod hours by terminating pods that are unnecessary during low-traffic periods.
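The point that a pod's footprint is the sum of its containers' resources can be sketched as follows. The dictionary mirrors the `spec.containers[].resources` layout of a Kubernetes pod manifest; the container names and values are hypothetical, and the parser is a simplification that only handles the `m` CPU suffix and the `Ki`, `Mi`, and `Gi` memory suffixes. Init containers and pod overhead are deliberately ignored.

```python
MEMORY_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_cpu(value: str) -> int:
    """Convert a CPU quantity ("250m", "0.5", "2") to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(round(float(value) * 1000))


def parse_memory(value: str) -> int:
    """Convert a memory quantity ("256Mi", "1Gi", or plain bytes) to bytes."""
    for suffix, factor in MEMORY_UNITS.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)


def pod_totals(pod: dict, kind: str = "requests") -> dict:
    """Sum one kind of resource ("requests" or "limits") over all containers."""
    cpu = 0
    memory = 0
    for container in pod["spec"]["containers"]:
        resources = container.get("resources", {}).get(kind, {})
        cpu += parse_cpu(resources.get("cpu", "0"))
        memory += parse_memory(resources.get("memory", "0"))
    return {"cpu_millicores": cpu, "memory_mib": memory // 1024**2}


# Hypothetical pod with an application container and a sidecar.
pod = {
    "spec": {
        "containers": [
            {
                "name": "web",
                "resources": {
                    "requests": {"cpu": "250m", "memory": "256Mi"},
                    "limits": {"cpu": "500m", "memory": "512Mi"},
                },
            },
            {
                "name": "log-shipper",
                "resources": {
                    "requests": {"cpu": "100m", "memory": "64Mi"},
                    "limits": {"cpu": "200m", "memory": "128Mi"},
                },
            },
        ]
    }
}

print("requests:", pod_totals(pod, "requests"))
print("limits:  ", pod_totals(pod, "limits"))
```

Comparing these totals with the utilization observed in monitoring shows how much headroom each pod reserves on a node without actually using it.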
2. Purchase Options — Reserved Instances and Savings Plans
Optimize the instance price by replacing On-Demand Instances with Spot Instances and by purchasing Reserved Instance capacity. Again, Prometheus and Grafana help identify how much reserved capacity is required. We also used AWS Cost Explorer to review average usage over a longer timeframe.
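One possible way to turn usage history into a purchase decision is sketched below. It is not the method used in this project: the node counts are hypothetical sample data, and taking a low percentile of the node count as the always-on baseline is an assumption chosen for illustration. The baseline would be a candidate for committed pricing, and the remainder up to the peak a candidate for On-Demand or Spot capacity.

```python
import math


def nearest_rank_percentile(samples, percent):
    """Return the nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(percent / 100 * len(ordered)))
    return ordered[rank - 1]


def split_capacity(node_counts, baseline_percent=10):
    """Split observed capacity into a steady baseline and a flexible part."""
    baseline = nearest_rank_percentile(node_counts, baseline_percent)
    peak = max(node_counts)
    return {"committed_nodes": baseline, "flexible_nodes": peak - baseline}


# Hypothetical node counts sampled over time (for example, exported
# from a monitoring dashboard).
node_counts = [4, 5, 4, 6, 9, 12, 10, 6, 5, 4, 4, 5]
print(split_capacity(node_counts))
```

A longer observation window makes the baseline more trustworthy, which is why averaging usage over a longer timeframe is useful before committing to reserved capacity.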
3. Spot Instances
Although the affordability of individual cloud services makes cloud computing more accessible, the overall bill can still be high, especially when the use of cloud resources is not managed properly. It is easy to provision extra capacity that is never fully utilized and end up paying more than you should.
4. Optimizing the Container Images and Amazon Elastic Container Registry (Amazon ECR)
We spent significant time optimizing the size of the container images. This helps in two ways:
· Amazon ECR uses Amazon S3 to store the Docker images, so smaller images reduce storage costs. These images must also be downloaded to the worker nodes to create the required pods, so smaller images incur lower data transfer charges.
· Make sure to apply image lifecycle policies so that recent images stay readily available and the ones you no longer need are expired automatically. Also, use lifecycle rules and consistent tagging to locate the Docker images faster.
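An Amazon ECR lifecycle policy is a JSON document containing a list of rules. The sketch below builds one with two rules: expire untagged images after a number of days, and keep only the most recent images overall. The thresholds (14 days, 10 images) and the function name `build_lifecycle_policy` are illustrative assumptions, not the values used in this project.

```python
import json


def build_lifecycle_policy(untagged_days: int, keep_last: int) -> dict:
    """Build an ECR lifecycle policy document as a Python dictionary."""
    return {
        "rules": [
            {
                # Lower priority numbers are evaluated first.
                "rulePriority": 1,
                "description": f"Expire untagged images older than {untagged_days} days",
                "selection": {
                    "tagStatus": "untagged",
                    "countType": "sinceImagePushed",
                    "countUnit": "days",
                    "countNumber": untagged_days,
                },
                "action": {"type": "expire"},
            },
            {
                # A rule with tagStatus "any" needs the highest priority number.
                "rulePriority": 2,
                "description": f"Keep only the {keep_last} most recent images",
                "selection": {
                    "tagStatus": "any",
                    "countType": "imageCountMoreThan",
                    "countNumber": keep_last,
                },
                "action": {"type": "expire"},
            },
        ]
    }


policy_text = json.dumps(build_lifecycle_policy(untagged_days=14, keep_last=10), indent=2)
print(policy_text)
```

The resulting JSON could then be attached to a repository, for instance with the `aws ecr put-lifecycle-policy` command. Previewing a policy before applying it is worthwhile, because expired images are deleted.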
5. Data Transfer Costs
Services such as Amazon ECR, Amazon CloudWatch, and Amazon S3, all used in our environment, support Amazon VPC endpoints, and it is recommended to integrate these services through them. Amazon VPC endpoints not only allow private connectivity between resources in the VPC and supported AWS services, but also lower network latency and cost because the traffic is not routed through the internet. We created three Amazon VPC endpoints and saw a considerable reduction in VPC data transfer costs, because the traffic flowed over a private link instead of the internet.
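For orientation, the sketch below lists the endpoint service names that correspond to the services named above. The article does not state which three endpoints were created, so the selection here is an assumption; the region and the helper name `endpoint_plan` are likewise hypothetical.

```python
# Assumed selection of services; adjust to what the workloads actually call.
GATEWAY_SERVICES = ["s3"]
INTERFACE_SERVICES = ["ecr.api", "ecr.dkr", "logs", "monitoring"]


def endpoint_plan(region: str) -> list:
    """Return the VPC endpoint service names and types for one region."""
    plan = []
    for service in GATEWAY_SERVICES:
        plan.append({"service_name": f"com.amazonaws.{region}.{service}", "type": "Gateway"})
    for service in INTERFACE_SERVICES:
        plan.append({"service_name": f"com.amazonaws.{region}.{service}", "type": "Interface"})
    return plan


for endpoint in endpoint_plan("eu-west-1"):
    print(f'{endpoint["type"]:<10} {endpoint["service_name"]}')
```

Gateway endpoints for Amazon S3 carry no additional charge, whereas interface endpoints are billed per hour and per GB processed, so it is worth checking that the traffic volume justifies each interface endpoint.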
We were able to save around 25% on the cost of our Kubernetes cluster through the strategies discussed above. All of these strategies are established best practices from the Cost Optimization pillar of the AWS Well-Architected Framework, and they helped our customer run their Kubernetes workloads efficiently.