Stale Completed or Failed Fargate Pods Causing Direct Billing and Capacity Waste
Tai Nguyen
CER:
AWS-Compute-9638
Service Category
Compute
Cloud Provider
AWS
Service Name
Amazon EKS
Inefficiency Type
Unnecessary compute and networking charges
Explanation

This inefficiency occurs when Kubernetes Jobs or CronJobs running on EKS Fargate leave completed or failed pod objects in the cluster indefinitely. Although the workload execution has finished, AWS keeps the underlying Fargate microVM running to allow log inspection and final status checks. As a result, vCPU, memory, and networking resources remain allocated and billable until the pod object is explicitly deleted.

Over time, large numbers of stale Job pods can generate direct compute charges as well as consume ENIs and IP addresses, leading to both unnecessary spend and capacity pressure. This pattern is common in batch-processing and scheduled workloads that lack automated cleanup.

Relevant Billing Model

On EKS Fargate, billing for vCPU and memory continues as long as the pod object exists, even after a Job pod reaches a terminal phase (Succeeded or Failed). Fargate infrastructure is released, and billing stops, only when the pod object is deleted from the Kubernetes API server.

Detection
  • Review whether completed or failed Fargate pods persist long after Jobs finish
  • Assess whether batch or CronJob workloads accumulate large numbers of historical pods
  • Identify environments where Job execution completes but pod cleanup is not automated
Remediation
  • Enable automatic cleanup of finished Jobs using TTL-after-finished policies
  • Configure CronJobs to retain minimal successful and failed job history
  • Treat pod lifecycle cleanup as a required design consideration for Fargate-based batch workloads
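The first two remediation steps map to standard Kubernetes fields: `ttlSecondsAfterFinished` on the Job spec and the `successfulJobsHistoryLimit` / `failedJobsHistoryLimit` fields on CronJobs. A minimal sketch (names, images, schedule, and TTL values are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-report                # illustrative name
spec:
  ttlSecondsAfterFinished: 300      # delete the Job and its pods 5 minutes after it finishes
  template:
    spec:
      containers:
        - name: report
          image: my-registry/report:latest   # illustrative image
      restartPolicy: Never
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-batch               # illustrative name
spec:
  schedule: "0 2 * * *"
  successfulJobsHistoryLimit: 1     # keep only the most recent successful Job
  failedJobsHistoryLimit: 1         # keep one failed Job for debugging
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 300  # TTL also applies to Jobs the CronJob creates
      template:
        spec:
          containers:
            - name: batch
              image: my-registry/batch:latest  # illustrative image
          restartPolicy: Never
```

With the TTL set, the Kubernetes TTL-after-finished controller deletes the pod object once the TTL expires, which is the event that actually releases the Fargate infrastructure and stops billing.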