Amazon EMR clusters often run on large, multi-node EC2 fleets, making them costly to leave running unnecessarily. If a cluster becomes idle—no longer processing jobs—but is not terminated, it continues accruing EC2 and EMR service charges. Many teams forget to shut down clusters manually or leave them running for debugging, staging, or future job use. Without an auto-termination policy, this oversight leads to significant unnecessary spend.
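A minimal boto3 sketch of attaching an auto-termination policy so an idle cluster shuts itself down; the cluster ID and the one-hour idle timeout are illustrative:

```python
import boto3

emr = boto3.client("emr")

# Attach an auto-termination policy so the cluster terminates itself
# after one hour of inactivity. Cluster ID and timeout are placeholders.
emr.put_auto_termination_policy(
    ClusterId="j-EXAMPLECLUSTERID",
    AutoTerminationPolicy={"IdleTimeout": 3600},  # seconds of idleness before termination
)
```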
Many teams run workloads on standard Fargate pricing even when the workload is fault-tolerant and can safely be interrupted. Fargate Spot provides the same performance characteristics at up to 70% lower cost, making it ideal for stateless jobs, batch processing, CI/CD runners, or retry-friendly microservices.
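A hedged boto3 sketch of running a fault-tolerant ECS service mostly on Fargate Spot through a capacity provider strategy; the cluster, service, task definition, and subnet IDs are placeholders, and the cluster is assumed to already have the FARGATE and FARGATE_SPOT capacity providers enabled:

```python
import boto3

ecs = boto3.client("ecs")

# Run most tasks on FARGATE_SPOT, keeping a small FARGATE base so at least
# one task survives a Spot interruption. All names and IDs are illustrative.
ecs.create_service(
    cluster="example-cluster",
    serviceName="batch-worker",
    taskDefinition="batch-worker:1",
    desiredCount=4,
    capacityProviderStrategy=[
        {"capacityProvider": "FARGATE", "base": 1, "weight": 1},
        {"capacityProvider": "FARGATE_SPOT", "weight": 3},
    ],
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],  # illustrative subnet ID
            "assignPublicIp": "DISABLED",
        }
    },
)
```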
If backup retention settings are too high or old automated backups are unnecessarily retained, costs can accumulate rapidly. RDS backup storage is significantly more expensive than equivalent storage in S3. For long-term retention or compliance use cases, exporting backups to S3 (e.g., via snapshot export to Amazon S3 in Parquet) is often more cost-effective than retaining them in RDS-native format.
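One possible boto3 sketch of trimming the automated backup retention window and exporting a snapshot to S3 in Parquet; all identifiers, ARNs, and the bucket name are illustrative placeholders:

```python
import boto3

rds = boto3.client("rds")

# Reduce the automated backup retention window (identifier and value are illustrative).
rds.modify_db_instance(
    DBInstanceIdentifier="example-db",
    BackupRetentionPeriod=7,  # days
    ApplyImmediately=True,
)

# Export an existing snapshot to S3 (Parquet) for cheaper long-term retention.
rds.start_export_task(
    ExportTaskIdentifier="example-db-snapshot-export",
    SourceArn="arn:aws:rds:us-east-1:123456789012:snapshot:example-db-snapshot",
    S3BucketName="example-backup-archive",
    IamRoleArn="arn:aws:iam::123456789012:role/rds-s3-export-role",
    KmsKeyId="arn:aws:kms:us-east-1:123456789012:key/example-key-id",
)
```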
Lambda functions default to the x86\_64 architecture, which is more expensive than Arm64. For many workloads, especially those written in interpreted languages (e.g., Python, Node.js) or compiled to architecture-neutral bytecode (e.g., Java), there is no dependency on x86-specific binaries. In such cases, moving to Arm64 can reduce compute costs by 20% without impacting functionality. Despite this, many teams continue to run Lambda functions on x86\_64 due to legacy configurations, inertia, or lack of awareness. This leads to avoidable spending, particularly at scale or in high-volume environments.
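A small boto3 sketch for finding candidate functions still on x86_64; the architecture switch itself happens at deploy time (for example via `create_function` or `update_function_code` with `Architectures=["arm64"]`) once any native dependencies have been rebuilt for Arm64:

```python
import boto3

lam = boto3.client("lambda")

# List functions still running on x86_64 that may be candidates for Arm64.
paginator = lam.get_paginator("list_functions")
for page in paginator.paginate():
    for fn in page["Functions"]:
        if fn.get("Architectures", ["x86_64"])[0] == "x86_64":
            print(fn["FunctionName"], fn.get("Runtime", "container image"))
```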
Improper design choices in AWS Step Functions can lead to unnecessary charges. For example:

* Using **Standard Workflows** for short-lived, high-frequency executions leads to excessive per-transition charges.
* Using **Express Workflows** for long-running processes (close to or exceeding the 5-minute limit) may cause timeouts or retries.
* Inefficient use of states—such as chaining many simple states instead of combining logic into a Lambda function—can increase cost in both workflow types.
* Overuse of payload-passing between states (especially in Express workflows) increases GB-second and data transfer charges.
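As a rough illustration, a short-lived, high-frequency workflow can be created as an Express state machine, which is billed on duration and memory rather than per state transition; the role ARN and the definition below are placeholders:

```python
import boto3

sfn = boto3.client("stepfunctions")

# Minimal illustrative state machine definition: a single Lambda task.
definition = """{
  "StartAt": "Process",
  "States": {
    "Process": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-event",
      "End": true
    }
  }
}"""

sfn.create_state_machine(
    name="process-event-express",
    definition=definition,
    roleArn="arn:aws:iam::123456789012:role/stepfunctions-execution-role",
    type="EXPRESS",  # use "STANDARD" for long-running or exactly-once workflows
)
```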
By default, EventBridge includes retry mechanisms for delivery failures, particularly when targets like Lambda functions or Step Functions fail to process an event. However, if these retry policies are disabled or misconfigured, failed deliveries may go unretried and effectively silent, prompting upstream services to republish the same event multiple times to compensate for undelivered outcomes. This leads to:

* Duplicate event publishing and delivery
* Unnecessary compute triggered by repeated events
* Increased EventBridge, downstream service, and data transfer costs

This behavior is especially problematic in systems where idempotency is not strictly enforced and retries are managed externally by upstream services.
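A hedged boto3 sketch of keeping an explicit retry policy and dead-letter queue on a rule target, so EventBridge handles failures itself instead of upstream services republishing; the rule name and ARNs are illustrative:

```python
import boto3

events = boto3.client("events")

# Keep retries and a dead-letter queue on the target so failed deliveries are
# retried by EventBridge and eventually parked in SQS. Names/ARNs are placeholders.
events.put_targets(
    Rule="order-events-rule",
    Targets=[
        {
            "Id": "order-processor",
            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:order-processor",
            "RetryPolicy": {
                "MaximumRetryAttempts": 8,
                "MaximumEventAgeInSeconds": 3600,
            },
            "DeadLetterConfig": {
                "Arn": "arn:aws:sqs:us-east-1:123456789012:order-events-dlq"
            },
        }
    ],
)
```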
By default, CloudWatch Log Groups use the Standard log class, which applies higher rates for both ingestion and storage. AWS also offers an Infrequent Access (IA) log class designed for logs that are rarely queried — such as audit trails, debugging output, or compliance records. Many teams assume storage is the dominant cost driver in CloudWatch, but in high-volume environments, ingestion costs can account for the majority of spend. Ingesting rarely accessed logs into the Standard class therefore creates unnecessary cost that could be avoided without impacting observability. The IA log class offers significantly reduced rates for ingestion and storage, making it a better fit for logs used primarily for post-incident review, compliance retention, or ad hoc forensic analysis.
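A minimal boto3 sketch, assuming an illustrative log group name; note that the log class can only be chosen when the group is created:

```python
import boto3

logs = boto3.client("logs")

# Create the group in the Infrequent Access class for rarely queried logs
# (audit trails, debug output). The group name is illustrative.
logs.create_log_group(
    logGroupName="/app/audit-trail",
    logGroupClass="INFREQUENT_ACCESS",  # default is "STANDARD"
)

# Pair it with a retention policy so old events age out automatically.
logs.put_retention_policy(logGroupName="/app/audit-trail", retentionInDays=365)
```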
When an EC2 Mac instance is stopped or terminated, its associated dedicated host remains allocated by default. Because Mac instances are the only EC2 type billed at the host level, charges continue to accrue as long as the host is retained. This can lead to significant waste when:

* Instances are stopped but the host is never released
* Hosts are retained unintentionally after workloads are migrated or decommissioned
* Automation only terminates instances without deallocating hosts
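One possible boto3 sketch for finding and releasing Mac dedicated hosts with no instances on them; the instance-type filter value is illustrative, and a release may be rejected while the host is still inside its 24-hour minimum allocation period:

```python
import boto3

ec2 = boto3.client("ec2")

# Find available (empty) dedicated hosts for an illustrative Mac instance type.
hosts = ec2.describe_hosts(
    Filters=[
        {"Name": "state", "Values": ["available"]},
        {"Name": "instance-type", "Values": ["mac2.metal"]},  # illustrative type
    ]
)["Hosts"]

# Release hosts that have no instances, so host-level billing stops.
idle_host_ids = [h["HostId"] for h in hosts if not h.get("Instances")]
if idle_host_ids:
    ec2.release_hosts(HostIds=idle_host_ids)
```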
In many Databricks environments, large Delta tables are created without enabling standard optimization features like partitioning and Z-Ordering. Without these, queries scanning large datasets may read far more data than necessary, increasing execution time and compute usage.

* **Partitioning** organizes data by a specified column to reduce scan scope.
* **Z-Ordering** optimizes file sorting to minimize I/O during range queries or filters.
* **Delta Format** enables additional optimizations like data skipping and compaction.

Failing to use these features in high-volume tables often results in avoidable performance overhead and elevated spend, especially in environments with frequent exploratory queries or BI workloads.
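A minimal PySpark sketch, assuming a Databricks notebook and illustrative path, table, and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided automatically in a Databricks notebook

# Illustrative source; the path, table, and column names are assumptions.
events_df = spark.read.format("json").load("/mnt/raw/events/")

# Write the Delta table partitioned by a commonly filtered, low-cardinality column.
(events_df
    .write
    .format("delta")
    .partitionBy("event_date")
    .mode("overwrite")
    .saveAsTable("analytics.events"))

# Compact small files and Z-Order by a high-cardinality filter column so
# range queries and point lookups can skip irrelevant files.
spark.sql("OPTIMIZE analytics.events ZORDER BY (customer_id)")
```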
Business Intelligence dashboards and ad-hoc analyst queries frequently drive Databricks compute usage — especially when:

* Dashboards are auto-refreshed too frequently
* Queries scan full datasets instead of leveraging filtered views or materialized tables
* Inefficient joins or large broadcast operations are used
* Redundant or exploratory queries are triggered during interactive exploration

This often results in clusters staying active for longer than necessary, or being autoscaled up to handle inefficient workloads, leading to unnecessary DBU consumption.
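As one hedged mitigation sketch, dashboards can be pointed at a small pre-aggregated table that is refreshed on a schedule, rather than rescanning the full fact table on every auto-refresh; the table and column names below are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided automatically in a Databricks notebook

# Build a small, pre-aggregated table covering only the window the dashboard shows.
spark.sql("""
    CREATE OR REPLACE TABLE analytics.daily_revenue AS
    SELECT order_date, region, SUM(amount) AS revenue
    FROM analytics.orders
    WHERE order_date >= date_sub(current_date(), 90)
    GROUP BY order_date, region
""")
```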