Azure App Service Plans define the compute resources allocated to web applications and are billed continuously based on their pricing tier — regardless of whether the hosted apps are actively serving traffic. In non-production environments such as development, testing, or staging, workloads typically follow predictable usage patterns aligned with business hours. When these plans remain provisioned at higher-cost tiers around the clock, organizations pay premium rates for compute capacity that sits idle during evenings, weekends, and holidays.
A common misconception is that stopping the apps within a plan will halt charges. In reality, the App Service Plan itself is the billing container, and charges accrue as long as the plan exists at a dedicated tier — even with all apps stopped or deleted. Simply stopping apps provides no cost relief. Instead, the plan's tier must be actively changed to a lower-cost option during periods of inactivity to realize savings. This temporal tier-switching pattern is distinct from scaling out (adjusting instance count) or right-sizing (choosing a permanently smaller tier), and is particularly effective for non-production workloads where brief interruptions during tier transitions are acceptable.
Because higher tiers such as Premium or Standard carry significantly higher per-hour rates than the Basic tier, leaving these plans unchanged during extended idle periods is a substantial and avoidable expense. Organizations with multiple non-production App Service Plans can accumulate considerable waste if this pattern goes unaddressed.
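A back-of-the-envelope sketch of the savings at stake from tier-switching, assuming a plan at a high tier during business hours and a cheap fallback tier otherwise. The hourly rates below are hypothetical placeholders, not current Azure prices:

```python
# Estimate monthly savings from switching a non-production App Service
# Plan to a cheaper tier outside business hours.
PREMIUM_RATE = 0.40   # $/hour at the high tier (assumed, not a real price)
BASIC_RATE = 0.075    # $/hour at the fallback tier (assumed)

HOURS_PER_MONTH = 730
business_hours = 12 * 22   # ~12 h/day on ~22 weekdays per month

always_premium = PREMIUM_RATE * HOURS_PER_MONTH
tier_switched = (PREMIUM_RATE * business_hours
                 + BASIC_RATE * (HOURS_PER_MONTH - business_hours))
monthly_savings = always_premium - tier_switched
```

Under these assumed rates the plan spends roughly two thirds of the month at the cheap tier, which is where most of the savings come from.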
When organizations purchase AWS Savings Plans during periods of elevated AI inference demand — such as experimentation phases, feature launches, or early adoption surges — the committed hourly spend may significantly exceed what is needed once workloads stabilize. GPU-backed inference clusters running on high-cost instance families can drive substantial compute consumption during these peaks, and if that peak usage is used as the baseline for commitment sizing, the resulting Savings Plan will be oversized relative to steady-state demand. Because Savings Plans are billed as a fixed hourly dollar commitment for the entire term, any unused portion in a given hour is forfeited — it cannot be carried over, recouped, or applied to future hours.
This pattern is especially costly for AI inference workloads because GPU-accelerated instances carry significantly higher hourly rates than general-purpose compute, amplifying the financial impact of each underutilized hour. The problem compounds when inference workloads shift between instance families, regions, or deployment architectures over time — a common occurrence as teams optimize models, adopt newer hardware generations, or consolidate serving infrastructure. EC2 Instance Savings Plans, which are scoped to a specific instance family and region, are particularly vulnerable to these shifts. Critically, Savings Plans cannot be canceled, modified, or resold once purchased; apart from a narrow return window available under limited conditions, the commitment is irrevocable for the full term.
The net result is a sustained gap between committed spend and actual covered usage: each underutilized hour forfeits part of the commitment, and over time the effective discount can fall well below the rate that justified the purchase in the first place.
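The forfeiture mechanic can be made concrete with a small sketch. The commitment amount and hourly covered-usage figures below are illustrative, not real billing data:

```python
# Quantify forfeited spend when hourly covered usage falls below a fixed
# Savings Plan hourly commitment.
COMMITMENT = 50.0  # committed $/hour (assumed)

# Hypothetical discounted usage eligible for coverage in sample hours.
hourly_usage = [48.0, 35.0, 20.0, 50.0, 12.0]

# Any unused portion of an hour's commitment is forfeited; usage above
# the commitment spills to on-demand rates and is billed separately.
forfeited = sum(max(COMMITMENT - u, 0.0) for u in hourly_usage)
utilization = (sum(min(u, COMMITMENT) for u in hourly_usage)
               / (COMMITMENT * len(hourly_usage)))
```

At 66% utilization in this toy window, a third of the committed spend buys nothing, which is exactly the gap the paragraph above describes.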
When external Delta tables are dropped from Databricks Unity Catalog or the legacy Hive metastore, only the table metadata is removed — the underlying data files in cloud object storage (such as S3, ADLS, or GCS) remain untouched and continue to incur per-GB-month storage charges. This behavior is by design: external tables decouple metadata from data lifecycle management, meaning Databricks explicitly does not delete the underlying storage when an external table is dropped. The result is orphaned storage — files that no longer have any catalog reference, are not consumed by any downstream pipeline, and deliver no business value, yet continue to accumulate charges indefinitely.
This pattern is particularly prevalent in environments using medallion architecture (bronze/silver/gold layers), where tables are frequently recreated during pipeline evolution, schema experimentation, or migration between environments. Development and test workloads compound the problem, as teams routinely create and abandon external table references without cleaning up the associated storage. Unlike managed tables in Unity Catalog — which have a retention period with recovery capability before automatic deletion — external tables offer no such safety net. The orphaned storage is structurally invisible to standard cost dashboards because it appears as generic object storage charges, not as Databricks-specific line items. Over time, this silent accumulation can represent a meaningful share of an organization's total storage spend.
Importantly, Databricks VACUUM operations do not address this pattern. VACUUM cleans up old file versions within active Delta tables, but it cannot act on storage paths that have been completely disconnected from catalog metadata through external table drops. The only way to reclaim this storage is to manually identify and delete the orphaned files in cloud storage.
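One way to surface candidates for manual cleanup is to diff catalog-registered external-table locations against the prefixes actually present in object storage. A minimal sketch, using hard-coded stand-ins for what would in practice come from `information_schema.tables` and a cloud storage listing (all paths are invented):

```python
# Find storage prefixes that no catalog entry references.
catalog_locations = {
    "s3://lake/bronze/orders/",
    "s3://lake/silver/orders_clean/",
}
storage_prefixes = {
    "s3://lake/bronze/orders/",
    "s3://lake/silver/orders_clean/",
    "s3://lake/silver/orders_v1/",     # table dropped, files left behind
    "s3://lake/gold/orders_agg_old/",  # abandoned experiment
}
orphaned = sorted(storage_prefixes - catalog_locations)
```

The set difference is the list of paths to review before deletion; any prefix still consumed outside the catalog (for example by an external reader) should be excluded manually.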
This inefficiency occurs when Azure Load Balancers remain provisioned after the backend workloads they supported have been scaled down, stopped, or decommissioned. This is common in non-production environments where virtual machines are shut down outside business hours, but the associated load balancers are left in place. Even when no meaningful traffic is flowing, the load balancer continues to incur base charges, resulting in ongoing cost without delivering value.
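Detection can be as simple as flagging load balancers whose backend pools contain no running instances. A sketch with illustrative data; the field names are assumptions, not Azure SDK output:

```python
# Flag load balancers with no running backend VMs as idle candidates.
load_balancers = [
    {"name": "lb-dev",  "backend_vm_states": ["deallocated", "deallocated"]},
    {"name": "lb-qa",   "backend_vm_states": []},           # empty pool
    {"name": "lb-prod", "backend_vm_states": ["running", "running"]},
]

idle = [lb["name"] for lb in load_balancers
        if not any(state == "running" for state in lb["backend_vm_states"])]
```

A real implementation would also check traffic metrics over a lookback window before deleting, since a briefly stopped backend does not make the load balancer redundant.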
This inefficiency occurs when an RDS database instance is deleted but its manual snapshots or retained backups remain. Unlike automated backups tied to a live instance, these backups persist independently and continue generating storage costs despite no longer supporting any active database. This is distinct from excessive retention on active databases and typically arises from incomplete cleanup during decommissioning.
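A sketch of the decommissioning check: diff each manual snapshot's source database identifier against the set of live instances. Identifiers are invented; real lists would come from the RDS `describe_db_instances` and `describe_db_snapshots` APIs:

```python
# Identify manual snapshots whose source DB instance no longer exists.
active_instances = {"orders-prod", "orders-staging"}
manual_snapshots = [
    {"id": "orders-prod-final",     "source_db": "orders-prod"},
    {"id": "legacy-app-backup",     "source_db": "legacy-app"},  # deleted
    {"id": "reports-pre-migration", "source_db": "reports-db"},  # deleted
]

orphaned = [s["id"] for s in manual_snapshots
            if s["source_db"] not in active_instances]
```

Orphaned snapshots still deserve a retention-policy review before deletion, since a "final snapshot" taken at decommissioning may be kept deliberately for compliance.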
This inefficiency occurs when analysts use SELECT * (reading more columns than needed) and/or rely on LIMIT as a cost-control mechanism. In BigQuery, projecting excess columns increases the amount of data read and can materially raise query cost, particularly on wide tables and frequently-run queries. Separately, applying LIMIT to a query does not inherently reduce bytes processed for non-clustered tables; it mainly caps the result set returned. The “LIMIT saves cost” assumption is only sometimes true on clustered tables, where BigQuery may be able to stop scanning earlier once enough clustered blocks have been read.
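The cost difference between projecting all columns and only the needed ones can be sketched with hypothetical column sizes and an illustrative on-demand rate (neither reflects real table statistics or current BigQuery pricing):

```python
# On-demand BigQuery pricing charges for the full stored bytes of every
# referenced column, so column pruning cuts cost while LIMIT alone does not.
column_bytes = {                 # total stored bytes per column (assumed)
    "user_id":    8 * 10**9,
    "event_json": 500 * 10**9,   # the wide column dominating the table
    "ts":         8 * 10**9,
}
PRICE_PER_TIB = 6.25             # illustrative $/TiB scanned

def scan_cost(columns):
    scanned = sum(column_bytes[c] for c in columns)
    return scanned / 2**40 * PRICE_PER_TIB

select_star = scan_cost(column_bytes)       # SELECT * ... LIMIT 10
projected = scan_cost(["user_id", "ts"])    # SELECT user_id, ts ...
```

In this toy table the wide JSON column makes `SELECT *` roughly 32x more expensive than the projected query, and adding `LIMIT 10` to the former would not change its bytes scanned on a non-clustered table.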
This inefficiency occurs when an App Service Plan is sized larger than required for the applications it hosts. Plans are often provisioned conservatively to handle anticipated peak demand and are not revisited after workloads stabilize. Because pricing is tied to the plan’s SKU rather than real-time usage, oversized plans continue to incur higher costs even when CPU and memory utilization remain consistently low.
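A sketch of a downsize screen over sustained utilization percentiles; the thresholds and metric values are illustrative, and real numbers would come from Azure Monitor:

```python
# Flag plans whose p95 CPU and memory utilization suggest a smaller SKU.
plans = [
    {"name": "asp-api-prod", "cpu_p95": 62.0, "mem_p95": 71.0},
    {"name": "asp-jobs-dev", "cpu_p95": 8.5,  "mem_p95": 22.0},
]
CPU_MAX, MEM_MAX = 40.0, 50.0  # % thresholds for downsize candidates (assumed)

downsize_candidates = [p["name"] for p in plans
                       if p["cpu_p95"] < CPU_MAX and p["mem_p95"] < MEM_MAX]
```

Using a high percentile over a multi-week window, rather than an average, guards against downsizing a plan that is quiet on average but genuinely busy at peak.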
This inefficiency occurs when an Azure Virtual WAN hub is provisioned with more capacity than required to support real network traffic. Because hub costs scale with the number of configured scale units, overprovisioned hubs continue to incur higher charges even when traffic levels remain consistently low. This commonly happens when hubs are sized for peak or anticipated demand that never materializes, or when traffic patterns change over time without corresponding capacity adjustments.
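Hub sizing can be sanity-checked against observed throughput. The per-scale-unit capacity below is an assumption to verify against current Azure documentation, and the traffic figures are illustrative:

```python
import math

# Recommend a scale-unit count from observed peak throughput.
GBPS_PER_SCALE_UNIT = 0.5  # assumed capacity per scale unit; verify in docs

def recommended_scale_units(peak_gbps, headroom=1.2, minimum=2):
    # Keep 20% headroom above the observed peak, never below a floor.
    needed = math.ceil(peak_gbps * headroom / GBPS_PER_SCALE_UNIT)
    return max(needed, minimum)

provisioned = 10           # scale units currently configured (example)
peak_gbps = 1.1            # observed peak traffic (example)
recommendation = recommended_scale_units(peak_gbps)
```

When the recommendation sits far below the provisioned count, as here, the hub is a candidate for scaling down after confirming the peak window covers representative traffic.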
This inefficiency occurs when a function has steady, high-volume traffic (or predictable load) but continues running on default Lambda pricing, where costs scale with execution duration. Lambda Managed Instances runs functions on EC2 capacity operated by the Lambda service and supports multiple concurrent invocations within the same execution environment, which can materially improve utilization for suitable workloads (often IO-heavy services). For these steady-state patterns, shifting from duration-based billing to instance-based billing (and potentially leveraging EC2 pricing options such as Savings Plans or Reserved Instances) can reduce total cost while keeping the Lambda programming model. Savings are workload-dependent and not guaranteed.
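A rough break-even sketch comparing the two billing models for a steady, IO-heavy workload. Every rate and workload figure here is a hypothetical placeholder, not a published AWS price:

```python
import math

# Duration-based billing inputs (all assumed).
GB_SECOND_RATE = 0.0000166667   # $/GB-second
REQUEST_RATE = 0.20 / 1e6       # $/request
mem_gb = 1.0
avg_duration_s = 0.8
req_per_hour = 200_000

duration_cost = req_per_hour * (mem_gb * avg_duration_s * GB_SECOND_RATE
                                + REQUEST_RATE)

# Instance-based billing: IO-heavy functions can pack many concurrent
# invocations onto one execution environment (assumed packing factor).
INSTANCE_RATE = 0.35            # $/hour per instance (assumed)
concurrency_per_instance = 50
avg_concurrency = req_per_hour * avg_duration_s / 3600
instances = math.ceil(avg_concurrency / concurrency_per_instance)
instance_cost = instances * INSTANCE_RATE
```

With these assumptions the workload averages about 44 concurrent invocations, fits on a single instance, and the flat hourly rate undercuts duration billing by a wide margin; a CPU-bound function with low packable concurrency could easily flip the comparison.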
This inefficiency occurs when Azure SQL Managed Instances continue running on legacy General Purpose or Business Critical tiers despite the availability of the next-gen General Purpose tier. The newer tier enables more granular scaling of vCPU, memory, and storage, allowing workloads to better match actual resource needs. In many cases, workloads running on Business Critical—or overprovisioned legacy General Purpose—do not require the premium performance or architecture of those tiers and could achieve equivalent outcomes at lower cost by moving to next-gen General Purpose.
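A simple comparison shows the shape of the savings when a workload that never needed Business Critical moves to a right-sized next-gen General Purpose configuration. The per-vCore rates and vCore counts are hypothetical placeholders, not current Azure prices:

```python
# Compare monthly cost of a legacy Business Critical instance against a
# right-sized next-gen General Purpose configuration.
BC_VCORE_HOUR = 0.55    # Business Critical $/vCore-hour (assumed)
NGP_VCORE_HOUR = 0.20   # next-gen General Purpose $/vCore-hour (assumed)
HOURS_PER_MONTH = 730

current = {"tier": "BusinessCritical", "vcores": 8}
target = {"tier": "NextGenGP", "vcores": 6}   # finer-grained sizing (assumed)

current_cost = current["vcores"] * BC_VCORE_HOUR * HOURS_PER_MONTH
target_cost = target["vcores"] * NGP_VCORE_HOUR * HOURS_PER_MONTH
monthly_savings = current_cost - target_cost
```

The savings come from two directions at once: a lower per-vCore rate and the ability to provision fewer vCores thanks to more granular sizing.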