Each Lambda function must be configured with a memory setting, which also determines the CPU and network performance allocated, roughly in proportion to memory. In many environments, memory settings are chosen arbitrarily or left unchanged as functions evolve. Over time this leads to overprovisioning: functions use far less memory than they are allocated while incurring unnecessary compute costs. Systematic right-sizing using performance benchmarks can significantly reduce spend without sacrificing performance or reliability. This is especially important for frequently invoked functions and those with long execution times.
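Because billed cost is roughly memory × billed duration × a per-GB-second rate, and duration often falls as memory (and therefore CPU) increases, right-sizing reduces to benchmarking candidate memory settings and picking the cheapest. A minimal sketch of that selection, with an illustrative rate and hypothetical benchmark numbers:

```python
# Sketch: pick the cheapest Lambda memory setting from benchmark results.
# The rate is illustrative (x86 on-demand order of magnitude); use your own.
PRICE_PER_GB_SECOND = 0.0000166667

def cheapest_memory(benchmarks: dict[int, float]) -> int:
    """benchmarks maps memory (MB) -> average billed duration (ms)."""
    def cost(mem_mb: int, dur_ms: float) -> float:
        return (mem_mb / 1024) * (dur_ms / 1000) * PRICE_PER_GB_SECOND
    return min(benchmarks, key=lambda m: cost(m, benchmarks[m]))

# Hypothetical benchmark: more memory shortens runs up to a point, then stops helping.
results = {128: 2400.0, 256: 1150.0, 512: 600.0, 1024: 580.0}
print(cheapest_memory(results))  # → 256
```

Note that the cheapest setting here is not the smallest one: at 128 MB the longer duration outweighs the lower memory price.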
Databricks cost optimization begins with visibility. Unlike traditional IaaS services, Databricks operates as an orchestration layer spanning compute, storage, and execution — but its billing data often lacks granularity by workload, job, or team. This creates a visibility gap: costs fluctuate without clear root causes, ownership is unclear, and optimization efforts stall due to lack of actionable insight. When costs are not attributed functionally — for example, to orchestration (query/job DBUs), compute (cloud VMs), storage, or data transfer — it becomes difficult to pinpoint what’s driving spend or where improvements can be made. As a result, inefficiencies persist not due to a single misconfiguration, but because the system lacks the structure to surface them.
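One common way to close the attribution gap is consistent tagging, since Databricks propagates cluster custom_tags to the underlying cloud resources, where they surface in billing data. The tag keys below are a hypothetical convention, not a Databricks requirement:

```python
# Sketch: a tag scheme for workload- and team-level cost attribution,
# expressed as Databricks cluster "custom_tags". Keys are illustrative.
cluster_spec = {
    "cluster_name": "etl-orders-nightly",
    "custom_tags": {
        "team": "data-platform",   # ownership
        "workload": "etl-orders",  # job-level attribution
        "env": "prod",             # environment split
        "cost-center": "cc-1234",  # chargeback key
    },
}
```

With tags like these enforced (for example via cluster policies), billing exports can be grouped by team, workload, and environment instead of showing one undifferentiated DBU line.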
While many AWS customers have migrated EC2 workloads to Graviton to reduce costs, Lambda functions often remain on the default x86 architecture. AWS Graviton2 (ARM) offers lower pricing and equal or better performance for most supported runtimes — yet adoption remains uneven due to legacy defaults or lack of awareness. Continuing to run eligible Lambda functions on x86 leads to unnecessary spending. The migration requires minimal configuration changes and can be verified through benchmarking and workload testing.
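The savings can be estimated before migrating, since the arm64 duration rate is published alongside the x86 rate. A rough calculator, with illustrative us-east-1-style rates (verify against your region's current pricing), ignoring any additional gains from faster execution on Graviton:

```python
# Sketch: estimate duration-cost savings from moving a Lambda function to
# arm64 (Graviton2). Rates are illustrative; check current regional pricing.
X86_PER_GB_S = 0.0000166667
ARM_PER_GB_S = 0.0000133334  # roughly 20% lower than x86

def monthly_duration_cost(gb_seconds: float, rate: float) -> float:
    return gb_seconds * rate

gb_s = 50_000_000  # hypothetical 50M GB-seconds per month
x86 = monthly_duration_cost(gb_s, X86_PER_GB_S)
arm = monthly_duration_cost(gb_s, ARM_PER_GB_S)
print(f"monthly savings: ${x86 - arm:.2f}")
```

The actual switch is a small configuration change (setting the function architecture to arm64 and redeploying), after which the benchmark comparison above can be repeated with measured durations.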
For Premium SSD and Standard SSD disks 513 GiB or larger, Azure now offers the option to enable Performance Plus — unlocking higher IOPS and MBps at no extra cost. Many environments that previously required custom performance settings continue to pay for additional throughput unnecessarily. By not enabling Performance Plus on eligible disks, organizations miss a straightforward opportunity to reduce disk spend while maintaining or improving performance. The feature is opt-in and must be explicitly enabled on each qualifying disk.
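A periodic inventory scan can surface the eligible disks. The sketch below encodes the eligibility rule stated above against a simplified disk inventory; note that, per Azure documentation at the time of writing, Performance Plus is set when a disk is created (for example via `az disk create --performance-plus`), so existing disks are typically migrated via snapshot and recreate:

```python
# Sketch: flag disks eligible for Performance Plus using the rule from the
# text (Premium SSD or Standard SSD, 513 GiB or larger, flag not yet set).
# The inventory shape and SKU names below are simplified illustrations.
ELIGIBLE_SKUS = {"Premium_LRS", "StandardSSD_LRS"}

def performance_plus_candidates(disks):
    """disks: iterable of dicts with 'name', 'sku', 'size_gib', 'performance_plus'."""
    return [
        d["name"] for d in disks
        if d["sku"] in ELIGIBLE_SKUS
        and d["size_gib"] >= 513
        and not d["performance_plus"]
    ]

inventory = [
    {"name": "db-data", "sku": "Premium_LRS", "size_gib": 1024, "performance_plus": False},
    {"name": "web-os",  "sku": "Premium_LRS", "size_gib": 128,  "performance_plus": False},
]
print(performance_plus_candidates(inventory))  # → ['db-data']
```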
Each Azure VM size has a defined limit for total disk IOPS and throughput. When high-performance disks (e.g., Premium SSDs with high IOPS capacity) are attached to low-tier VMs, the disk’s performance capabilities may exceed what the VM can consume. This results in paying for performance that the VM cannot access. For example, attaching a large Premium SSD to a B-series VM will not provide the expected performance because the VM cannot deliver that level of throughput. Without aligning disk selection with VM limits, organizations incur unnecessary storage costs with no corresponding performance benefit.
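Catching the mismatch is a straightforward comparison between a disk's provisioned IOPS and the VM size's uncached-disk IOPS cap. The caps below are illustrative placeholders rather than authoritative Azure limits; real values should be looked up per VM size:

```python
# Sketch: detect disks whose provisioned IOPS exceed what the attached VM
# can consume. Per-size caps here are illustrative, not official limits.
VM_UNCACHED_IOPS_CAP = {
    "Standard_B2ms": 3750,
    "Standard_D8s_v5": 12800,
}

def wasted_iops(vm_size: str, disk_iops: int) -> int:
    """IOPS paid for but unreachable through this VM size (0 if none)."""
    cap = VM_UNCACHED_IOPS_CAP[vm_size]
    return max(0, disk_iops - cap)

# A high-IOPS Premium SSD behind a small B-series VM wastes most of its IOPS.
print(wasted_iops("Standard_B2ms", 20000))  # → 16250
```

The same comparison applies to throughput (MBps) caps, and the fix runs in either direction: a cheaper disk for the same VM, or a larger VM if the performance is actually needed.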
Azure WAF configurations attached to Application Gateways can persist after their backend pool resources have been removed — often during environment reconfiguration or application decommissioning. In these cases, the WAF is no longer serving any functional purpose but continues to incur fixed hourly costs. Because no traffic is routed and no applications are protected, the WAF is effectively inactive. These orphaned WAFs are easy to overlook without regular cleanup processes and can quietly accumulate unnecessary charges over time.
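A cleanup job can flag candidates by checking for WAF-enabled gateways whose backend pools are all empty. The field names below are simplified stand-ins for the ARM resource shape, not the actual API schema:

```python
# Sketch: flag WAF-enabled Application Gateways whose backend pools are all
# empty (i.e., protecting nothing). Inventory fields are simplified stand-ins.
def orphaned_wafs(gateways):
    return [
        gw["name"] for gw in gateways
        if gw["waf_enabled"]
        and all(len(pool) == 0 for pool in gw["backend_pools"])
    ]

inventory = [
    {"name": "agw-legacy", "waf_enabled": True, "backend_pools": [[]]},
    {"name": "agw-prod",   "waf_enabled": True, "backend_pools": [["10.0.0.4"]]},
]
print(orphaned_wafs(inventory))  # → ['agw-legacy']
```

Flagged gateways still warrant a manual check (for example, confirming no listeners are in active use) before deletion.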
Many EC2 workloads—such as development environments, test jobs, stateless services, and data processing pipelines—can tolerate interruptions and do not require the reliability of On-Demand pricing. Using On-Demand instances in these scenarios drives up cost without adding value. Spot Instances offer significantly lower pricing and are well-suited to workloads that can handle restarts, retries, or fluctuations in capacity. Without evaluating workload tolerance and adjusting pricing models accordingly, organizations risk consistently overpaying for compute.
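The evaluation can be made systematic with a coarse screen that encodes the criteria above. The attribute names are hypothetical, and a real assessment would also weigh interruption rates for the chosen instance types and Availability Zones:

```python
# Sketch: a coarse first-pass screen for Spot suitability, encoding
# interruption tolerance, statelessness or checkpointing, and latency needs.
def spot_candidate(workload: dict) -> bool:
    return (
        workload["tolerates_interruption"]
        and (workload["stateless"] or workload["checkpointed"])
        and not workload["latency_critical"]
    )

batch_job = {"tolerates_interruption": True, "stateless": False,
             "checkpointed": True, "latency_critical": False}
print(spot_candidate(batch_job))  # → True
```

Workloads that pass the screen can then be moved incrementally, for example by mixing Spot and On-Demand capacity in an Auto Scaling group rather than converting everything at once.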
Databricks Serverless Compute is now available for jobs and notebooks, offering a simplified, autoscaled compute environment that eliminates cluster provisioning, reduces idle overhead, and improves Spot survivability. For short-running, bursty, or interactive workloads, Serverless can significantly reduce cost by billing only for execution time. However, Serverless is not universally available or compatible with all workload types and libraries. Organizations that exclusively rely on traditional clusters may be missing emerging opportunities to reduce spend and simplify operations by leveraging Serverless where appropriate.
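As a sketch of what the switch looks like in a Jobs task definition: per current Databricks documentation, a task that specifies no cluster runs on serverless compute where it is available. Treat these shapes as illustrative fragments, not complete job definitions:

```python
# Sketch: classic-cluster task vs. serverless task in a Databricks Jobs
# definition. Paths and node types are hypothetical examples.
classic_task = {
    "task_key": "nightly_etl",
    "notebook_task": {"notebook_path": "/Jobs/etl"},
    "new_cluster": {"num_workers": 4, "node_type_id": "m6g.xlarge"},
}
serverless_task = {
    "task_key": "nightly_etl",
    "notebook_task": {"notebook_path": "/Jobs/etl"},
    # no new_cluster / existing_cluster_id -> task opts into serverless compute
}
```

Before migrating a job, confirm that its libraries and workload type are supported on Serverless, and compare measured cost, since billing shifts from cluster uptime to execution time.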
In Databricks, on-demand instances provide reliable performance but come at a premium cost. For non-production workloads—such as development, testing, or exploratory analysis—high availability is often unnecessary. Spot instances provide equivalent performance at a lower price, with the tradeoff of occasional interruptions. If teams default to on-demand usage in lower environments, they may be incurring unnecessary compute costs. Using compute policies to limit on-demand usage ensures greater consistency and efficiency across environments.
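A cluster policy can enforce this rather than relying on team discipline. The fragment below uses AWS attribute names; verify against the current policy-definition schema for your cloud. Keeping `first_on_demand` at 1 leaves the driver on on-demand capacity so a Spot interruption does not take down the whole cluster:

```python
# Sketch of a Databricks cluster-policy fragment for lower environments,
# pinning workers to Spot capacity (AWS attribute names; verify the schema).
dev_policy = {
    "aws_attributes.availability": {
        "type": "fixed", "value": "SPOT_WITH_FALLBACK",
    },
    "aws_attributes.first_on_demand": {
        "type": "fixed", "value": 1,  # driver stays on-demand
    },
}
```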
Databricks supports AWS Graviton-based instances for most workloads, including Spark jobs, data engineering pipelines, and interactive notebooks. These instances offer significant cost advantages over traditional x86-based VMs, with comparable or better performance in many cases. When teams default to legacy instance types, they miss an easy opportunity to reduce compute spend. Unless workloads have known compatibility issues or specialized requirements, Graviton should be the default instance family used in Databricks Clusters.
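Auditing this default needs a way to tell Graviton node types apart. On AWS, Graviton instance families carry a "g" suffix in the generation segment of the name (m6g, c7g, r6gd, and so on); the heuristic below matches that pattern, though an explicit allowlist is safer in practice:

```python
# Sketch: flag whether a cluster node type is a Graviton (arm64) instance
# family, using the AWS naming convention. Heuristic only; prefer an
# explicit allowlist for enforcement.
import re

GRAVITON = re.compile(r"^[a-z]+\d+g[a-z]*\.")  # e.g. m6g.xlarge, c7gd.2xlarge

def on_graviton(node_type_id: str) -> bool:
    return bool(GRAVITON.match(node_type_id))

print(on_graviton("m6g.xlarge"), on_graviton("m5.xlarge"))  # → True False
```

The same check can back a cluster policy or a periodic report of clusters still running on x86 node types.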