Non-production Azure VMs are frequently left running overnight and on weekends despite being used only during business hours, generating unnecessary compute spend. Azure's built-in auto-shutdown feature lets teams define a daily stop time; a deallocated VM retains its disks and configuration while compute billing stops. Implementing scheduled shutdowns in dev/test environments is a simple, low-risk optimization that can reduce compute costs by 30–60%.
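The feature can be enabled in the portal, but it is also scriptable. Below is a minimal sketch using the Azure Python SDK's generic resource API to attach the built-in auto-shutdown schedule (a Microsoft.DevTestLab/schedules resource) to a VM; the subscription, resource group, VM name, region, and shutdown time are placeholder assumptions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

SUBSCRIPTION_ID = "<subscription-id>"  # assumption: your subscription
RESOURCE_GROUP = "dev-rg"              # assumption: the VM's resource group
VM_NAME = "dev-vm-01"                  # assumption: a non-production VM

client = ResourceManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
vm_id = (
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}"
    f"/providers/Microsoft.Compute/virtualMachines/{VM_NAME}"
)

# Stop (deallocate) the VM every day at 19:00 in the schedule's time zone.
client.resources.begin_create_or_update(
    resource_group_name=RESOURCE_GROUP,
    resource_provider_namespace="Microsoft.DevTestLab",
    parent_resource_path="",
    resource_type="schedules",
    resource_name=f"shutdown-computevm-{VM_NAME}",
    api_version="2018-09-15",
    parameters={
        "location": "eastus",  # must match the VM's region
        "properties": {
            "status": "Enabled",
            "taskType": "ComputeVmShutdownTask",
            "dailyRecurrence": {"time": "1900"},
            "timeZoneId": "Eastern Standard Time",
            "targetResourceId": vm_id,
            "notificationSettings": {"status": "Disabled"},
        },
    },
).result()
```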
In many environments, users launch Databricks clusters for development or analysis and forget to shut them down after use. When no auto-termination policy is configured, these clusters remain active indefinitely, incurring unnecessary charges for both Databricks DBUs and the underlying cloud infrastructure. This inefficiency is especially common with interactive clusters that are user-managed, ephemeral, or exploratory. While Databricks provides built-in support for cluster auto-termination, teams often overlook it unless it is enforced through workspace policies. Without this safeguard, idle clusters can persist unnoticed for hours or days, leading to avoidable cost.
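A lightweight way to surface the problem is to scan the workspace for clusters with auto-termination disabled. The sketch below assumes DATABRICKS_HOST and DATABRICKS_TOKEN environment variables and uses the clusters list endpoint, where an autotermination_minutes value of 0 means the cluster never terminates on its own:

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]   # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.get(
    f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
resp.raise_for_status()

# Flag clusters that will run forever unless someone remembers to stop them.
for cluster in resp.json().get("clusters", []):
    if cluster.get("autotermination_minutes", 0) == 0:
        print(
            f"{cluster['cluster_name']} ({cluster['cluster_id']}) "
            "has no auto-termination configured"
        )
```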
Photon is optimized for SQL workloads, delivering significant speedups through vectorized execution and native C++ performance. However, Photon only accelerates workloads that use compatible operations and data patterns. If a workload includes unsupported functions, unoptimized joins, or other incompatible patterns, execution silently falls back to the standard Spark engine, even on a Photon-enabled cluster. In that case, users are billed at the premium Photon DBU rate while receiving no meaningful speed or efficiency gain. This inefficiency typically arises when teams enable Photon globally without validating workload compatibility or updating their pipelines to follow Photon best practices. The result is higher cost with no corresponding benefit: configuration drift outpacing optimization discipline.
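One rough way to spot the fallback is to inspect a query's physical plan: when Photon handles an operator, the operator appears with a Photon prefix (for example, PhotonGroupingAgg). The sketch below assumes a Databricks notebook on a Photon-enabled cluster and a placeholder table name:

```python
# Build (but don't yet run) a representative query.
df = spark.sql("""
    SELECT region, SUM(amount) AS total
    FROM sales            -- placeholder table
    GROUP BY region
""")

# Pull the planned physical execution as text and look for Photon operators.
plan = df._jdf.queryExecution().executedPlan().toString()
if "Photon" not in plan:
    print(
        "WARNING: no Photon operators in this plan; the query will run on the "
        "standard Spark engine while still billing at Photon DBU rates."
    )
else:
    print(plan)  # review any remaining non-Photon operators for partial fallback
```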
Each Lambda function must be configured with a memory setting, which also determines the CPU and networking performance allocated to it (CPU scales proportionally with memory). In many environments, memory settings are chosen arbitrarily or left unchanged as functions evolve. Over time this leads to overprovisioning, with functions using far less memory than allocated and incurring unnecessary compute costs. Systematic right-sizing using performance benchmarks can significantly reduce spend without sacrificing performance or reliability. This is especially important for frequently invoked functions or those with long execution times.
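As a sketch of what right-sizing can look like, the snippet below compares a function's configured memory with its observed peak from Lambda REPORT log lines via CloudWatch Logs Insights. The function name and 14-day window are assumptions, and because CPU scales with memory, latency should be re-benchmarked before lowering the setting:

```python
import time
import boto3

FUNCTION = "my-function"  # placeholder name
logs = boto3.client("logs")

query_id = logs.start_query(
    logGroupName=f"/aws/lambda/{FUNCTION}",
    startTime=int(time.time()) - 14 * 24 * 3600,  # last 14 days
    endTime=int(time.time()),
    queryString=(
        'filter @type = "REPORT" '
        "| stats max(@maxMemoryUsed / 1000 / 1000) as peak_mb, "
        "max(@memorySize / 1000 / 1000) as configured_mb"
    ),
)["queryId"]

# Poll until the query finishes.
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] not in ("Scheduled", "Running"):
        break
    time.sleep(1)

for row in result["results"]:
    stats = {f["field"]: f["value"] for f in row}
    print(f"configured: {stats['configured_mb']} MB, peak used: {stats['peak_mb']} MB")

# If peak usage sits well below the configured size across a representative
# window, lower the setting (and re-benchmark, since CPU scales with memory):
# boto3.client("lambda").update_function_configuration(
#     FunctionName=FUNCTION, MemorySize=512)
```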
While many AWS customers have migrated EC2 workloads to Graviton to reduce costs, Lambda functions often remain on the default x86_64 architecture. AWS Graviton2 (arm64) offers about 20% lower duration pricing and equal or better performance for most supported runtimes, yet adoption remains uneven due to legacy defaults or lack of awareness. Continuing to run eligible Lambda functions on x86 leads to unnecessary spending. For functions without architecture-specific native dependencies, the migration requires minimal configuration changes and can be verified through benchmarking and workload testing.
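A minimal sketch of the migration, assuming standard boto3 credentials and a deployment package rebuilt for arm64 (any native dependencies must be recompiled): list the functions still on x86_64, then switch one after benchmarking.

```python
import boto3

lam = boto3.client("lambda")

# Find functions still on the default x86_64 architecture.
for page in lam.get_paginator("list_functions").paginate():
    for fn in page["Functions"]:
        if fn.get("Architectures", ["x86_64"]) == ["x86_64"]:
            print(f"x86_64 candidate: {fn['FunctionName']} ({fn.get('Runtime', 'image')})")

# After benchmarking, redeploy one function on Graviton. The architecture is
# set alongside the code package, which must contain arm64-compatible binaries.
with open("function-arm64.zip", "rb") as f:  # placeholder package
    lam.update_function_code(
        FunctionName="my-function",  # placeholder
        Architectures=["arm64"],
        ZipFile=f.read(),
    )
```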
Many EC2 workloads—such as development environments, test jobs, stateless services, and data processing pipelines—can tolerate interruptions and do not require the reliability of On-Demand pricing. Using On-Demand instances in these scenarios drives up cost without adding value. Spot Instances offer significantly lower pricing and are well-suited to workloads that can handle restarts, retries, or fluctuations in capacity. Without evaluating workload tolerance and adjusting pricing models accordingly, organizations risk consistently overpaying for compute.
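As a minimal sketch, assuming standard boto3 credentials and placeholder AMI/instance type values, requesting a Spot Instance instead of On-Demand is a small change to the launch call; the workload itself must tolerate the two-minute interruption notice:

```python
import boto3

ec2 = boto3.client("ec2")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="m5.large",          # placeholder instance type
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            # One-time request; terminate rather than restart on interruption.
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)
```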
Databricks supports AWS Graviton-based instances for most workloads, including Spark jobs, data engineering pipelines, and interactive notebooks. These instances offer significant cost advantages over traditional x86-based VMs, with comparable or better performance in many cases. When teams default to legacy instance types, they miss an easy opportunity to reduce compute spend. Unless workloads have known compatibility issues or specialized requirements, Graviton should be the default instance family for Databricks clusters.
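A minimal sketch of creating a Graviton-backed cluster through the Databricks REST API follows; the node types and runtime version are assumptions, so check what your workspace actually offers (for example, via the clusters/list-node-types endpoint):

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "cluster_name": "etl-graviton",
        "spark_version": "14.3.x-scala2.12",  # assumption: an available LTS runtime
        "node_type_id": "m6gd.xlarge",         # Graviton workers (assumption)
        "driver_node_type_id": "m6gd.xlarge",  # Graviton driver (assumption)
        "num_workers": 2,
        "autotermination_minutes": 30,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```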
In Databricks, on-demand instances provide reliable capacity but come at a premium cost. For non-production workloads, such as development, testing, or exploratory analysis, high availability is often unnecessary. Spot instances provide equivalent performance at a substantially lower price, with the tradeoff of occasional interruptions. Teams that default to on-demand usage in lower environments incur unnecessary compute costs. Using compute policies to limit on-demand usage ensures greater consistency and efficiency across environments.
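One way to enforce this is a cluster policy that fixes worker availability to spot with on-demand fallback while keeping the driver on-demand. The sketch below assumes an AWS-backed workspace and placeholder names:

```python
import json
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

policy = {
    # Spot workers, falling back to on-demand if spot capacity is unavailable.
    "aws_attributes.availability": {"type": "fixed", "value": "SPOT_WITH_FALLBACK"},
    # Keep the first node (the driver) on-demand so interruptions don't kill sessions.
    "aws_attributes.first_on_demand": {"type": "fixed", "value": 1},
}

requests.post(
    f"{host}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json={"name": "nonprod-spot-policy", "definition": json.dumps(policy)},
    timeout=30,
).raise_for_status()
```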
Databricks users can select from a wide range of instance types for cluster driver and worker nodes. Without guardrails, teams may choose high-cost configurations (e.g., 16xlarge nodes) that exceed workload requirements. This results in inflated costs with little performance benefit. To reduce this risk, administrators can use compute policies to define acceptable node types and enforce size limits across the workspace.
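A sketch of such a policy follows, using the cluster policies API to allowlist a few vetted node types for both workers and drivers; the instance types shown are assumptions, not recommendations:

```python
import json
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

policy = {
    "node_type_id": {
        "type": "allowlist",
        "values": ["m6gd.large", "m6gd.xlarge", "m6gd.2xlarge"],  # placeholder sizes
        "defaultValue": "m6gd.large",
    },
    "driver_node_type_id": {
        "type": "allowlist",
        "values": ["m6gd.large", "m6gd.xlarge"],
        "defaultValue": "m6gd.large",
    },
}

requests.post(
    f"{host}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json={"name": "standard-node-sizes", "definition": json.dumps(policy)},
    timeout=30,
).raise_for_status()
```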
Interactive clusters are often left running between periods of active use. To mitigate idle charges, Databricks provides an auto-termination setting that shuts down clusters after a period of inactivity. However, if the termination window is set too long, or if policies do not enforce reasonable thresholds, idle clusters can persist for long stretches without performing any work, resulting in wasted compute spend. Lowering the termination window reduces exposure to idle time while preserving user flexibility.
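A cluster policy can make the lower threshold the default and cap the maximum. The sketch below assumes a 10–60 minute band with a 30-minute default; tune the bounds to how your users actually work:

```python
import json
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

policy = {
    "autotermination_minutes": {
        "type": "range",
        "minValue": 10,      # disallows 0 ("never terminate") and flapping values
        "maxValue": 60,      # caps idle exposure at one hour
        "defaultValue": 30,  # sensible default for new clusters
    }
}

requests.post(
    f"{host}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json={"name": "autotermination-cap", "definition": json.dumps(policy)},
    timeout=30,
).raise_for_status()
```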