CER-0317
Amazon Bedrock Provisioned Throughput allows teams to reserve dedicated inference capacity for foundation models by purchasing model units, billed hourly under a commitment term. The capacity is billed continuously, whether or not any tokens are actually processed, making it a fixed cost that pays off only when sustained, high-volume token consumption justifies the premium over on-demand pricing. In practice, teams frequently purchase Provisioned Throughput to avoid on-demand throttling limits, but actual usage often falls well below the committed capacity, resulting in significant overspend compared to what on-demand pricing would have cost for the same workload.
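The fixed-versus-variable comparison above can be sketched as a simple break-even calculation. This is an illustrative model only: the hourly rate, per-token prices, and token volumes below are hypothetical placeholders, not actual AWS prices.

```python
# Sketch: does a Provisioned Throughput reservation beat on-demand pricing
# for a given workload? All rates here are hypothetical placeholders; real
# prices come from the AWS pricing page or your account team.

HOURS_PER_MONTH = 730  # average hours in a month

def monthly_provisioned_cost(hourly_rate_per_unit: float, model_units: int) -> float:
    """Fixed monthly cost: billed every hour whether or not tokens flow."""
    return hourly_rate_per_unit * model_units * HOURS_PER_MONTH

def monthly_on_demand_cost(input_tokens: int, output_tokens: int,
                           price_per_1k_input: float,
                           price_per_1k_output: float) -> float:
    """Variable monthly cost: billed only for tokens actually processed."""
    return (input_tokens / 1000) * price_per_1k_input \
         + (output_tokens / 1000) * price_per_1k_output

# Hypothetical example: one model unit at $35/hour versus a workload of
# 500M input + 100M output tokens/month at $0.003 / $0.015 per 1K tokens.
provisioned = monthly_provisioned_cost(35.0, 1)   # roughly $25,550/month, fixed
on_demand = monthly_on_demand_cost(500_000_000, 100_000_000, 0.003, 0.015)
overspend = provisioned - on_demand               # the premium paid for capacity
```

Under these assumed numbers the workload would cost roughly $3,000/month on-demand, so the reservation overspends by more than $22,000/month, which is the pattern this rule flags.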
The waste is compounded because Provisioned Throughput commitments cannot be canceled before the term expires; billing continues hourly until the commitment period ends. A team that overestimates its inference needs at the time of purchase is therefore locked into paying for unused capacity for the full duration. The problem is especially common in early-stage AI deployments where usage patterns are not yet well understood, and in workloads with variable or unpredictable token volumes that are poorly suited to fixed-capacity reservations.
The cost impact can be substantial. A single model unit for even a moderately priced model can cost tens of thousands of dollars per month, and if actual token consumption would have cost only a fraction of that amount under on-demand pricing, the difference represents pure waste. Organizations running multiple Provisioned Throughput reservations across different models or environments can multiply this inefficiency significantly.
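The multiplication effect across reservations can be made concrete with a small aggregation sketch. The reservation names and dollar figures below are invented for illustration; in practice the inputs would come from billing data and an on-demand price calculation for the observed token volume.

```python
# Sketch: total waste across a fleet of Provisioned Throughput reservations.
# Each entry pairs a commitment's fixed monthly cost with what the actually
# observed token volume would have cost on-demand. All figures are invented.

reservations = [
    # (reservation name, provisioned $/month, on-demand-equivalent $/month)
    ("prod-us-east-1",    25_550.0, 3_000.0),
    ("staging-us-east-1", 25_550.0,   400.0),
    ("prod-eu-west-1",    51_100.0, 9_800.0),
]

def waste(provisioned_cost: float, on_demand_equivalent: float) -> float:
    """Waste is the premium paid over on-demand for the same token volume.
    Clamped at zero: a fully utilized reservation breaks even or better."""
    return max(0.0, provisioned_cost - on_demand_equivalent)

per_reservation = {name: waste(p, od) for name, p, od in reservations}
total_waste = sum(per_reservation.values())
```

With these illustrative figures, three reservations waste $89,000/month combined, showing how a single over-provisioned pattern repeated across models and environments scales the overspend.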
Amazon Bedrock offers two primary inference pricing modes:
- On-Demand: pay per input and output token processed, with no capacity commitment, subject to account-level throttling limits.
- Provisioned Throughput: purchase model units that reserve dedicated capacity, billed hourly for the full commitment term regardless of utilization.
Each model unit delivers a specific throughput level measured in input and output tokens per minute, though exact throughput specifications per model unit are not publicly documented; AWS directs customers to contact their account team for these details. Provisioned Throughput pricing for most models is likewise not listed publicly on the pricing page.
Additionally, batch inference is available at a 50% discount compared to on-demand pricing for select models, offering a middle ground for non-real-time workloads that do not require dedicated capacity.
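The batch discount is straightforward to model: same token volume, half the on-demand per-token price. The per-token prices and workload sizes below are hypothetical placeholders used only to show the relationship.

```python
# Sketch: batch inference as a middle ground. Batch is billed at 50% of
# on-demand token rates for supported models; token prices here are invented.

def on_demand_cost(tokens_in: int, tokens_out: int,
                   price_per_1k_in: float, price_per_1k_out: float) -> float:
    """On-demand cost for a month's token volume."""
    return tokens_in / 1000 * price_per_1k_in \
         + tokens_out / 1000 * price_per_1k_out

def batch_cost(tokens_in: int, tokens_out: int,
               price_per_1k_in: float, price_per_1k_out: float) -> float:
    """Batch cost: identical volume at half the per-token price."""
    return 0.5 * on_demand_cost(tokens_in, tokens_out,
                                price_per_1k_in, price_per_1k_out)

# Hypothetical 500M input / 100M output tokens/month at $0.003 / $0.015 per 1K:
od = on_demand_cost(500_000_000, 100_000_000, 0.003, 0.015)  # roughly $3,000
bt = batch_cost(500_000_000, 100_000_000, 0.003, 0.015)      # half of on-demand
```

For a non-real-time workload like this hypothetical one, batch halves the on-demand bill without any of the fixed-capacity commitment risk described above.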