Generative workloads that produce long outputs, such as detailed summaries, document rewrites, or multi-paragraph chat completions, keep the model generating for longer per request. The longer a request runs, the more likely it is to exceed a client-side timeout before the response completes.
Vertex AI generative models are billed per input and output token. When a timeout or long model latency triggers a retry, the request is sent and billed again: every attempt pays for the input tokens and any output tokens generated, even though only one response is ultimately used. Repeated inference calls therefore produce duplicate token charges that raise cost without delivering additional value.
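One way to bound this duplicate spend is to cap the number of attempts, retry only on clearly transient failures, and set a request timeout long enough for the expected output length. The sketch below is a minimal illustration of that pattern in Python; the `call_vertex()` helper and the `MAX_ATTEMPTS`, `REQUEST_TIMEOUT_S`, and backoff values are assumptions for this example, not Vertex AI SDK APIs.

```python
import logging
import random
import time

# Assumed tuning values for illustration; adjust to observed latencies.
MAX_ATTEMPTS = 2          # 1 original call + 1 retry caps duplicate charges
REQUEST_TIMEOUT_S = 120   # long enough for multi-paragraph outputs
BASE_BACKOFF_S = 2.0


class TransientError(Exception):
    """Stand-in for timeout / 429 / 503 style failures worth retrying."""


def call_vertex(prompt: str, timeout_s: float) -> str:
    """Hypothetical wrapper around a Vertex AI generate call.

    Replace the body with the real SDK request. Note that each invocation
    of this function is a billable inference call (input + output tokens).
    """
    raise NotImplementedError("wire up the Vertex AI SDK here")


def generate_with_bounded_retries(prompt: str) -> str:
    """Retry transient failures a bounded number of times with backoff."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return call_vertex(prompt, timeout_s=REQUEST_TIMEOUT_S)
        except TransientError:
            if attempt == MAX_ATTEMPTS:
                raise  # stop paying for further attempts
            # Exponential backoff with jitter before the next (billed) call.
            delay = BASE_BACKOFF_S * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            logging.warning("attempt %d failed; retrying in %.1fs", attempt, delay)
            time.sleep(delay)
```

Capping attempts at two bounds the worst-case spend per request at roughly twice the single-call token cost, which can be a reasonable trade-off for long-output workloads where unbounded retries would otherwise multiply charges.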