Excessive Lambda Retries (Retry Storms)

Liam Greenamyre

Service Category

Compute

Cloud Provider

AWS

Service Name

AWS Lambda

Inefficiency Type

Retry Misconfiguration

Explanation

Retry storms occur when a function fails and is retried automatically under Lambda's default retry behavior. Asynchronous invocations (e.g., from SNS or EventBridge) are retried twice by default, while event source mappings (e.g., SQS) redeliver failed messages until they succeed, expire, or are moved to a dead-letter queue. If the error is persistent and unhandled, retries can accumulate rapidly, often invisibly, creating a large volume of billable executions with no successful outcome. This is especially costly when functions run for extended durations or have high memory allocation.

Relevant Billing Model

* Each retry counts as a separate request
* Billed per millisecond of execution time for each retry
* Retries from failed invocations can multiply cost if not controlled, since every failed attempt is billed in full
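
The billing points above can be turned into a rough estimate. The sketch below is illustrative only: the function name `retry_cost` is hypothetical, and the default prices are assumed example rates that vary by region and architecture, so check them against the pricing page before relying on the numbers.

```python
def retry_cost(attempts, duration_ms, memory_mb,
               price_per_gb_second=0.0000166667,    # assumed example rate
               price_per_request=0.20 / 1_000_000):  # assumed example rate
    """Estimate the cost of `attempts` billable executions of one event.

    Each attempt is billed as one request plus its duration in GB-seconds.
    """
    gb_seconds = (duration_ms / 1000) * (memory_mb / 1024)
    return attempts * (gb_seconds * price_per_gb_second + price_per_request)


# One failing event attempted 3 times (initial try + 2 retries),
# each attempt running 30 s at 2048 MB before it fails.
per_event = retry_cost(attempts=3, duration_ms=30_000, memory_mb=2048)
per_100k_events = per_event * 100_000
print(f"per event: ${per_event:.6f}; per 100k failing events: ${per_100k_events:.2f}")
```

Because cost scales linearly with the number of attempts, lowering the retry limit or failing fast on known-bad input reduces the waste proportionally.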

Detection

Check for high error-to-success ratios in CloudWatch Metrics
Investigate spikes in invocation count after initial failure events
Correlate retries with common payloads and error messages
Review configurations for MaximumRetryAttempts and DLQs
Assess the presence of throttling, downstream timeouts, or missing exception handling
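
As a starting point for the first two checks, the error-to-invocation ratio can be computed from the `Invocations` and `Errors` metrics in the `AWS/Lambda` namespace. This is a minimal sketch, assuming `cloudwatch` is a boto3 CloudWatch client supplied by the caller; the helper names, the 50% threshold, and the one-hour period are hypothetical choices, not recommended values.

```python
from datetime import datetime, timedelta, timezone


def metric_sums(cloudwatch, function_name, metric_name, hours=24, period=3600):
    """Return {timestamp: sum} for one AWS/Lambda metric over the last `hours`."""
    end = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName=metric_name,
        Dimensions=[{"Name": "FunctionName", "Value": function_name}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=period,
        Statistics=["Sum"],
    )
    return {dp["Timestamp"]: dp["Sum"] for dp in resp["Datapoints"]}


def high_error_windows(invocations, errors, threshold=0.5):
    """List (timestamp, error_ratio) for windows whose ratio is >= threshold."""
    flagged = []
    for ts, count in sorted(invocations.items()):
        if count <= 0:
            continue  # no invocations in this window, nothing to compare
        ratio = errors.get(ts, 0) / count
        if ratio >= threshold:
            flagged.append((ts, ratio))
    return flagged


# Usage (requires boto3 and AWS credentials; "my-function" is a placeholder):
#   import boto3
#   cw = boto3.client("cloudwatch")
#   inv = metric_sums(cw, "my-function", "Invocations")
#   err = metric_sums(cw, "my-function", "Errors")
#   print(high_error_windows(inv, err))
```

Windows flagged here are candidates for the payload and error-message correlation described above; a high ratio alone does not prove the errors come from retries.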

Remediation

Configure DLQs to isolate and inspect failed invocations
Implement exponential backoff or circuit breaker patterns in retry logic
Set appropriate retry limits (e.g., MaximumRetryAttempts) on asynchronous invocation settings and event source mappings
Improve exception handling and idempotency in function code
Utilize anomaly detection software to alert on abnormal retry patterns
Monitor CloudWatch for retry-specific metrics and alarms
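
Two of the remediation steps can be sketched briefly: capped exponential backoff for retries issued from application code, and a retry limit with an on-failure destination for asynchronous invocations. The helper names `backoff_delay` and `async_retry_kwargs` are hypothetical. The second helper only builds arguments intended for boto3's `put_function_event_invoke_config`, and its defaults are assumptions rather than recommended values.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, jitter=True):
    """Seconds to wait before retry number `attempt` (1-based), capped at `cap`.

    With jitter, the delay is drawn uniformly from [0, bound] so that many
    failing callers do not retry in lockstep.
    """
    bound = min(cap, base * 2 ** (attempt - 1))
    return random.uniform(0, bound) if jitter else bound


def async_retry_kwargs(function_name, max_retries=0, max_event_age_s=3600,
                       on_failure_arn=None):
    """Build kwargs for the Lambda client's put_function_event_invoke_config."""
    if not 0 <= max_retries <= 2:
        raise ValueError("asynchronous invocations allow 0 to 2 retries")
    if not 60 <= max_event_age_s <= 21600:
        raise ValueError("event age must be between 60 and 21600 seconds")
    kwargs = {
        "FunctionName": function_name,
        "MaximumRetryAttempts": max_retries,
        "MaximumEventAgeInSeconds": max_event_age_s,
    }
    if on_failure_arn:
        # Failed events are sent here (e.g., an SQS queue) for inspection.
        kwargs["DestinationConfig"] = {"OnFailure": {"Destination": on_failure_arn}}
    return kwargs


# Usage (requires boto3 and AWS credentials; names and ARN are placeholders):
#   import boto3
#   boto3.client("lambda").put_function_event_invoke_config(
#       **async_retry_kwargs("my-function", max_retries=1,
#                            on_failure_arn="arn:aws:sqs:...:my-dlq"))
```

For SQS-triggered functions, the retry ceiling is set on the queue itself through its redrive policy (`maxReceiveCount`), so the asynchronous settings above do not apply to that path.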

Relevant Documentation

https://docs.aws.amazon.com/lambda/latest/dg/invocation-retries.html
https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html#invocation-dlq
https://docs.aws.amazon.com/lambda/latest/dg/invocation-eventsourcemapping.html#invocation-eventsourcemapping-retries
https://aws.amazon.com/lambda/pricing/
