Inefficient File Format and Layout for Athena Queries
Erik Marke
Service Category
Compute
Cloud Provider
AWS
Service Name
AWS Athena
Inefficiency Type
Suboptimal Data Layout or Format
Explanation

Storing raw JSON or CSV files in S3, especially when they are written frequently in small batches, leads to excessive scan costs in Athena. These formats are row-based and verbose, so Athena must read and parse the full content of each file even when a query references only a few fields. Large numbers of small files add per-file listing and open overhead on top of that. Without columnar formats, partitioning, or metadata-aware table formats, queries become slow and expensive, especially in high-volume environments.

Relevant Billing Model

Pay-per-scan: Athena charges for the amount of data each query scans, not for the size of the result set. This makes storage format, partitioning, and file layout the critical cost drivers.
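
To make the billing model concrete, the following is a rough sketch of how scan volume translates into cost. The price per terabyte, the 10 MB per-query minimum, and the binary definition of a terabyte are assumptions used for illustration; check the Athena pricing page for the values that apply to your region. The function name `estimate_query_cost` is hypothetical.

```python
# Rough per-query cost estimate under assumed pricing (not authoritative).
PRICE_PER_TB_USD = 5.00               # assumption: verify for your region
MIN_BYTES_PER_QUERY = 10 * 1024**2    # assumption: 10 MB minimum per query
BYTES_PER_TB = 1024**4                # assumption: binary terabyte


def estimate_query_cost(bytes_scanned: int,
                        price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Return the estimated USD cost of one query that scans `bytes_scanned`."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    billable = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billable / BYTES_PER_TB * price_per_tb


if __name__ == "__main__":
    full_json_scan = 1024**4          # query reads 1 TB of raw JSON
    pruned_parquet = 50 * 1024**3     # same query reads 50 GB of needed columns
    print(f"raw JSON:  ${estimate_query_cost(full_json_scan):.4f}")
    print(f"Parquet:   ${estimate_query_cost(pruned_parquet):.4f}")
```

The two figures differ only because of bytes scanned, which is why the same query can cost far less once the data is columnar and partitioned.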

Detection
  • Review whether data stored in S3 is in raw formats like JSON or CSV
  • Check if files are written at high frequency, resulting in large numbers of small files
  • Evaluate whether data is partitioned by commonly filtered fields (e.g., date, region)
  • Determine if users repeatedly query overlapping data ranges without partition pruning
  • Assess whether queries experience slow performance or high latency for recent data
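
Several of the checks above can be approximated from an S3 object listing. The sketch below assumes you already have `(key, size_in_bytes)` pairs, for example from an S3 inventory report or a listing call. The 128 MB small-file threshold, the suffix list, and the names `inspect_layout` and `LayoutReport` are illustrative assumptions, not values prescribed by Athena.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Tuple

RAW_SUFFIXES = (".json", ".csv", ".json.gz", ".csv.gz")  # assumed raw formats
SMALL_FILE_BYTES = 128 * 1024**2                         # assumed threshold
# Hive-style partition segment such as "dt=2024-01-01/" inside the key.
HIVE_PARTITION = re.compile(r"(^|/)[^/=]+=[^/]+/")


@dataclass
class LayoutReport:
    total_files: int = 0
    raw_format_files: int = 0
    small_files: int = 0
    partitioned_files: int = 0


def inspect_layout(objects: Iterable[Tuple[str, int]]) -> LayoutReport:
    """Count raw-format, small, and Hive-partitioned objects in a listing."""
    report = LayoutReport()
    for key, size in objects:
        report.total_files += 1
        if key.lower().endswith(RAW_SUFFIXES):
            report.raw_format_files += 1
        if size < SMALL_FILE_BYTES:
            report.small_files += 1
        if HIVE_PARTITION.search(key):
            report.partitioned_files += 1
    return report


if __name__ == "__main__":
    listing = [
        ("events/dt=2024-01-01/part-0001.json", 40_000),
        ("events/dt=2024-01-01/part-0002.json", 55_000),
        ("events/archive.parquet", 300 * 1024**2),
    ]
    print(inspect_layout(listing))
```

A high share of raw-format or small files, or a low share of partitioned keys, suggests the dataset is a candidate for the remediation steps that follow.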
Remediation
  • Convert raw data to columnar formats such as Parquet or ORC to reduce scan size
  • Partition data based on common query dimensions (e.g., date, tenant ID)
  • Consolidate small files into larger batches to improve scan efficiency
  • Adopt table formats like Apache Iceberg, Delta Lake, or Apache Hudi for metadata support and schema evolution
  • Stream data into these formats using structured pipelines to reduce latency and support efficient querying
  • Refactor query patterns or create materialized views to reduce redundant scans
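
One way to apply the first three steps together is an Athena CREATE TABLE AS SELECT (CTAS) statement that rewrites a raw table as partitioned Parquet. The sketch below only builds the SQL text; running it (through the console or an SDK) is left out. The helper name `build_ctas` and every table, column, and bucket name in the usage example are hypothetical. Identifiers are interpolated without escaping, so this assumes trusted input.

```python
from typing import Sequence


def build_ctas(source_table: str,
               target_table: str,
               external_location: str,
               columns: Sequence[str],
               partition_columns: Sequence[str]) -> str:
    """Build a CTAS statement that writes partitioned Parquet.

    Partition columns are placed last in the SELECT list, which is what
    Athena expects for `partitioned_by`.
    """
    if not partition_columns:
        raise ValueError("at least one partition column is required")
    select_list = ", ".join(list(columns) + list(partition_columns))
    partitions = ", ".join(f"'{name}'" for name in partition_columns)
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (\n"
        f"  format = 'PARQUET',\n"
        f"  external_location = '{external_location}',\n"
        f"  partitioned_by = ARRAY[{partitions}]\n"
        f")\n"
        f"AS SELECT {select_list}\n"
        f"FROM {source_table}"
    )


if __name__ == "__main__":
    # All names below are placeholders for illustration.
    print(build_ctas(
        source_table="raw_events_json",
        target_table="events_parquet",
        external_location="s3://example-bucket/events_parquet/",
        columns=["event_id", "tenant_id", "payload"],
        partition_columns=["dt"],
    ))
```

Because CTAS writes new, larger files, it also consolidates small files as a side effect. Athena limits how many partitions a single CTAS query can write, so large backfills may need to be split into several statements.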