Inefficient File Format and Layout for Athena Queries

Erik Marke

Service Category

Compute

Cloud Provider

AWS

Service Name

Amazon Athena

Inefficiency Type

Suboptimal Data Layout or Format

Explanation

Storing raw JSON or CSV files in S3, especially when they are written frequently in small batches, leads to excessive scan costs in Athena. These formats are row-based and verbose, so Athena must read and parse every file in full even when a query touches only a few fields. Large numbers of small files add per-file listing and open overhead on top of that. Without columnar formats, partitioning, or metadata-aware table formats, queries become inefficient and expensive, especially in high-volume environments.

Relevant Billing Model

Pay-per-scan: Athena charges for the amount of data each query scans, not for the size of the result set. Storage format, partitioning, and file layout are therefore the main cost drivers.
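
To make the billing model concrete, the following is a minimal sketch of a per-query cost estimate. The rate of $5.00 per TB and the 10 MB per-query minimum are assumptions based on commonly published list pricing and should be checked against the Amazon Athena Pricing page for your region. The function name, the treatment of 1 TB as 2^40 bytes, and the example data sizes are hypothetical.

```python
# Assumed list price; confirm against the Athena pricing page for your region.
PRICE_PER_TB_USD = 5.00
# Assumption: 1 TB is treated as 2**40 bytes for this estimate.
BYTES_PER_TB = 1024 ** 4
# Assumed minimum billable scan per query (10 MB).
MIN_BYTES_PER_QUERY = 10 * 1024 ** 2


def estimate_query_cost(bytes_scanned: int,
                        price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Return the estimated charge in USD for a single query."""
    # Queries that scan less than the minimum are billed at the minimum.
    billable = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billable / BYTES_PER_TB * price_per_tb


if __name__ == "__main__":
    # Hypothetical sizes: a full scan of raw JSON versus the same query
    # against a partitioned, columnar copy that reads far fewer bytes.
    raw_json_scan = 200 * 1024 ** 3
    parquet_scan = 4 * 1024 ** 3
    print(f"raw JSON scan: ${estimate_query_cost(raw_json_scan):.4f}")
    print(f"Parquet scan:  ${estimate_query_cost(parquet_scan):.4f}")
```

Because the charge scales linearly with bytes scanned, any layout change that cuts the scan by a given factor cuts the cost of that query by roughly the same factor.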

Detection

Review whether data stored in S3 is in raw formats like JSON or CSV
Check if files are written at high frequency, resulting in large numbers of small files
Evaluate whether data is partitioned by commonly filtered fields (e.g., date, region)
Determine if users repeatedly query overlapping data ranges without partition pruning
Assess whether queries experience slow performance or high latency for recent data
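
The first three checks above can be approximated from an S3 object listing. The sketch below works on plain (key, size) pairs, which could come from an S3 Inventory report or a bucket listing. The file extensions, the 128 MiB small-file threshold, and all names are assumptions chosen for illustration, not fixed rules.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

# Assumed naming convention for raw, row-based files.
RAW_EXTENSIONS = (".json", ".csv", ".json.gz", ".csv.gz")
# Assumed threshold below which a file counts as "small".
SMALL_FILE_BYTES = 128 * 1024 ** 2


@dataclass
class LayoutReport:
    total_files: int
    raw_format_files: int
    small_files: int
    hive_partitioned_files: int


def summarize_layout(objects: Iterable[Tuple[str, int]]) -> LayoutReport:
    """Count raw-format, small, and Hive-partitioned objects in a listing."""
    total = raw = small = partitioned = 0
    for key, size in objects:
        total += 1
        if key.lower().endswith(RAW_EXTENSIONS):
            raw += 1
        if size < SMALL_FILE_BYTES:
            small += 1
        # Hive-style partitioning uses key=value segments in the prefix,
        # for example "dt=2024-01-01/".
        if any("=" in part for part in key.split("/")[:-1]):
            partitioned += 1
    return LayoutReport(total, raw, small, partitioned)


# Hypothetical listing: two tiny JSON batches and one partitioned Parquet file.
listing = [
    ("events/2024/01/01/batch-0001.json", 40_000),
    ("events/2024/01/01/batch-0002.json", 52_000),
    ("curated/dt=2024-01-01/part-0000.parquet", 256 * 1024 ** 2),
]
report = summarize_layout(listing)
print(report)
```

A high share of raw-format or small files, combined with few partitioned keys, suggests the dataset is a candidate for the remediation steps that follow.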

Remediation

Convert raw data to columnar formats such as Parquet or ORC to reduce scan size
Partition data based on common query dimensions (e.g., date, tenant ID)
Consolidate small files into larger batches to improve scan efficiency
Adopt table formats like Apache Iceberg, Delta Lake, or Hudi for metadata support and schema evolution
Stream data into these formats using structured pipelines to reduce latency and support efficient querying
Refactor query patterns or precompute frequently used results (for example, summary tables or materialized views where available) to reduce redundant scans
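
One way to apply the first two steps together is an Athena CREATE TABLE AS SELECT (CTAS) statement that rewrites a raw table as partitioned Parquet. The sketch below only builds the SQL text. The helper name, the table and column names, and the S3 location are hypothetical, and the inputs are not sanitized, so it should only be used with trusted identifiers.

```python
from typing import Sequence


def build_ctas(source_table: str,
               target_table: str,
               columns: Sequence[str],
               partition_columns: Sequence[str],
               external_location: str) -> str:
    """Build an Athena CTAS statement that writes partitioned Parquet."""
    # Athena expects partition columns to come last in the SELECT list.
    select_list = ", ".join(list(columns) + list(partition_columns))
    partitions = ", ".join(f"'{c}'" for c in partition_columns)
    return (
        f"CREATE TABLE {target_table}\n"
        "WITH (\n"
        "  format = 'PARQUET',\n"
        "  write_compression = 'SNAPPY',\n"
        f"  external_location = '{external_location}',\n"
        f"  partitioned_by = ARRAY[{partitions}]\n"
        ")\n"
        "AS\n"
        f"SELECT {select_list}\n"
        f"FROM {source_table}"
    )


# Hypothetical tables and bucket, used only for illustration.
sql = build_ctas(
    source_table="raw_events_json",
    target_table="events_parquet",
    columns=["event_id", "tenant_id", "payload"],
    partition_columns=["dt"],
    external_location="s3://example-bucket/curated/events/",
)
print(sql)
```

A single CTAS query can write only a limited number of partitions, so a large backfill is usually split into several runs over bounded date ranges. The same rewrite also consolidates small files as a side effect, because Athena writes fewer, larger output files per partition.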

Relevant Documentation

Best Practices for Amazon Athena
Using Parquet with Amazon Athena
Amazon Athena Pricing
