This guide walks through setting up a “send a copy to S3, query it later in Splunk when needed” flow. The shape is identical to the Datadog Logs Rehydration pattern: the Sawmills collector dual-routes logs — live traffic continues to your Splunk HEC destination for alerting and dashboards, while a full-fidelity copy lands in your own S3 bucket as hourly-partitioned .json.gz. When you need older data in Splunk, the Splunk Add-on for AWS pulls those S3 objects back into a Splunk index.
This is the simplest way to support Splunk customers who want to reduce live-ingest cost without losing access to historical logs.
Sawmills writes the same archive format that Datadog’s Logs Rehydration consumes (newline-delimited JSON, gzipped, hourly partitions). The format works with Splunk Cloud and Splunk Enterprise; the only thing that changes per-customer is which Splunk read mechanism you wire up at the end.
Prerequisites
- A Sawmills pipeline already shipping to a Splunk HEC destination.
- An AWS account where you can create an S3 bucket and an IAM user.
- Splunk Cloud or Splunk Enterprise admin access.
1. Add the AWS S3 destination in Sawmills
This is the archive side — Sawmills will write a copy of every log record to your S3 bucket alongside the live Splunk feed.
- Open your pipeline in Sawmills and click + Add Destination.
- Pick AWS S3.
- Fill in the form:
| Field | Value |
|---|---|
| Name | e.g., splunk-rehydration-archive |
| Region | The AWS region of your bucket (e.g., us-east-1) |
| S3 Bucket | The bucket name you created (e.g., acme-splunk-archive) |
| S3 Prefix | A path inside the bucket, e.g., /splunk/archive |
| Role ARN | Leave blank for static IAM keys, or set if you use cross-account assume-role |
| Output Format | NDJSON (.json.gz) |
When you pick Output Format = NDJSON (.json.gz), Sawmills automatically:
- Writes one JSON object per line (newline-delimited JSON), gzipped — the format Splunk’s Generic S3 input handles cleanly with no parsing config.
- Partitions objects by hour with a key like dt=YYYYMMDD/hour=HH/archive_HHMMSS.NNNN.<random>.json.gz.
- Locks Enabled Data Types to Logs only.
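To make the archive layout concrete, here is a minimal Python sketch of the format described above: gzipped newline-delimited JSON under an hourly `dt=YYYYMMDD/hour=HH/` prefix. The function names and sample records are hypothetical and only illustrate the layout; this is not Sawmills' actual writer. The sketch assumes partitions are keyed on UTC, and it leaves out the file-name part of the key (`archive_HHMMSS.NNNN.<random>`) because its exact generation is not specified here.

```python
import gzip
import json
from datetime import datetime, timezone


def partition_prefix(ts: datetime) -> str:
    """Return the hourly partition prefix for a timezone-aware timestamp.

    Assumption: partitions are keyed on UTC.
    """
    ts = ts.astimezone(timezone.utc)
    return ts.strftime("dt=%Y%m%d/hour=%H/")


def encode_ndjson_gz(records: list[dict]) -> bytes:
    """Serialize records as one JSON object per line, then gzip the result."""
    body = "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)
    return gzip.compress(body.encode("utf-8"))


def decode_ndjson_gz(blob: bytes) -> list[dict]:
    """Inverse of encode_ndjson_gz: gunzip and parse each non-empty line."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Because every line is a complete JSON document, a reader can process an object line by line without loading the whole file into a JSON parser.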
For the AWS credentials the collector uses to write to S3, follow the credentials section in the AWS S3 destination reference.
- Click Save, then deploy the pipeline.
After deploy, list the bucket to confirm objects are arriving:
aws s3 ls s3://acme-splunk-archive/splunk/archive/dt=$(date -u +%Y%m%d)/ --recursive
You should see hourly-partitioned archive_*.json.gz files appearing within a minute or two.
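If you want more than a visual check, the listing can be summarized per hour. The sketch below is one possible approach, assuming the usual `aws s3 ls --recursive` line layout (date, time, size, key) and the key pattern shown earlier; `objects_per_hour` and `missing_hours` are hypothetical helper names, not part of any Sawmills or AWS tooling.

```python
import re
from collections import Counter

# Matches the hourly partition inside an object key, for example
# "splunk/archive/dt=20260504/hour=13/archive_130709.0001.ab12.json.gz".
_PARTITION = re.compile(r"dt=(\d{8})/hour=(\d{2})/[^/]+\.json\.gz$")


def objects_per_hour(ls_output: str) -> Counter:
    """Count archive objects per (dt, hour) from `aws s3 ls --recursive` output."""
    counts: Counter = Counter()
    for line in ls_output.splitlines():
        match = _PARTITION.search(line.strip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


def missing_hours(counts: Counter, dt: str, through_hour: int) -> list[str]:
    """Hours 00..through_hour of day `dt` that have no archive object yet."""
    return [
        f"{h:02d}"
        for h in range(through_hour + 1)
        if (dt, f"{h:02d}") not in counts
    ]
```

To use it, capture the output of the `aws s3 ls` command into a string (for example via `sys.stdin.read()`) and pass it to `objects_per_hour`. An hour with no objects may simply mean no logs flowed during that hour, so treat gaps as a prompt to investigate rather than proof of a fault.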
2. Create an IAM user for Splunk to read the archive
Splunk needs read access to the S3 bucket. The Splunk Add-on for AWS only accepts long-term IAM access keys (not STS temporary credentials), so create a dedicated IAM user with read-only access.
The policy below grants Splunk just enough to (a) discover the bucket in its UI dropdown and (b) read objects from your archive bucket — nothing else.
aws iam create-user --user-name splunk-archive-reader
aws iam put-user-policy --user-name splunk-archive-reader \
  --policy-name S3ArchiveReadOnly \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Sid": "ListBucketsForSplunkUI",
        "Effect": "Allow",
        "Action": ["s3:ListAllMyBuckets", "s3:GetBucketLocation"],
        "Resource": "*"
      },
      {
        "Sid": "ReadArchiveBucket",
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
          "arn:aws:s3:::acme-splunk-archive",
          "arn:aws:s3:::acme-splunk-archive/*"
        ]
      }
    ]
  }'
aws iam create-access-key --user-name splunk-archive-reader
The last command prints AccessKeyId (starts with AKIA…) and SecretAccessKey. Save both — AWS only shows the secret once. Store them in a secret manager.
The ListAllMyBuckets permission lets Splunk see every bucket name in the account, but data access is still scoped to the archive bucket only. If that’s not acceptable, you can omit ListAllMyBuckets and type the bucket name manually in the Splunk input form instead of selecting it from the dropdown.
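If you manage several archive buckets, or want the stricter variant without ListAllMyBuckets, generating the policy document can be less error-prone than editing JSON by hand. The following is a sketch under the assumption that the policy shape above is what you want; `build_read_policy` and its `allow_list_all_buckets` flag are hypothetical names introduced for illustration.

```python
import json


def build_read_policy(bucket: str, allow_list_all_buckets: bool = True) -> dict:
    """Build the read-only policy from step 2 for a given bucket name.

    With allow_list_all_buckets=False, s3:ListAllMyBuckets is left out, so the
    bucket name has to be typed manually in the Splunk input form.
    """
    discovery_actions = ["s3:GetBucketLocation"]
    if allow_list_all_buckets:
        discovery_actions.insert(0, "s3:ListAllMyBuckets")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListBucketsForSplunkUI",
                "Effect": "Allow",
                "Action": discovery_actions,
                "Resource": "*",
            },
            {
                "Sid": "ReadArchiveBucket",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            },
        ],
    }


def policy_json(bucket: str, allow_list_all_buckets: bool = True) -> str:
    """Render the policy as indented JSON, ready to write to a file."""
    return json.dumps(build_read_policy(bucket, allow_list_all_buckets), indent=2)
```

Writing the output of `policy_json("acme-splunk-archive")` to `policy.json` lets you pass it to the CLI as `--policy-document file://policy.json` instead of an inline string.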
3. Install the Splunk Add-on for AWS
In Splunk:
- Go to Apps → Find more apps.
- Search for Splunk Add-on for Amazon Web Services.
- Click Install and follow the prompts. (On Splunk Cloud you may need to log in with your splunk.com account when prompted.)
4. Add the AWS account in the add-on
- Go to Apps → Splunk Add-on for AWS → Configuration → Account → Add.
- Fill in:
| Field | Value |
|---|---|
| Name | A label (e.g., acme-archive) |
| Key ID | The AKIA… access key from step 2 |
| Secret Key | The matching secret access key |
| Region Category | Global (unless your bucket is in GovCloud or China) |
- Click Add. Splunk validates the credentials before saving.
5. Create the Generic S3 input
This is the read side: Splunk crawls the archive bucket and ingests files into a chosen index.
- Go to Apps → Splunk Add-on for AWS → Inputs → Create New Input → Custom Data Type → Generic S3.
- Fill in:
| Field | Value |
|---|---|
| Name | e.g., splunk-rehydration-archive |
| AWS Account | The account you added in step 4 |
| Assume Role | Leave empty |
| AWS Region | Your bucket’s region |
| S3 Bucket | The archive bucket (e.g., acme-splunk-archive) |
| S3 Key Prefix | A narrow prefix for the first run, e.g., splunk/archive/dt=20260504/ — broaden later |
| Polling Interval | 1800 seconds (the add-on's default) |
| Initial Scan Datetime | A timestamp slightly before the oldest object you want to ingest, e.g., 2026-05-04T00:00:00Z |
| Sourcetype | _json (auto-extracts JSON fields with no parsing config) |
| Index | The Splunk index to ingest into (e.g., main) |
- Click Add. The input begins polling; the first scan typically starts within a minute.
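The S3 Key Prefix and Initial Scan Datetime values for a first run both follow from the day you want to rehydrate. A small sketch, assuming the `dt=YYYYMMDD` layout shown earlier and UTC partitions; `first_run_settings`, `base_prefix`, and `lead` are hypothetical names, and the returned keys simply mirror the form labels.

```python
from datetime import date, datetime, time, timedelta, timezone


def first_run_settings(
    day: date,
    base_prefix: str = "splunk/archive",
    lead: timedelta = timedelta(0),
) -> dict:
    """Derive a narrow key prefix and scan start for one day of archive data.

    `lead` moves the scan start earlier than midnight UTC, for cases where you
    want extra margin before the oldest object.
    """
    prefix = f"{base_prefix.strip('/')}/dt={day:%Y%m%d}/"
    start = datetime.combine(day, time.min, tzinfo=timezone.utc) - lead
    return {
        "S3 Key Prefix": prefix,
        "Initial Scan Datetime": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

For example, `first_run_settings(date(2026, 5, 4))` yields the prefix `splunk/archive/dt=20260504/` and the scan datetime `2026-05-04T00:00:00Z`, matching the values in the table above.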
6. Verify rehydration in Splunk Search
Open Search & Reporting and run:
index=main sourcetype=_json earliest=-1h
| head 20
Within a couple of minutes you should see records with auto-extracted fields: date, service, host, message, tags{}, and any attributes.* keys you had on the originating logs. To confirm a specific service or host:
index=main sourcetype=_json service="my-service" earliest=-1h
| stats count by host
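To cross-check the Splunk numbers against the source, you can compute the same per-host count directly from an archive object you have downloaded. This is a rough sketch, assuming `service` and `host` are top-level keys in each archived record (as the extracted field names suggest); `count_by_host` is a hypothetical helper.

```python
import gzip
import json
from collections import Counter
from typing import Optional


def count_by_host(blob: bytes, service: Optional[str] = None) -> Counter:
    """Count records per host in one gzipped NDJSON archive object.

    If `service` is given, only records whose "service" field matches are counted.
    """
    counts: Counter = Counter()
    for line in gzip.decompress(blob).decode("utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if service is not None and record.get("service") != service:
            continue
        counts[record.get("host", "<missing>")] += 1
    return counts
```

Read the object with `open(path, "rb").read()` and pass the bytes in. The totals only line up with the search results if the search covers the same objects, so widen or narrow the time range in the search accordingly.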
If you see zero events:
- Check Apps → Splunk Add-on for AWS → Inputs, click the input name, and look for status / errors.
- Search the add-on’s internal logs:
index=_internal source=*splunk_ta_aws* log_level=ERROR earliest=-1h
- Verify the IAM keys are still valid by checking the Configuration → Account tab.
Other ways to read S3 archives from Splunk
The Generic S3 input is the most universal option — it works on every Splunk tier, needs no special licensing, and accepts the format Sawmills writes. The trade-off is that it physically ingests S3 objects into a Splunk index, counting against your daily ingest license.
A few alternatives, in case they fit your environment better:
- Federated Search for Amazon S3 (FSS3): Splunk Cloud on AWS only. Queries the archive in place via an AWS Glue table and never ingests it. Closest UX to Datadog Logs Rehydration. Requires a Data Scan Unit license entitlement and a Glue table over the bucket. Best for ad-hoc forensic searches over large archives where you don’t want to pay for ingest.
- Splunk Cloud Data Manager → “Promote data from AWS S3”: Splunk Cloud on AWS only. A first-party “true rehydration” feature that ingests a chosen prefix and time range into a dedicated infinite-retention index in one click. Per-job, not continuous.
- SQS-based S3 input: same Splunk Add-on for AWS, different input type. Replaces polling with S3 → SNS → SQS event notifications, giving you live-tail latency on new objects. Splunk publishes an open-source crawler that emits synthetic SQS messages so you can backfill historical objects through the same path. Recommended over the Generic S3 input for ongoing tail at scale.
- AWS Lambda → Splunk HEC: for environments where neither the Splunk Add-on nor FSS3 is available. An S3 → SQS → Lambda function chain POSTs to the HEC endpoint. This option leaves you owning the most code, but it works against any Splunk instance that exposes HEC.
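For the Lambda → HEC option, the core of the function is turning one archive object into an HEC request. The sketch below covers only that step, assuming the HEC event endpoint (`/services/collector/event`) with `Authorization: Splunk <token>` and batched events sent as consecutive JSON objects. Fetching the object from S3, reading the SQS message, and actually sending the request are left out, and the function names, URL, and token are placeholders.

```python
import gzip
import json
import urllib.request


def build_hec_body(blob: bytes, index: str, sourcetype: str = "_json") -> bytes:
    """Wrap each archived record in an HEC event envelope, one envelope per line."""
    events = []
    for line in gzip.decompress(blob).decode("utf-8").splitlines():
        if line.strip():
            envelope = {
                "event": json.loads(line),
                "index": index,
                "sourcetype": sourcetype,
            }
            events.append(json.dumps(envelope))
    return "\n".join(events).encode("utf-8")


def build_hec_request(hec_url: str, token: str, body: bytes) -> urllib.request.Request:
    """Prepare (but do not send) the POST request for the HEC event endpoint."""
    return urllib.request.Request(
        hec_url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Splunk {token}",
            "Content-Type": "application/json",
        },
    )
```

A real handler would pass the prepared request to `urllib.request.urlopen` and would likely split large objects into several smaller batches. It would also need to set the event time explicitly if the original timestamps must be preserved.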
Pick the Generic S3 input first to validate the pipeline end-to-end. You can always layer FSS3 or SQS-based input on top of the same archive bucket later — the Sawmills-side configuration doesn’t change.