This guide walks through setting up a “send a copy to S3, query it later in Splunk when needed” flow. The shape is identical to the Datadog Logs Rehydration pattern: the Sawmills collector dual-routes logs — live traffic continues to your Splunk HEC destination for alerting and dashboards, while a full-fidelity copy lands in your own S3 bucket as hourly-partitioned .json.gz objects. When you need older data in Splunk, the Splunk Add-on for AWS pulls those S3 objects back into a Splunk index.
This is the simplest way to support Splunk customers who want to reduce live-ingest cost without losing access to historical logs.
Sawmills writes the same archive format that Datadog’s Logs Rehydration consumes (newline-delimited JSON, gzipped, hourly partitions). The format works with Splunk Cloud and Splunk Enterprise; the only thing that changes per-customer is which Splunk read mechanism you wire up at the end.
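For orientation, each line in an archive object is one JSON log record. The example below is purely illustrative; the real set of tags and attributes.* keys depends on what your pipeline sends:

```json
{"date":"2026-05-04T14:03:27Z","service":"checkout","host":"ip-10-0-3-17","message":"payment authorized","tags":["env:prod"],"attributes":{"http.status_code":200,"duration_ms":143}}
```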
Prerequisites
- A Sawmills pipeline already shipping to a Splunk HEC destination.
- An AWS account where you can create an S3 bucket and an IAM user.
- Splunk Cloud or Splunk Enterprise admin access.
1. Add the AWS S3 destination in Sawmills
This is the archive side — Sawmills will write a copy of every log record to your S3 bucket alongside the live Splunk feed.
- Open your pipeline in Sawmills and click + Add Destination.
- Pick AWS S3.
- Fill in the form:
| Field | Value |
|---|---|
| Name | e.g., splunk-rehydration-archive |
| Region | The AWS region of your bucket (e.g., us-east-1) |
| S3 Bucket | The bucket name you created (e.g., acme-splunk-archive) |
| S3 Prefix | A path inside the bucket, e.g., /splunk/archive |
| Role ARN | Leave blank for static IAM keys, or set if you use cross-account assume-role |
| Output Format | NDJSON (.json.gz) |
With Output Format set to NDJSON (.json.gz), Sawmills automatically:
- Writes one JSON object per line (newline-delimited JSON), gzipped — the format Splunk’s Generic S3 input handles cleanly with no parsing config.
- Partitions objects by hour with a key like dt=YYYYMMDD/hour=HH/archive_HHMMSS.NNNN.<random>.json.gz.
- Locks Enabled Data Types to Logs only.
- Click Save, then deploy the pipeline.
Once the pipeline is deployed, check the bucket under your prefix: you should see archive_*.json.gz files appearing within a minute or two.
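If you have the AWS CLI available, a quick spot check (the bucket and prefix are the example values from the table above; substitute your own):

```bash
# List the most recent archive objects under the configured prefix
aws s3 ls s3://acme-splunk-archive/splunk/archive/ --recursive | tail -5
```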
2. Create an IAM user for Splunk to read the archive
Splunk needs read access to the S3 bucket. The Splunk Add-on for AWS only accepts long-term IAM access keys (not STS temporary credentials), so create a dedicated IAM user with read-only access. The policy below grants Splunk just enough to (a) discover the bucket in its UI dropdown and (b) read objects from your archive bucket — nothing else. After attaching the policy, create an access key for the user: AWS returns an AccessKeyId (starts with AKIA…) and a SecretAccessKey. Save both — AWS only shows the secret once. Store them in a secret manager.
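A minimal sketch of that read-only policy. The bucket name acme-splunk-archive is the example from step 1; adjust it to your bucket and tighten further if your security review requires it:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DiscoverBuckets",
      "Effect": "Allow",
      "Action": ["s3:ListAllMyBuckets", "s3:GetBucketLocation"],
      "Resource": "*"
    },
    {
      "Sid": "ReadArchive",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::acme-splunk-archive",
        "arn:aws:s3:::acme-splunk-archive/*"
      ]
    }
  ]
}
```

Attach it to the new IAM user as an inline or customer-managed policy before creating the access key.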
3. Install the Splunk Add-on for AWS
In Splunk:
- Go to Apps → Find more apps.
- Search for Splunk Add-on for Amazon Web Services.
- Click Install and follow the prompts. (On Splunk Cloud you may need to log in with your splunk.com account when prompted.)
4. Add the AWS account in the add-on
- Go to Apps → Splunk Add-on for AWS → Configuration → Account → Add.
- Fill in:
| Field | Value |
|---|---|
| Name | A label (e.g., acme-archive) |
| Key ID | The AKIA… access key from step 2 |
| Secret Key | The matching secret access key |
| Region Category | Global (unless your bucket is in GovCloud or China) |
- Click Add. Splunk validates the credentials before saving.
5. Create the Generic S3 input
This is the read side — Splunk crawls the archive bucket and ingests files into a chosen index.
- Go to Apps → Splunk Add-on for AWS → Inputs → Create New Input → Custom Data Type → Generic S3.
- Fill in:
| Field | Value |
|---|---|
| Name | e.g., splunk-rehydration-archive |
| AWS Account | The account you added in step 4 |
| Assume Role | Leave empty |
| AWS Region | Your bucket’s region |
| S3 Bucket | The archive bucket (e.g., acme-splunk-archive) |
| S3 Key Prefix | A narrow prefix for the first run, e.g., splunk/archive/dt=20260504/ — broaden later |
| Polling Interval | 30 (default) |
| Initial Scan Datetime | A timestamp slightly before the oldest object you want to ingest, e.g., 2026-05-04T00:00:00Z |
| Sourcetype | _json (auto-extracts JSON fields with no parsing config) |
| Index | The Splunk index to ingest into (e.g., main) |
- Click Add. The input begins polling; the first scan typically starts within a minute.
6. Verify rehydration in Splunk Search
Open Search & Reporting and run a search over the index you chose in step 5. Rehydrated events should carry the same fields as the originating logs: date, service, host, message, tags{}, and any attributes.* keys. To confirm a specific service or host, add a filter on those fields to the example search below.
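A hedged example using the values from step 5 (index main, sourcetype _json); widen earliest to cover whatever time range you archived:

```
index=main sourcetype=_json earliest=-30d
| table _time, service, host, message
```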
If events don't show up:
- Check Apps → Splunk Add-on for AWS → Inputs, click the input name, and look for status / errors.
- Search the add-on's internal logs in index=_internal for S3 input errors.
- Verify the IAM keys are still valid by checking the Configuration → Account tab.
Other ways to read S3 archives from Splunk
The Generic S3 input is the most universal option — it works on every Splunk tier, needs no special licensing, and accepts the format Sawmills writes. The trade-off is that it physically ingests S3 objects into a Splunk index, counting against your daily ingest license. A few alternatives, in case they fit your environment better:
- Federated Search for Amazon S3 (FSS3) — Splunk Cloud on AWS only. Queries the archive in place via an AWS Glue table, never ingests it. Closest UX to Datadog Logs Rehydration. Requires a Data Scan Unit license entitlement and a Glue table over the bucket. Best for ad-hoc forensic searches over large archives where you don’t want to pay for ingest.
- Splunk Cloud Data Manager → “Promote data from AWS S3” — Splunk Cloud on AWS only. A first-party “true rehydration” feature that ingests a chosen prefix and time range into a dedicated infinite-retention index in one click. Per-job, not continuous.
- SQS-based S3 input — same Splunk Add-on for AWS, different input type. Replaces polling with S3 → SNS → SQS event notifications, giving you live-tail latency on new objects. Splunk publishes an open-source crawler that emits synthetic SQS messages so you can backfill historical objects through the same path. Recommended over the Generic S3 input for ongoing tail at scale.
- AWS Lambda → Splunk HEC — for environments where neither the Splunk Add-on nor FSS3 is available. S3 → SQS → Lambda function that POSTs to the HEC endpoint. Most code you’d own, but it works against any Splunk instance that exposes HEC; a minimal sketch follows.
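To make that last option concrete, here is a minimal Python sketch of such a Lambda. It assumes S3 event notifications delivered through SQS and two environment variables, HEC_URL (e.g. https://splunk.example.com:8088/services/collector/event) and HEC_TOKEN; every name in it is illustrative rather than something Sawmills or Splunk provides:

```python
import gzip
import json
import os
import urllib.request
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")
HEC_URL = os.environ["HEC_URL"]      # e.g. https://splunk.example.com:8088/services/collector/event
HEC_TOKEN = os.environ["HEC_TOKEN"]  # a HEC token allowed to write to the target index


def handler(event, context):
    """Triggered by SQS messages that carry S3 event notifications for new archive objects."""
    for sqs_record in event["Records"]:
        s3_event = json.loads(sqs_record["body"])  # the SQS body wraps the S3 notification JSON
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = unquote_plus(s3_record["s3"]["object"]["key"])  # object keys arrive URL-encoded

            # Fetch and decompress one hourly archive object (NDJSON, gzipped).
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            lines = gzip.decompress(body).decode("utf-8").splitlines()

            # HEC accepts multiple events per request, each wrapped as {"event": ...}.
            payload = "\n".join(
                json.dumps({"event": json.loads(line)}) for line in lines if line.strip()
            )
            req = urllib.request.Request(
                HEC_URL,
                data=payload.encode("utf-8"),
                headers={"Authorization": f"Splunk {HEC_TOKEN}"},
                method="POST",
            )
            with urllib.request.urlopen(req) as resp:
                resp.read()  # non-2xx responses raise urllib.error.HTTPError
```

A production version would add retries, a dead-letter queue, and payload chunking to stay under HEC's configured maximum request size, but the core loop is just: read the object, gunzip, wrap each NDJSON line in {"event": …}, POST.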