This guide walks through setting up a “send a copy to S3, index it later in OpenSearch when needed” flow. Sawmills writes a full-fidelity archive copy to your own S3 bucket while live traffic continues to your OpenSearch destination. When you need older logs in OpenSearch, Amazon OpenSearch Ingestion or self-managed OpenSearch Data Prepper reads the S3 objects and writes them into an OpenSearch index.
Prerequisites
- A Sawmills pipeline already shipping logs to an Elasticsearch / OpenSearch destination.
- An AWS account where you can create an S3 bucket and grant read access to OpenSearch Ingestion or Data Prepper.
- An Amazon OpenSearch Service domain or an OpenSearch cluster.
- Admin access to create an OpenSearch Ingestion pipeline, or a self-managed OpenSearch Data Prepper deployment.
- A target rehydration index, such as rehydrated-logs, rehydrated-logs-%{yyyy.MM.dd}, or another index naming pattern that fits your retention policy.
1. Add the AWS S3 destination in Sawmills
This is the archive side. Sawmills writes a copy of every log record to S3 alongside the live OpenSearch feed.
- Open your pipeline in Sawmills and click + Add Destination.
- Pick AWS S3.
- Fill in the form:
| Field | Value |
|---|---|
| Name | e.g., opensearch-rehydration-archive |
| Region | The AWS region of your bucket, e.g., us-west-2 |
| S3 Bucket | The bucket name you created, e.g., acme-opensearch-archive |
| S3 Prefix | A path inside the bucket, e.g., opensearch/archive |
| Role ARN | Leave blank for static IAM keys, or set if you use cross-account assume-role |
| Output Format | NDJSON (.json.gz) |
When you pick Output Format = NDJSON (.json.gz), Sawmills automatically:
- Writes one JSON object per line.
- Compresses the object with gzip.
- Partitions objects by hour with keys like dt=YYYYMMDD/hour=HH/archive_HHMMSS.NNNN.<random>.json.gz.
- Locks Enabled Data Types to Logs only.
If your live OpenSearch destination has destination-specific processors that change field names, index names, or other document shape, decide whether the archive should store:
- Raw pipeline logs: simpler, but rehydrated documents may not exactly match the live OpenSearch destination.
- OpenSearch-shaped logs: add equivalent processors to the S3 archive destination so rehydrated documents match the live destination more closely.
For the AWS credentials the collector uses to write to S3, follow the credentials section in the AWS S3 destination reference.
- Click Save, then deploy the pipeline.
After deploy, list the bucket to confirm objects are arriving:
aws s3 ls s3://acme-opensearch-archive/opensearch/archive/dt=$(date -u +%Y%m%d)/ --recursive
You should see hourly-partitioned archive_*.json.gz files appearing within a minute or two.
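To spot-check that an archive object has the expected shape, a minimal Python sketch like the following decompresses one object and parses it line by line. It assumes you have already copied an object locally (for example with aws s3 cp). The function name decode_archive_object and the sample records are illustrative and not part of Sawmills.

```python
import gzip
import json


def decode_archive_object(data: bytes) -> list:
    """Decompress a .json.gz archive object and parse one JSON object per line."""
    text = gzip.decompress(data).decode("utf-8")
    records = []
    for line_number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # tolerate blank lines, such as the one after a trailing newline
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc
    return records


if __name__ == "__main__":
    # Hypothetical sample standing in for a downloaded archive object.
    sample = gzip.compress(
        b'{"service": "my-service", "message": "started"}\n'
        b'{"service": "my-service", "message": "ready"}\n'
    )
    for record in decode_archive_object(sample):
        print(record)
```

If this raises a ValueError on a real object, the destination is probably not using the NDJSON output format, and the rehydration pipeline described later would fail to parse it as well.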
2. Grant read access to the archive
OpenSearch Ingestion and Data Prepper need read access to the S3 bucket. If you use S3 event notifications, they also need access to the SQS queue.
For an Amazon OpenSearch Ingestion pipeline, configure the pipeline role with permissions to read the archive bucket and write to the OpenSearch sink. If you specify sts_role_arn in the pipeline configuration, use the same pipeline role in each component that declares it. AWS documents the S3 source role requirements in Using an OpenSearch Ingestion pipeline with Amazon S3.
A minimal S3 read policy looks like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListSawmillsArchiveBucket",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::acme-opensearch-archive"
    },
    {
      "Sid": "ReadSawmillsArchiveObjects",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::acme-opensearch-archive/*"
    }
  ]
}
If you use SQS notifications, add permissions such as sqs:ReceiveMessage, sqs:DeleteMessage, and sqs:ChangeMessageVisibility on the queue used by the S3 source.
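If you manage several archive buckets, generating the policy document can be less error-prone than editing JSON by hand. The sketch below builds the same S3 read statements and, optionally, an SQS statement with the three actions listed above. The function name, the Sid on the SQS statement, and the queue ARN in the usage example are assumptions for illustration.

```python
import json
from typing import Optional


def build_archive_read_policy(bucket: str, queue_arn: Optional[str] = None) -> dict:
    """Return an IAM policy document granting read access to the archive bucket."""
    statements = [
        {
            "Sid": "ListSawmillsArchiveBucket",
            "Effect": "Allow",
            "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{bucket}",
        },
        {
            "Sid": "ReadSawmillsArchiveObjects",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
        },
    ]
    if queue_arn is not None:
        # Only needed when the S3 source consumes S3 event notifications from SQS.
        statements.append(
            {
                "Sid": "ConsumeArchiveNotifications",  # hypothetical Sid
                "Effect": "Allow",
                "Action": [
                    "sqs:ReceiveMessage",
                    "sqs:DeleteMessage",
                    "sqs:ChangeMessageVisibility",
                ],
                "Resource": queue_arn,
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    policy = build_archive_read_policy(
        "acme-opensearch-archive",
        queue_arn="arn:aws:sqs:us-west-2:123456789012:archive-events",  # hypothetical queue
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON can be attached to the pipeline role as an inline or managed policy. Review it against your own security requirements before applying it.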
3. Choose scan or SQS mode
OpenSearch Ingestion and Data Prepper support two common ways to read S3 archive objects:
| Mode | Use When | Notes |
|---|---|---|
| Scheduled or one-time S3 scan | You want to rehydrate a known historical prefix or time window. | Good first choice for an on-demand rehydration run. |
| S3 to SQS notifications | You want low-latency ingestion as new archive objects arrive. | Better for continuously tailing the archive. Requires S3 event notifications and SQS. |
OpenSearch Data Prepper’s s3 source supports both SQS notification processing and S3 scan processing. Amazon OpenSearch Ingestion also supports S3-SQS processing and scheduled scans.
For rehydrating a specific window, start with a narrow prefix, such as:
opensearch/archive/dt=20260504/hour=10/
The dt and hour folders represent when the archive object was written by the collector, in UTC. Choose start_time / end_time values that line up with those UTC partitions. Self-managed OpenSearch Data Prepper expects ISO LocalDateTime values without a timezone suffix, such as 2026-05-04T00:00:00. Amazon OpenSearch Ingestion scheduled scans use UTC instants with a Z suffix, such as 2026-05-04T00:00:00.000Z. If you need event-time filtering inside a large archive prefix, apply additional Data Prepper processors or OpenSearch queries after ingestion.
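Lining up prefixes and timestamps by hand is easy to get wrong across day boundaries. The sketch below derives the hourly prefixes that overlap a UTC window and formats the window in both timestamp styles described above. The function names are illustrative, and the prefix layout assumes the dt=YYYYMMDD/hour=HH/ partitioning that Sawmills writes.

```python
from datetime import datetime, timedelta, timezone


def hourly_prefixes(base_prefix: str, start: datetime, end: datetime) -> list:
    """List the dt=/hour= prefixes that overlap the UTC window [start, end)."""
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("start and end must be timezone-aware")
    base = base_prefix.rstrip("/")
    end = end.astimezone(timezone.utc)
    # Begin at the top of the hour that contains the start of the window.
    cursor = start.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    prefixes = []
    while cursor < end:
        prefixes.append(f"{base}/dt={cursor:%Y%m%d}/hour={cursor:%H}/")
        cursor += timedelta(hours=1)
    return prefixes


def data_prepper_time(moment: datetime) -> str:
    """Format as a LocalDateTime with no timezone suffix (self-managed Data Prepper)."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")


def osi_time(moment: datetime) -> str:
    """Format as a UTC instant with milliseconds and a Z suffix (OpenSearch Ingestion)."""
    utc = moment.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"


if __name__ == "__main__":
    start = datetime(2026, 5, 4, 10, 0, tzinfo=timezone.utc)
    end = datetime(2026, 5, 4, 12, 0, tzinfo=timezone.utc)
    for prefix in hourly_prefixes("opensearch/archive", start, end):
        print(prefix)
    print(data_prepper_time(start), osi_time(start))
```

Because the partitions reflect write time rather than event time, consider widening the window by an hour on the trailing side if late-arriving events matter for your investigation.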
4. Create an OpenSearch Ingestion or Data Prepper pipeline
The pipeline reads the Sawmills archive from S3, parses each JSON line, and indexes the result into OpenSearch.
The self-managed Data Prepper example below uses S3 scan mode for an on-demand rehydration run:
version: "2"
sawmills-opensearch-rehydrate:
  source:
    s3:
      codec:
        newline:
      compression: gzip
      aws:
        region: us-west-2
        sts_role_arn: arn:aws:iam::123456789012:role/OpenSearchIngestionPipelineRole
      scan:
        start_time: 2026-05-04T00:00:00
        end_time: 2026-05-04T01:00:00
        buckets:
          - bucket:
              name: acme-opensearch-archive
              filter:
                include_prefix:
                  - opensearch/archive/dt=20260504/hour=00/
  processor:
    - parse_json:
  sink:
    - opensearch:
        hosts:
          - https://search-example-domain.us-west-2.es.amazonaws.com
        index: rehydrated-logs-%{yyyy.MM.dd}
        aws:
          region: us-west-2
          sts_role_arn: arn:aws:iam::123456789012:role/OpenSearchIngestionPipelineRole
The important parts are:
- codec.newline: reads each NDJSON line as one event.
- compression: gzip: reads the .json.gz archive objects written by Sawmills.
- parse_json: parses each JSON line into fields before indexing.
- include_prefix: limits the run to the archive prefix you want to rehydrate.
- index: writes into a dedicated rehydration index, avoiding accidental duplicates in the live index. The %{yyyy.MM.dd} placeholder resolves to ingestion time by default, so all rehydrated records land in the index for the day you ran the job. To group by original event time instead, add a date processor that maps your event timestamp field into @timestamp before the sink, or use a static index name such as rehydrated-logs.
OpenSearch Ingestion supports a subset of Data Prepper plugins and options. Validate the pipeline configuration against the OpenSearch Ingestion version and AWS account limits you are using.
5. Verify rehydration in OpenSearch
After the pipeline starts, verify that documents were indexed into the target rehydration index:
GET rehydrated-logs-*/_search
{
"size": 20,
"query": {
"match_all": {}
}
}
Search for a specific service or host:
GET rehydrated-logs-*/_search
{
"size": 20,
"query": {
"match": {
"service": "my-service"
}
}
}
If you see zero events:
- Confirm S3 objects exist under the exact prefix used by the pipeline.
- Confirm the pipeline role can read the bucket and, if configured, the SQS queue.
- Confirm the OpenSearch sink can write to the target index.
- Check OpenSearch Ingestion or Data Prepper logs for S3 read errors, JSON parse errors, and OpenSearch bulk indexing failures.
- Try a narrower S3 prefix first, then broaden after the first successful run.
Operational guidance
Create an index template before rehydration if you need strict mappings for fields such as timestamps, keywords, and nested attributes. Without a template, OpenSearch dynamic mappings may infer types differently than your live index.
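As a rough starting point, the sketch below builds a composable index template body that could be sent with PUT _index_template/rehydrated-logs. The field names and types (@timestamp, service, message) are assumptions for illustration. In practice, copy the mappings from your live index so rehydrated documents behave the same way in queries.

```python
import json


def build_rehydration_template(index_pattern: str = "rehydrated-logs-*") -> dict:
    """Return an index template body with explicit mappings for a few common fields."""
    return {
        "index_patterns": [index_pattern],
        "template": {
            "mappings": {
                "properties": {
                    # Hypothetical fields; replace with the mappings of your live index.
                    "@timestamp": {"type": "date"},
                    "service": {"type": "keyword"},
                    "message": {"type": "text"},
                }
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON body to paste into Dev Tools or send with your HTTP client.
    print(json.dumps(build_rehydration_template(), indent=2))
```

Apply the template before the first rehydration run, because a template only affects indexes created after it exists.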
Keep the S3 prefix narrow for backfills. A daily prefix can contain many objects, and OpenSearch Ingestion/Data Prepper will read and index everything that matches the scan.
Delete or expire the rehydration index when the investigation is complete. The S3 archive remains the long-term source of truth.