S3 logs

    Data Prepper can read objects from S3 buckets using an Amazon Simple Queue Service (Amazon SQS) queue and Amazon S3 Event Notifications.

    Data Prepper polls the Amazon SQS queue for S3 event notifications. When Data Prepper receives a notification that an S3 object was created, Data Prepper reads and parses that S3 object.

    The following diagram shows the overall architecture of the components involved.

    The flow of data is as follows.

    1. S3 creates an S3 event notification in the SQS queue.
    2. Data Prepper polls Amazon SQS for messages and then receives a message.
    3. Data Prepper downloads the content from the S3 object.
    4. Data Prepper sends a document to OpenSearch for the content in the S3 object.
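
    Steps 2 and 3 depend on what the SQS message contains. The following is a minimal Python sketch, not Data Prepper's own implementation, that shows how the bucket name and object key can be pulled out of an S3 event notification message. The function name and the sample bucket and key are hypothetical.

```python
import json
from urllib.parse import unquote_plus


def created_objects(message_body: str) -> list:
    """Return (bucket, key) pairs for object-created records in one SQS message body."""
    event = json.loads(message_body)
    found = []
    # Test events and other messages may carry no "Records" list at all.
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in the notification (a space becomes "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        found.append((bucket, key))
    return found


# Hypothetical message body describing one newly created object.
sample = json.dumps({
    "Records": [{
        "eventSource": "aws:s3",
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "example-log-bucket"},
            "object": {"key": "logs/2024/app+server.log.gz", "size": 1024},
        },
    }]
})
print(created_objects(sample))
```

    Each returned pair identifies one object to download and parse.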

    Pipeline overview

    Data Prepper supports reading data from S3 using the s3 source.

    S3 source architecture

    Before Data Prepper can read log data from S3, you need the following prerequisites:

    • An S3 bucket.
    • A log producer that writes logs to S3. The exact log producer varies by use case; it could be an application that writes logs directly to S3 or a service such as Amazon CloudWatch.

    Getting started

    Use the following steps to begin loading logs from S3 with Data Prepper.

    1. Create an SQS standard queue for your S3 event notifications.
    2. Configure bucket notifications for SQS. Use the s3:ObjectCreated:* event type.
    3. (Recommended) Create an SQS dead-letter queue (DLQ).
    4. (Recommended) Configure an SQS redrive policy to move failed messages into the DLQ.
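
    As an illustration of the bucket notification step, the following Python sketch builds a notification configuration document that sends object-created events to one SQS queue. It assumes the s3:ObjectCreated:* event type; the function name and the queue ARN are placeholders.

```python
import json


def queue_notification_config(queue_arn: str) -> dict:
    """Build a bucket notification configuration that targets one SQS queue."""
    return {
        "QueueConfigurations": [
            {
                "QueueArn": queue_arn,
                # Notify on every kind of object creation (put, post, copy, multipart upload).
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    }


# Placeholder ARN; substitute your own Region, account ID, and queue name.
config = queue_notification_config("arn:aws:sqs:us-east-1:123456789012:example-s3-events")
print(json.dumps(config, indent=2))
```

    The printed JSON has the shape accepted by the aws s3api put-bucket-notification-configuration command through its --notification-configuration option.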

    To view S3 logs, Data Prepper needs access to Amazon SQS and S3. Use the following example to set up permissions:
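
    A minimal sketch of such a policy, written as a Python function that renders the policy document as JSON. The action list is an assumption based on what the pipeline has to do (read objects, receive and delete queue messages, and optionally decrypt with KMS); the function name, statement IDs, and ARNs are placeholders.

```python
import json
from typing import Optional


def data_prepper_policy(bucket: str, queue_arn: str, kms_key_arn: Optional[str] = None) -> dict:
    """Build an IAM policy document for reading S3 objects announced on an SQS queue."""
    statements = [
        {
            "Sid": "S3Access",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{bucket}/*",
        },
        {
            "Sid": "SqsAccess",
            "Effect": "Allow",
            "Action": ["sqs:ReceiveMessage", "sqs:DeleteMessage"],
            "Resource": queue_arn,
        },
    ]
    # Only grant decryption when a KMS key is actually in use.
    if kms_key_arn is not None:
        statements.append({
            "Sid": "KmsAccess",
            "Effect": "Allow",
            "Action": ["kms:Decrypt"],
            "Resource": kms_key_arn,
        })
    return {"Version": "2012-10-17", "Statement": statements}


# Placeholder bucket, queue, and key identifiers.
policy = data_prepper_policy(
    bucket="example-log-bucket",
    queue_arn="arn:aws:sqs:us-east-1:123456789012:example-s3-events",
    kms_key_arn="arn:aws:kms:us-east-1:123456789012:key/example-key-id",
)
print(json.dumps(policy, indent=2))
```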

    If your S3 objects or SQS queues do not use KMS, you can remove the kms:Decrypt permission.

    SQS dead-letter queue

    There are two options for handling errors that result from processing S3 objects:

    • Use an SQS dead-letter queue (DLQ) to track the failure. This is the recommended approach.
    • Delete the message from SQS. With this option, you must manually find the S3 object and correct the error.

    To use an SQS dead-letter queue, perform the following steps:

    1. Create a new SQS standard queue to act as your DLQ.
    2. Configure your SQS queue’s redrive policy to use your DLQ. Consider using a low value such as 2 or 3 for the “Maximum Receives” setting.
    3. Configure the Data Prepper source to use retain_messages for the on_error option. This is the default behavior.
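
    The redrive policy in step 2 can be sketched as follows. This Python example builds the queue attribute that attaches a DLQ; the maxReceiveCount field corresponds to the “Maximum Receives” setting. The function name and the DLQ ARN are placeholders.

```python
import json


def redrive_attributes(dlq_arn: str, max_receives: int = 3) -> dict:
    """Build the SQS queue attributes that attach a dead-letter queue."""
    if max_receives < 1:
        raise ValueError("max_receives must be at least 1")
    policy = {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receives}
    # SQS expects the RedrivePolicy attribute value to be a JSON string.
    return {"RedrivePolicy": json.dumps(policy)}


# Placeholder DLQ ARN; a message moves to the DLQ after 3 failed receives.
attributes = redrive_attributes("arn:aws:sqs:us-east-1:123456789012:example-s3-events-dlq")
print(attributes)
```

    The resulting dictionary matches the Attributes map taken by the SQS SetQueueAttributes action for the source queue.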

    Create a pipeline to read logs from S3, starting with an s3 source plugin. Use the following example for guidance.
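
    As a hedged sketch of such a pipeline, the following Python snippet fills a YAML template for an s3 source. The pipeline name, the gzip compression and newline codec choices, the aws block, the stdout sink, and all identifier values are assumptions for illustration; adjust them to your environment.

```python
# YAML template for a pipeline that starts with the s3 source.
PIPELINE_TEMPLATE = """\
log-pipeline:
  source:
    s3:
      notification_type: sqs
      compression: gzip
      codec:
        newline:
      sqs:
        queue_url: "{queue_url}"
        visibility_timeout: "{visibility_timeout}"
      aws:
        region: "{region}"
        sts_role_arn: "{role_arn}"
  sink:
    - stdout:
"""


def render_pipeline(queue_url: str, region: str, role_arn: str,
                    visibility_timeout: str = "120s") -> str:
    """Fill the template with values specific to one pipeline."""
    return PIPELINE_TEMPLATE.format(
        queue_url=queue_url,
        visibility_timeout=visibility_timeout,
        region=region,
        role_arn=role_arn,
    )


# Placeholder queue URL, Region, and role ARN.
pipeline_yaml = render_pipeline(
    queue_url="https://sqs.us-east-1.amazonaws.com/123456789012/example-s3-events",
    region="us-east-1",
    role_arn="arn:aws:iam::123456789012:role/example-data-prepper-role",
)
print(pipeline_yaml)
```

    The stdout sink is used here only to keep the sketch short; a real pipeline would typically send documents to OpenSearch.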

    Configure the following options according to your use case:

    • queue_url: This is the SQS queue URL and is always unique to your pipeline.
    • visibility_timeout: Configure this value to be large enough for Data Prepper to process 10 S3 objects. However, if you make this value too large, Data Prepper waits at least that long before retrying messages that fail to process.
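
    One way to reason about visibility_timeout is a simple estimate: time per object, multiplied by 10 objects, plus some headroom. The Python sketch below illustrates that arithmetic. The function name and the 1.5 safety margin are assumptions, not values prescribed by Data Prepper.

```python
import math

# SQS does not allow a visibility timeout longer than 12 hours.
MAX_VISIBILITY_TIMEOUT_SECONDS = 12 * 60 * 60


def suggested_visibility_timeout(seconds_per_object: float, objects: int = 10,
                                 margin: float = 1.5) -> int:
    """Estimate a visibility timeout in whole seconds for a batch of S3 objects."""
    if seconds_per_object <= 0 or objects < 1 or margin < 1:
        raise ValueError("expected positive time, at least one object, and margin >= 1")
    estimate = math.ceil(seconds_per_object * objects * margin)
    return min(estimate, MAX_VISIBILITY_TIMEOUT_SECONDS)


# 8 seconds per object x 10 objects x 1.5 margin
print(suggested_visibility_timeout(8))
```

    With a measured processing time of about 8 seconds per object, the estimate is 120 seconds. A larger margin reduces premature redelivery but also delays retries of failed messages.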

    The default values for each option work for the majority of use cases. For all available options for the S3 source, see the s3 source documentation.

    Multiple Data Prepper pipelines

    We recommend that you have one SQS queue per Data Prepper pipeline. In addition, you can have multiple nodes in the same cluster reading from the same SQS queue, which doesn’t require additional configuration with Data Prepper.