Pipelines

To use Data Prepper, you define pipelines in a configuration YAML file. Each pipeline is a combination of a source, a buffer, zero or more processors, and one or more sinks. For example:
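A minimal sketch of such a pipeline is shown here. The pipeline name and the buffer sizes are illustrative, and the upper_case option on string_converter is an assumption based on the behavior described in this section, so check the plugin reference before relying on it.

```yaml
simple-sample-pipeline:       # illustrative pipeline name
  source:
    random:                   # generates random UUID strings
  buffer:
    bounded_blocking:         # the default buffer; this block can be omitted
      buffer_size: 1024       # assumed value: max records the buffer holds
      batch_size: 256         # assumed value: max records drained per read
  processor:
    - string_converter:
        upper_case: true      # assumed option name: uppercases each string
  sink:
    - stdout:                 # prints each event to standard output
```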

Sources define where your data comes from. In this case, the source is a random UUID generator (random).
Buffers store data as it passes through the pipeline.

By default, Data Prepper uses its one and only buffer, the bounded_blocking buffer, so you can omit this section unless you developed a custom buffer or need to tune the buffer settings.
Processors perform some action on your data: filter, transform, enrich, etc.

You can have multiple processors, which run sequentially from top to bottom, not in parallel. The string_converter processor transforms the strings by making them uppercase.
Sinks define where your data goes. In this case, the sink is stdout.

Starting from Data Prepper 2.0, you can define pipelines across multiple configuration YAML files, where each file contains the configuration for one or more pipelines. This gives you more freedom to organize and chain complex pipeline configurations. For Data Prepper to load your pipeline configuration properly, place your configuration YAML files in the pipelines folder under your application’s home directory (e.g. /usr/share/data-prepper).
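As a rough sketch of how two files in the pipelines folder might chain together, the first pipeline below hands its events to the second through the pipeline sink and source. The file names and pipeline names are hypothetical, and the sketch assumes that pipeline connectors resolve across files once Data Prepper has loaded every file in the folder.

```yaml
# File: pipelines/ingest.yaml (hypothetical file name)
ingest-pipeline:
  source:
    http:
  sink:
    - pipeline:
        name: "index-pipeline"   # forwards events to the pipeline in the other file

# File: pipelines/index.yaml (hypothetical file name)
index-pipeline:
  source:
    pipeline:
      name: "ingest-pipeline"    # must match the upstream pipeline's name
  sink:
    - stdout:
```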

In the following example, application-logs is a named route with a condition set to /log_type == "application". The route uses Data Prepper expression syntax to define the conditions. Data Prepper routes only the events that satisfy this condition to the first OpenSearch sink. By default, Data Prepper routes all events to any sink that does not define a route. In the example, all events route into the third OpenSearch sink.

conditional-routing-sample-pipeline:
  source:
    http:
  processor:
  route:
    - application-logs: '/log_type == "application"'
    - http-logs: '/log_type == "apache"'
  sink:
    - opensearch:
        hosts: [ "https://opensearch:9200" ]
        index: application_logs
        routes: [application-logs]
    - opensearch:
        hosts: [ "https://opensearch:9200" ]
        index: http_logs
        routes: [http-logs]
    - opensearch:
        hosts: [ "https://opensearch:9200" ]
        index: all_logs

This section provides some pipeline examples that you can use to start creating your own pipelines. For more information, see the Data Prepper configuration reference guide.

The Data Prepper repository has several sample applications to help you get started.

The following example demonstrates how to use the HTTP source and Grok processor plugins to process unstructured log data.

log-pipeline:
  source:
    http:
  processor:
    - grok:
        match:
          log: [ "%{COMMONAPACHELOG}" ]
  sink:
    - opensearch:
        hosts: [ "https://opensearch:9200" ]
        insecure: true
        username: admin
        password: admin
        index: apache_logs

This example uses weak security. We strongly recommend securing all plugins which open external ports in production environments.

The following example demonstrates how to build a pipeline that supports the Trace Analytics OpenSearch Dashboards plugin. This pipeline takes data from the OpenTelemetry Collector and uses two other pipelines as sinks. These two separate pipelines index the trace and service map documents for the dashboard plugin.

Starting from Data Prepper 2.0, Data Prepper no longer supports the otel_trace_raw_prepper processor because the internal data model has changed. Use the otel_trace_raw processor instead.

To maintain similar ingestion throughput and latency, scale the buffer_size and batch_size by the estimated maximum batch size in the client request payload.
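The trace pipeline described above could be sketched roughly as follows. The pipeline names, buffer values, and credentials are illustrative assumptions. The index_type values and the service_map_stateful processor are written as I understand the OpenSearch sink and trace plugins, so verify them against the plugin reference for your version.

```yaml
entry-pipeline:
  source:
    otel_trace_source:
      ssl: false                 # local testing only; enable TLS in production
  buffer:
    bounded_blocking:
      buffer_size: 10240         # assumed value; scale with client batch size
      batch_size: 160            # assumed value; scale with client batch size
  sink:
    - pipeline:
        name: "raw-pipeline"
    - pipeline:
        name: "service-map-pipeline"
raw-pipeline:
  source:
    pipeline:
      name: "entry-pipeline"
  processor:
    - otel_trace_raw:            # replaces otel_trace_raw_prepper in 2.0 and later
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]
        username: admin
        password: admin
        index_type: trace-analytics-raw
service-map-pipeline:
  source:
    pipeline:
      name: "entry-pipeline"
  processor:
    - service_map_stateful:
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]
        username: admin
        password: admin
        index_type: trace-analytics-service-map
```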

Data Prepper supports the following OpenTelemetry metric types:

Gauge
Sum
Summary
Histogram

Other types are not supported. Data Prepper drops all other types, including Exponential Histogram. Additionally, Data Prepper does not support Scope instrumentation.

To set up a metrics pipeline:

metrics-pipeline:
  source:
    otel_metrics_source:
  processor:
    - otel_metrics_raw_processor:
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]
        username: admin
        password: admin

The following example demonstrates how to use the S3 Source and Grok Processor plugins to process unstructured log data from Amazon Simple Storage Service (Amazon S3). This example uses Application Load Balancer logs. As the Application Load Balancer writes logs to S3, S3 creates notifications in Amazon SQS. Data Prepper reads those notifications and reads the S3 objects to get the log data and process it.

log-pipeline:
  source:
    s3:
      notification_type: "sqs"
      compression: "gzip"
      codec:
        newline:
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/12345678910/ApplicationLoadBalancer"
      aws:
        sts_role_arn: "arn:aws:iam::12345678910:role/Data-Prepper"
  processor:
    - grok:
        match:
          message: ["%{DATA:type} %{TIMESTAMP_ISO8601:time} %{DATA:elb} %{DATA:client} %{DATA:target} %{BASE10NUM:request_processing_time} %{DATA:target_processing_time} %{BASE10NUM:response_processing_time} %{BASE10NUM:elb_status_code} %{DATA:target_status_code} %{BASE10NUM:received_bytes} %{BASE10NUM:sent_bytes} \"%{DATA:request}\" \"%{DATA:user_agent}\" %{DATA:ssl_cipher} %{DATA:ssl_protocol} %{DATA:target_group_arn} \"%{DATA:trace_id}\" \"%{DATA:domain_name}\" \"%{DATA:chosen_cert_arn}\" %{DATA:matched_rule_priority} %{TIMESTAMP_ISO8601:request_creation_time} \"%{DATA:actions_executed}\" \"%{DATA:redirect_url}\" \"%{DATA:error_reason}\" \"%{DATA:target_list}\" \"%{DATA:target_status_code_list}\" \"%{DATA:classification}\" \"%{DATA:classification_reason}"]
    - grok:
        match:
          request: ["(%{NOTSPACE:http_method})? (%{NOTSPACE:http_uri})? (%{NOTSPACE:http_version})?"]
    - grok:
        match:
          http_uri: ["(%{WORD:protocol})?(://)?(%{IPORHOST:domain})?(:)?(%{INT:http_port})?(%{GREEDYDATA:request_uri})?"]
    - date:
        from_time_received: true
        destination: "@timestamp"
  sink:
    - opensearch:
        hosts: [ "https://localhost:9200" ]
        username: "admin"
        password: "admin"
        index: alb_logs

Data Prepper supports Logstash configuration files for a limited set of plugins. To use one, run Data Prepper with the Logstash configuration file in place of a pipeline YAML file.

This feature is limited by Data Prepper's feature parity with Logstash. As of the Data Prepper 1.2 release, the following plugins from the Logstash configuration are supported:

HTTP Input plugin
Grok Filter plugin
Elasticsearch Output plugin
Amazon Elasticsearch Output plugin

Data Prepper itself provides administrative HTTP endpoints such as /list to list pipelines and /metrics/prometheus to provide Prometheus-compatible metrics data. The port that serves these endpoints has its own TLS configuration, which is specified in a separate YAML file. By default, these endpoints are secured by the Data Prepper Docker images. We strongly recommend providing your own configuration file for securing production environments. Here is an example data-prepper-config.yaml:

ssl: true
keyStoreFilePath: "/usr/share/data-prepper/keystore.jks"
keyStorePassword: "password"
privateKeyPassword: "other_password"
serverPort: 1234

To configure the Data Prepper server, run Data Prepper with the additional YAML file mounted alongside your pipeline configuration.

docker run --name data-prepper \
    -v /full/path/to/my-pipelines.yaml:/usr/share/data-prepper/pipelines/my-pipelines.yaml \
    -v /full/path/to/data-prepper-config.yaml:/usr/share/data-prepper/data-prepper-config.yaml \
    opensearchproject/data-prepper:latest

Data Prepper provides an HTTP service to forward events between Data Prepper nodes for aggregation. This is required for operating Data Prepper in a clustered deployment. Currently, peer forwarding is supported in the aggregate, service_map_stateful, and otel_trace_raw processors. The peer forwarder groups events based on the identification keys provided by the processors. For service_map_stateful and otel_trace_raw, the key is traceId by default and cannot be configured. For the aggregate processor, it is configurable using the identification_keys option.
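To illustrate the configurable case, here is a minimal sketch of an aggregate processor that groups events by two fields. The pipeline name and the field names sourceIp and destinationIp are hypothetical, and the put_all action and group_duration option are written as I understand the aggregate processor, so treat them as assumptions to verify.

```yaml
aggregate-sample-pipeline:       # hypothetical pipeline name
  source:
    http:
  processor:
    - aggregate:
        # Events that share these values are forwarded to the same node.
        identification_keys: ["sourceIp", "destinationIp"]  # hypothetical fields
        action:
          put_all:               # assumed action: merges grouped events into one
        group_duration: "180s"   # assumed option: how long a group stays open
  sink:
    - stdout:
```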

To configure the peer forwarder, add configuration options to data-prepper-config.yaml mentioned in the previous Configure the Data Prepper server section:
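A sketch of what such a configuration might look like follows, using a fixed list of peers. The host names and certificate paths are hypothetical, and the option names (discovery_mode, static_endpoints, ssl, ssl_certificate_file, ssl_key_file) are given as I understand the peer forwarder settings, so confirm them against the reference for your Data Prepper version.

```yaml
peer_forwarder:
  discovery_mode: static         # assumed option: use a fixed list of peers
  static_endpoints:              # hypothetical host names of the cluster nodes
    - "data-prepper-1.example.com"
    - "data-prepper-2.example.com"
  ssl: true                      # encrypt traffic between nodes
  ssl_certificate_file: "/usr/share/data-prepper/certs/peer.crt"  # hypothetical path
  ssl_key_file: "/usr/share/data-prepper/certs/peer.key"          # hypothetical path
```

Every node in the cluster would need the same peer list so that events with the same identification keys always reach the same node.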