Observability Best Practices

    Although installing Istio does not deploy Prometheus by default, the Getting Started instructions install the quick-start deployment of Prometheus. This deployment is intentionally configured with a very short retention window (6 hours). It is also configured to collect metrics from each Envoy proxy running in the mesh, augmenting each metric with a set of labels about its origin (instance, pod, and namespace).

    Production-scale Istio monitoring with Prometheus

    In order to aggregate metrics across instances and pods, update the default Prometheus configuration with the following recording rules:

        apiVersion: monitoring.coreos.com/v1
        kind: PrometheusRule
        metadata:
          name: istio-metrics-aggregation
          labels:
            app.kubernetes.io/name: istio-prometheus
        spec:
          groups:
          - name: "istio.metricsAggregation-rules"
            interval: 5s
            rules:
            # Each rule sums a metric across instances and pods and records
            # the result under the "workload:" prefix.
            - record: "workload:istio_requests_total"
              expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_requests_total)"
            - record: "workload:istio_request_duration_milliseconds_count"
              expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_count)"
            - record: "workload:istio_request_duration_milliseconds_sum"
              expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_sum)"
            - record: "workload:istio_request_duration_milliseconds_bucket"
              expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_bucket)"
            - record: "workload:istio_request_bytes_count"
              expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_bytes_count)"
            - record: "workload:istio_request_bytes_sum"
              expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_bytes_sum)"
            - record: "workload:istio_request_bytes_bucket"
              expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_bytes_bucket)"
            - record: "workload:istio_response_bytes_count"
              expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_response_bytes_count)"
            - record: "workload:istio_response_bytes_sum"
              expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_response_bytes_sum)"
            - record: "workload:istio_response_bytes_bucket"
              expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_response_bytes_bucket)"
            - record: "workload:istio_tcp_connections_opened_total"
              expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_connections_opened_total)"
            - record: "workload:istio_tcp_connections_closed_total"
              expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_connections_closed_total)"
            - record: "workload:istio_tcp_sent_bytes_total_count"
              expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_sent_bytes_total_count)"
            - record: "workload:istio_tcp_sent_bytes_total_sum"
              expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_sent_bytes_total_sum)"
            - record: "workload:istio_tcp_sent_bytes_total_bucket"
              expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_sent_bytes_total_bucket)"
            - record: "workload:istio_tcp_received_bytes_total_count"
              expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_received_bytes_total_count)"
            - record: "workload:istio_tcp_received_bytes_total_sum"
              expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_received_bytes_total_sum)"
            - record: "workload:istio_tcp_received_bytes_total_bucket"
              expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_received_bytes_total_bucket)"

    The recording rules above only aggregate across pods and instances. They still preserve the full set of Istio Standard Metrics, including all Istio dimensions. While this helps control metrics cardinality via federation, you may want to further optimize the recording rules to match your existing dashboards, alerts, and ad-hoc queries.

    For more information on tailoring your recording rules, see the section on Optimizing metrics collection with recording rules.

    To establish Prometheus federation, modify the configuration of your production-ready deployment of Prometheus to scrape the federation endpoint of the Istio Prometheus.
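
    For a production Prometheus that is configured directly through its own configuration file, a federation job might look like the following minimal sketch. The job name and the static target address (a prometheus service on port 9090 in the istio-system namespace) are assumptions for illustration; the match clauses and relabeling follow the workload: prefix used by the recording rules above.

```yaml
scrape_configs:
- job_name: 'istio-prometheus'        # hypothetical job name
  honor_labels: true                  # keep labels as exposed by the Istio Prometheus
  metrics_path: '/federate'
  static_configs:
  - targets:
    - 'prometheus.istio-system:9090'  # assumed address of the Istio Prometheus
  params:
    'match[]':
    - '{__name__=~"workload:(.*)"}'   # workload-level aggregated metrics
    - '{__name__=~"pilot(.*)"}'       # control plane metrics
  metric_relabel_configs:
  # Strip the "workload:" prefix so existing queries keep working.
  - source_labels: [__name__]
    regex: 'workload:(.*)'
    target_label: __name__
    action: replace
```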

    If you are using the Prometheus Operator, use the following configuration instead:

        apiVersion: monitoring.coreos.com/v1
        kind: ServiceMonitor
        metadata:
          name: istio-federation
          labels:
            app.kubernetes.io/name: istio-prometheus
        spec:
          namespaceSelector:
            matchNames:
            - istio-system  # assumed: namespace of the Istio-deployed Prometheus
          selector:
            matchLabels:
              app: prometheus
          endpoints:
          - interval: 30s
            scrapeTimeout: 30s
            params:
              'match[]':
              - '{__name__=~"workload:(.*)"}'
              - '{__name__=~"pilot(.*)"}'
            path: /federate
            targetPort: 9090
            honorLabels: true
            metricRelabelings:
            # Strip the "workload:" prefix from federated metric names.
            - sourceLabels: ["__name__"]
              regex: 'workload:(.*)'
              targetLabel: "__name__"
              action: replace

    The key to the federation configuration is matching on the job in the Istio-deployed Prometheus that is collecting Istio Standard Metrics and renaming any metrics collected by removing the prefix used in the workload-level recording rules (workload:). This will allow existing dashboards and queries to seamlessly continue working when pointed at the production Prometheus instance (and away from the Istio instance).

    You can also include additional metrics (for example, envoy, go, etc.) when setting up federation.
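
    As a minimal sketch of including additional metrics, the match[] list of the federation endpoint could be extended as follows. The envoy_ and go_ name prefixes are assumptions about how those metric families are named in your Istio Prometheus. Names that do not start with workload: are left unchanged by the relabeling rule.

```yaml
# Fragment of a ServiceMonitor endpoint (or of the params block of a scrape job).
params:
  'match[]':
  - '{__name__=~"workload:(.*)"}'
  - '{__name__=~"pilot(.*)"}'
  - '{__name__=~"envoy_(.*)"}'   # assumed prefix for Envoy proxy metrics
  - '{__name__=~"go_(.*)"}'      # assumed prefix for Go runtime metrics
```

    Federating unaggregated proxy metrics adds to the series count in the production instance, so it is worth limiting these clauses to the metrics you actually query.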

    Control plane metrics are also collected and federated up to the production Prometheus.

    Beyond just using recording rules to aggregate across pods and instances, you may want to use recording rules to generate aggregated metrics tailored specifically to your existing dashboards and alerts. Optimizing your collection in this manner can result in large savings in resource consumption in your production instance of Prometheus, in addition to faster query performance.

    For example, imagine a custom monitoring dashboard that used the following Prometheus queries:

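    • Total rate of requests averaged over the past minute by destination service name and namespace

      sum(irate(istio_requests_total{reporter="source"}[1m]))
      by (
        destination_canonical_service,
        destination_workload_namespace
      )

    • 95th-percentile (P95) client request latency over the past minute by source and destination service name and namespace

      histogram_quantile(0.95,
        sum(irate(istio_request_duration_milliseconds_bucket{reporter="source"}[1m]))
        by (
          destination_canonical_service,
          destination_workload_namespace,
          source_canonical_service,
          source_workload_namespace,
          le
        )
      )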

    The following set of recording rules could be added to the Istio Prometheus configuration, using the istio: prefix to make these metrics simple to identify for federation.
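
    A minimal sketch of such rules is shown here. The group name and the record name of the request-rate rule are hypothetical; the record name of the latency rule is chosen so that, once the istio: prefix is removed during federation, it matches the replacement query given at the end of this section.

```yaml
groups:
- name: "istio.recording-rules"   # hypothetical group name
  interval: 5s
  rules:
  # Request rate by destination service and namespace (hypothetical record name).
  - record: "istio:istio_requests:by_destination_service:rate1m"
    expr: |
      sum(irate(istio_requests_total{reporter="source"}[1m]))
      by (
        destination_canonical_service,
        destination_workload_namespace
      )
  # P95 client latency by source and destination service and namespace.
  - record: "istio:istio_request_duration_milliseconds_bucket:p95:rate1m"
    expr: |
      histogram_quantile(0.95,
        sum(irate(istio_request_duration_milliseconds_bucket{reporter="source"}[1m]))
        by (
          destination_canonical_service,
          destination_workload_namespace,
          source_canonical_service,
          source_workload_namespace,
          le
        )
      )
```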

    The production instance of Prometheus would then be updated to federate from the Istio instance with:

    • match clause of {__name__=~"istio:(.*)"}

    • metric relabeling config with: regex: "istio:(.*)"
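
    Put together, the two settings above might appear in a federation scrape job on the production Prometheus as in this sketch. Only the fragment relevant to the istio: prefix is shown; the rest of the job (name, targets, path) is assumed to be defined as for the workload-level federation.

```yaml
# Fragment of a federation scrape job on the production Prometheus.
params:
  'match[]':
  - '{__name__=~"istio:(.*)"}'   # select only the tailored recording rules
metric_relabel_configs:
# Drop the "istio:" prefix, keeping the rest of the recorded name.
- source_labels: [__name__]
  regex: 'istio:(.*)'
  target_label: __name__
  action: replace
```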

    The original queries would then be replaced with:

    • avg(istio_request_duration_milliseconds_bucket:p95:rate1m)