Monitoring Checkpointing

    Monitoring

    • Checkpoint Counts
      • Triggered: The total number of checkpoints that have been triggered since the job started.
      • In Progress: The current number of checkpoints that are in progress.
      • Completed: The total number of successfully completed checkpoints since the job started.
      • Failed: The total number of failed checkpoints since the job started.
    • Latest Completed Checkpoint: The latest successfully completed checkpoint. Clicking on More details gives you detailed statistics down to the subtask level.
    • Latest Failed Checkpoint: The latest failed checkpoint. Clicking on More details gives you detailed statistics down to the subtask level.
    • Latest Savepoint: The latest triggered savepoint with its external path. Clicking on More details gives you detailed statistics down to the subtask level.
    • Latest Restore: There are two types of restore operations.
      • Restore from Checkpoint: We restored from a regular periodic checkpoint.
      • Restore from Savepoint: We restored from a savepoint.

    History Tab

    • ID: The ID of the triggered checkpoint. The IDs are incremented for each checkpoint, starting at 1.
    • Status: The current status of the checkpoint, which is either In Progress, Completed, or Failed. If the triggered checkpoint is a savepoint, you will see an additional symbol.
    • Trigger Time: The time when the checkpoint was triggered at the JobManager.
    • Latest Acknowledgement: The time when the latest acknowledgement for any subtask was received at the JobManager (or n/a if no acknowledgement has been received yet).
    • End to End Duration: The duration from the trigger timestamp until the latest acknowledgement (or n/a if no acknowledgement has been received yet). The end to end duration of a complete checkpoint is determined by the last subtask that acknowledges the checkpoint. This time is usually larger than what single subtasks need to actually checkpoint their state.
    • State Size: The state size over all acknowledged subtasks.
    • Buffered During Alignment: The number of bytes buffered during alignment over all acknowledged subtasks. This is only > 0 if a stream alignment takes place during checkpointing. If the checkpointing mode is At least Once, this will always be zero, as at least once mode does not require stream alignment.

    History Size Configuration

    You can configure the number of recent checkpoints that are remembered for the history via the following configuration key. The default is .

    Summary Tab

    The summary computes simple minimum/average/maximum statistics over all completed checkpoints for the End to End Duration, State Size, and Bytes Buffered During Alignment (see History for details about what these mean).
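
    The End to End Duration and the summary statistics described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how the web interface computes it internally: the record layout, the field names (`trigger`, `acks`, `status`) and the sample timestamps are all hypothetical.

```python
from statistics import mean


def end_to_end_duration(trigger_ts, ack_ts_by_subtask):
    """Duration from the trigger timestamp until the latest acknowledgement.

    Returns None (shown as n/a) if no acknowledgement has been received yet.
    """
    if not ack_ts_by_subtask:
        return None
    # The last subtask to acknowledge determines the duration.
    return max(ack_ts_by_subtask.values()) - trigger_ts


def summarize(values):
    """Minimum/average/maximum over a list of numbers, or None if it is empty."""
    if not values:
        return None
    return {"min": min(values), "avg": mean(values), "max": max(values)}


# Hypothetical history entries; timestamps are in milliseconds.
checkpoints = [
    {"id": 1, "status": "COMPLETED", "trigger": 1000, "acks": {"map-0": 1040, "map-1": 1120}},
    {"id": 2, "status": "COMPLETED", "trigger": 2000, "acks": {"map-0": 2030, "map-1": 2060}},
    {"id": 3, "status": "FAILED", "trigger": 3000, "acks": {"map-0": 3500}},
]

# Only completed checkpoints contribute to the summary.
durations = [
    end_to_end_duration(c["trigger"], c["acks"])
    for c in checkpoints
    if c["status"] == "COMPLETED"
]
duration_summary = summarize(durations)
print(duration_summary)
```

    In this sample, checkpoint 1 takes 120 ms end to end even though one of its subtasks acknowledged after 40 ms, which is why the end to end duration is usually larger than what a single subtask needs.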
    Checkpoint Monitoring: Summary
    Note that these statistics don’t survive a JobManager loss and are reset if your JobManager fails over.

    Configuration Tab

    The configuration tab lists your streaming configuration:

    • Checkpointing Mode: Either Exactly Once or At least Once.
    • Interval: The configured checkpointing interval. Checkpoints are triggered at this interval.
    • Timeout: Timeout after which a checkpoint is cancelled by the JobManager and a new checkpoint is triggered.
    • Minimum Pause Between Checkpoints: Minimum required pause between checkpoints. After a checkpoint has completed successfully, we wait at least for this amount of time before triggering the next one, potentially delaying the regular interval.
    • Maximum Concurrent Checkpoints: The maximum number of checkpoints that can be in progress concurrently.
    • Persist Checkpoints Externally: Enabled or Disabled. If enabled, the tab also lists the cleanup config for externalized checkpoints (delete or retain on cancellation).

    Checkpoint Details

    When you click on a More details link for a checkpoint, you get a Minimum/Average/Maximum summary over all its operators and also the detailed numbers per single subtask.
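
    To make the interaction between Interval and Minimum Pause Between Checkpoints from the Configuration tab more concrete, the following Python sketch models when the next checkpoint may be triggered. It is a simplified model under stated assumptions (at most one concurrent checkpoint, the previous checkpoint completed successfully), and the function name and sample values are hypothetical; the actual scheduling logic in the JobManager may differ in detail.

```python
def next_trigger_time(last_trigger, last_completion, interval, min_pause):
    """Earliest time the next checkpoint may be triggered (simplified model).

    All arguments are in milliseconds. The regular schedule asks for
    last_trigger + interval, but the minimum pause requires waiting until
    at least last_completion + min_pause, so the later of the two wins.
    """
    regular = last_trigger + interval
    earliest_after_pause = last_completion + min_pause
    return max(regular, earliest_after_pause)


# Hypothetical settings: interval of 1000 ms, minimum pause of 500 ms.
# A fast checkpoint (completed after 200 ms) leaves the schedule untouched.
on_schedule = next_trigger_time(0, 200, 1000, 500)
# A slow checkpoint (completed after 800 ms) pushes the next trigger back.
delayed = next_trigger_time(0, 800, 1000, 500)
print(on_schedule, delayed)
```

    With these sample values the slow checkpoint delays the next trigger from 1000 ms to 1300 ms, which is the "potentially delaying the regular interval" effect described for the minimum pause setting.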

    Summary per Operator

    Checkpoint Monitoring: Details Summary

    All Subtask Statistics