Troubleshooting datacenter to datacenter replication

    The datacenter to datacenter replication is a distributed system with a lot of different components. As with any such system, it requires some, but not a lot, of operational support.

    This section includes information on how to troubleshoot the datacenter to datacenter replication.

    For a general introduction to the datacenter to datacenter replication, please refer to the chapter.

    All of the components of ArangoSync provide means to monitor their status. Below you’ll find an overview per component.

    • Sync master & workers: The servers running as either master or worker provide:
      • A status API, see arangosync get status. Make sure that all statuses report running. For even more detail the following commands are also available: arangosync get tasks, arangosync get masters & arangosync get workers.
      • A log on the standard output. Log levels can be configured using --log.level settings.
    • ArangoDB cluster: The arangod servers that make up the ArangoDB cluster provide:
      • A log file. This is configurable with settings with a log. prefix, e.g. --log.level=info.
      • A statistics API GET /_admin/statistics
    • Kafka cluster: The kafka brokers provide:
      • A log file, see settings with log. prefix in its configuration file.
    • Zookeeper: The zookeeper agents provide:
      • A log on standard output.
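
As an illustration of the statistics API listed above, the following sketch builds a GET /_admin/statistics request using only the Python standard library. The base URL, the optional JWT, and the helper names (build_statistics_request, fetch_statistics) are assumptions made for this example and are not part of ArangoSync; adapt the authentication to whatever your cluster actually uses.

```python
import json
import urllib.request
from typing import Optional


def build_statistics_request(base_url: str, jwt: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for the arangod statistics API."""
    url = base_url.rstrip("/") + "/_admin/statistics"
    request = urllib.request.Request(url, method="GET")
    if jwt is not None:
        # Assumption: the server is configured to accept a JWT bearer token.
        request.add_header("Authorization", "bearer " + jwt)
    return request


def fetch_statistics(base_url: str, jwt: Optional[str] = None, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON body (requires a reachable server)."""
    request = build_statistics_request(base_url, jwt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling fetch_statistics periodically for each arangod server and storing the results gives a simple baseline to compare against when something looks wrong.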

    What to look for while monitoring status

    The very first thing to do when monitoring the status of ArangoSync is to look into the status provided by arangosync get status … -v. When not everything is in the running state (on both datacenters), this is an indication that something may be wrong. In case that happens, give it some time (incremental synchronization may take quite some time for large collections) and look at the status again. If the statuses do not change (or change, but do not reach running), it is time to inspect the metrics & log files.

    When the metrics or logs seem to indicate a problem in a sync master or worker, it is safe to restart it, as long as only 1 instance is restarted at a time. Give restarted instances some time to “catch up”.
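
The "give it some time and look again" advice above can be sketched as a small polling loop. This is a minimal sketch, not an ArangoSync feature: fetch_statuses is a hypothetical callable that you would implement yourself (for example by inspecting the output of arangosync get status … -v), and the attempt count and delay are arbitrary assumptions.

```python
import time
from typing import Callable, Iterable


def all_running(statuses: Iterable[str]) -> bool:
    """True only if at least one status is given and every one is 'running'."""
    items = [status.strip().lower() for status in statuses]
    return bool(items) and all(status == "running" for status in items)


def wait_until_running(
    fetch_statuses: Callable[[], Iterable[str]],  # hypothetical: returns the current statuses
    attempts: int = 5,
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Re-check the statuses a few times before concluding something is wrong."""
    for attempt in range(attempts):
        if all_running(fetch_statuses()):
            return True
        if attempt < attempts - 1:
            # Incremental synchronization of large collections can take a while.
            sleep(delay_seconds)
    return False
```

If wait_until_running returns False on either datacenter, that is the point at which to inspect the metrics and log files.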

    When a problem remains and restarting masters/workers does not solve it, contact support. Make sure to provide support with the following information:

    • Output of arangosync get version … on both datacenters.
    • Output of arangosync get status … -v on both datacenters.
    • Output of arangosync get masters … -v on both datacenters.
    • Output of arangosync get workers … -v on both datacenters.
    • Log files of all components
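
Gathering the command output listed above can be scripted. The sketch below runs the four arangosync commands and writes each output to a file; it would be run once per datacenter. The connection arguments are elided in this document ("…"), so they are passed in by the caller as connection_args; the helper names and the file naming scheme are assumptions for illustration. Log files still have to be collected separately.

```python
import subprocess
from pathlib import Path
from typing import Callable, Dict, List, Sequence

# Subcommand and trailing flags for each piece of information support asks for.
SUPPORT_COMMANDS = {
    "version": (["get", "version"], []),
    "status": (["get", "status"], ["-v"]),
    "masters": (["get", "masters"], ["-v"]),
    "workers": (["get", "workers"], ["-v"]),
}


def build_commands(connection_args: Sequence[str], binary: str = "arangosync") -> Dict[str, List[str]]:
    """Return the full argument list for each support command."""
    return {
        name: [binary, *subcommand, *connection_args, *flags]
        for name, (subcommand, flags) in SUPPORT_COMMANDS.items()
    }


def collect_support_info(
    connection_args: Sequence[str],
    out_dir: str,
    label: str,  # e.g. a name for the datacenter, used in the file names
    run: Callable = subprocess.run,
) -> List[Path]:
    """Run each command and save stdout and stderr to '<label>-<name>.txt'."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for name, command in build_commands(connection_args).items():
        result = run(command, capture_output=True, text=True)
        path = target / f"{label}-{name}.txt"
        path.write_text(result.stdout + result.stderr)
        written.append(path)
    return written
```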

    What to do when a source datacenter is down

    When you use ArangoSync for backup of your cluster from one datacenter to another and the source datacenter has a complete outage, you may consider switching your applications to the target (backup) datacenter.

    This is what you must do in that case:

    When the source datacenter is completely unresponsive this will not succeed. In that case use:

    See for how to clean up the source datacenter when it becomes available again.

    • Verify that synchronization has completely stopped using:

    All ArangoSync tasks send heartbeat messages to the other datacenter to indicate “it is still alive”. The other datacenter assumes the connection is “out of sync” when it does not receive any messages for a certain period of time.

    To make ArangoSync hold off resynchronization during planned maintenance, on both datacenters, run:

    The last argument is the period for which ArangoSync should hold off resynchronization. This can be minutes (e.g. 15m) or hours (e.g. 3h).

    If maintenance is taking longer than expected, you can use the same command to extend the hold-off period (e.g. to 4h).

    After the maintenance, use the same command to restore the hold-off period to its default of 1h.
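
When scripting around a maintenance window, it can help to check the hold-off period before passing it on. The sketch below parses only the two forms mentioned above (minutes such as 15m and hours such as 3h); whether ArangoSync accepts other formats is not covered here, and the function names are assumptions for illustration.

```python
import re
from datetime import timedelta

# Only the two forms named in the text: whole minutes ("15m") or whole hours ("3h").
_PERIOD = re.compile(r"(\d+)([mh])")


def parse_hold_off(period: str) -> timedelta:
    """Convert a period such as '15m' or '3h' into a timedelta."""
    match = _PERIOD.fullmatch(period.strip())
    if match is None:
        raise ValueError(f"unsupported hold-off period: {period!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(minutes=amount) if unit == "m" else timedelta(hours=amount)


def covers_maintenance(period: str, expected_maintenance: timedelta) -> bool:
    """True if the hold-off period is at least as long as the planned outage."""
    return parse_hold_off(period) >= expected_maintenance
```

For example, covers_maintenance("3h", timedelta(hours=4)) is False, which signals that the period should be extended before the maintenance overruns.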

    What to do in case of a document that exceeds the message queue limits

    If you insert/update a document in a collection and the size of that document is larger than the maximum message size of your message queue, the collection will no longer be able to synchronize. It will go into a failed state.
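
An application-side guard can reduce the chance of writing such a document in the first place. The sketch below is a rough approximation under stated assumptions: it measures the compact JSON encoding of a document, which is not necessarily the size of the message that is actually queued, and the 1,000,000-byte limit is a placeholder that must be replaced with the maximum message size configured for your message queue. The headroom factor and all names are hypothetical.

```python
import json

# Assumption: placeholder value; use the limit configured for your message queue.
ASSUMED_MAX_MESSAGE_BYTES = 1_000_000


def document_size_bytes(document: dict) -> int:
    """Approximate document size as the UTF-8 length of its compact JSON form."""
    encoded = json.dumps(document, separators=(",", ":"), ensure_ascii=False)
    return len(encoded.encode("utf-8"))


def fits_message_queue(
    document: dict,
    max_message_bytes: int = ASSUMED_MAX_MESSAGE_BYTES,
    headroom: float = 0.9,  # leave room for any overhead around the document
) -> bool:
    """True if the document is comfortably below the assumed message limit."""
    return document_size_bytes(document) <= int(max_message_bytes * headroom)
```

Rejecting or splitting a document when fits_message_queue returns False happens before the write, so the collection never enters the failed state because of it.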

    To recover from that, first remove the document from the ArangoDB cluster in the source datacenter. After that, for each failed shard, run: