Speeding up slave initialization

    For the following example setup, we will use the instance with endpoint tcp://master.domain.org:8529 as master, and the instance with endpoint tcp://slave.domain.org:8530 as slave.

    The goal is to have all data from the database _system on the master replicated to the database _system on the slave (the same process can be applied for other databases).

    First of all you have to start the master server, using a command like the following:
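    The exact command is not shown here; a minimal invocation, assuming the master endpoint used throughout this example and the default data directory, might look like:

    ```shell
    # Start the master instance on the endpoint used in this example
    arangod --server.endpoint tcp://master.domain.org:8529
    ```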

    Depending on your storage engine you also want to adjust the following options:

    • MMFiles: --wal.historic-logfiles: maximum number of historic logfiles to keep after collection (default: 10)

    • RocksDB: --rocksdb.wal-file-timeout: timeout in seconds after which unused WAL files are deleted (default: 10)

    The options above prevent the premature removal of old WAL files from the master, and are useful in case intense write operations happen on the master while you are initializing the slave. If you do not tune these options, the master WAL files may no longer include all the write operations that happened after the backup was taken. This can lead to situations in which the initialized slave is missing some data, or fails to start.
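    For example, to keep WAL data around for longer while the slave is being set up, you could raise these values when starting the master (the numbers below are illustrative, not recommendations):

    ```shell
    # MMFiles engine: keep more historic logfiles (hypothetical value)
    arangod --server.endpoint tcp://master.domain.org:8529 \
            --wal.historic-logfiles 100

    # RocksDB engine: keep unused WAL files for one hour (hypothetical value)
    arangod --server.endpoint tcp://master.domain.org:8529 \
            --rocksdb.wal-file-timeout 3600
    ```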

    Now you have to create a dump from the master using the tool arangodump:

    arangodump --output-directory "dump" --server.endpoint tcp://master.domain.org:8529

    The following is a possible arangodump output:

    1. Server version: 3.3
    2. Connected to ArangoDB 'tcp://master.domain.org:8529', database: '_system', username: 'root'
    3. Writing dump to output directory 'dump'
    4. Last tick provided by server is: 37276350
    5. # Dumping document collection 'TestNums'...
    6. # Dumping document collection 'TestNums2'...
    7. # Dumping document collection 'frenchCity'...
    8. # Dumping document collection 'germanCity'...
    9. # Dumping document collection 'persons'...
    10. # Dumping edge collection 'frenchHighway'...
    11. # Dumping edge collection 'germanHighway'...
    12. # Dumping edge collection 'internationalHighway'...
    13. # Dumping edge collection 'knows'...
    14. Processed 9 collection(s), wrote 1298855504 byte(s) into datafiles, sent 32 batch(es)

    In line 4 the last server tick is displayed. This value will be useful when we start the replication, so that the replication applier starts replicating exactly from that tick.

    Next you have to start the slave:

    arangod --server.endpoint tcp://slave.domain.org:8530

    If you are running master and slave on the same server (just for testing), please make sure you give your slave a different data directory.

    Now you are ready to restore the dump with the tool arangorestore:
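    The restore command itself is not shown here; a plausible invocation, assuming the "dump" directory created by arangodump above and the slave endpoint used in this example, would be:

    ```shell
    # Restore the dump taken from the master into the slave's _system database
    arangorestore --input-directory "dump" --server.endpoint tcp://slave.domain.org:8530
    ```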

    Again, please adapt the command above in case you are using a database different from _system.

    Once the restore is finished, there are two possible approaches to start the replication:

    • with sync check (slower, but easier)
    • without sync check (faster, but last server tick needs to be provided correctly)

    Approach 1: Apply replication with sync check

    Start replication on the slave with arangosh using the following command:

    arangosh --server.endpoint tcp://slave.domain.org:8530

    db._useDatabase("_system");
    require("@arangodb/replication").setupReplication({
      endpoint: "tcp://master.domain.org:8529",
      username: "myuser",
      password: "mypasswd",
      verbose: false,
      includeSystem: false,
      incremental: true,
      autoResync: true
    });
    still synchronizing... last received status: 2017-12-06T14:06:25Z: fetching collection keys for collection 'TestNums' from /_api/replication/keys/keys?collection=7173693&to=57482456&serverId=24282855553110&batchId=57482462
    still synchronizing... last received status: 2017-12-06T14:06:25Z: fetching collection keys for collection 'TestNums' from /_api/replication/keys/keys?collection=7173693&to=57482456&serverId=24282855553110&batchId=57482462
    still synchronizing... last received status: 2017-12-06T14:07:13Z: sorting 10000000 local key(s) for collection 'TestNums'
    still synchronizing... last received status: 2017-12-06T14:07:13Z: sorting 10000000 local key(s) for collection 'TestNums'
    [...]
    still synchronizing... last received status: 2017-12-06T14:09:10Z: fetching master collection dump for collection 'TestNums3', type: document, id 37276943, batch 2, markers processed: 15278, bytes received: 2097258
    still synchronizing... last received status: 2017-12-06T14:09:18Z: fetching master collection dump for collection 'TestNums5', type: document, id 37276973, batch 5, markers processed: 123387, bytes received: 17039688
    [...]
    still synchronizing... last received status: 2017-12-06T14:13:49Z: fetching master collection dump for collection 'TestNums5', type: document, id 37276973, batch 132, markers processed: 9641823, bytes received: 1348744116
    still synchronizing... last received status: 2017-12-06T14:13:59Z: fetching collection keys for collection 'frenchCity' from /_api/replication/keys/keys?collection=27174045&to=57482456&serverId=24282855553110&batchId=57482462
    {
      "state" : {
        "running" : true,
        "lastAppliedContinuousTick" : null,
        "lastProcessedContinuousTick" : null,
        "lastAvailableContinuousTick" : null,
        "safeResumeTick" : null,
        "progress" : {
          "time" : "2017-12-06T14:13:59Z",
          "message" : "send batch finish command to url /_api/replication/batch/57482462?serverId=24282855553110",
          "failedConnects" : 0
        },
        "totalRequests" : 0,
        "totalFailedConnects" : 0,
        "totalEvents" : 0,
        "totalOperationsExcluded" : 0,
        "lastError" : {
          "errorNum" : 0
        },
        "time" : "2017-12-06T14:13:59Z"
      },
      "server" : {
        "version" : "3.3.devel",
        "serverId" : "24282855553110"
      },
      "endpoint" : "tcp://master.domain.org:8529",
      "database" : "_system"
    }

    This is the same command that you would use to start replication even without taking a backup first. The difference, in this case, is that the data that is already present on the slave (restored from the backup) is not transferred over the network from the master to the slave.

    The command above will only check that the data already included in the slave is in sync with the master. After this check, the replication applier will make sure that all write operations that happened on the master after the backup are replicated on the slave.

    While this approach is definitely faster than transferring the whole database over the network, it can still require some time, since a sync check is performed.

    Approach 2: Apply replication by tick

    In this approach, the sync check described above is not performed. As a result, this approach is faster, as the existing slave data is not checked. Write operations are executed starting from the tick you provide and continue with the master's available ticks.

    This is still a safe way to start replication, as long as the correct tick is passed.

    As previously mentioned, the last tick provided by the master is displayed when using arangodump. In our example the last tick was 37276350.

    First of all you have to set the replication applier properties on the slave:

    require("@arangodb/replication").applier.properties({
      endpoint: "tcp://master.domain.org:8529",
      username: "myuser",
      password: "mypasswd",
      verbose: false,
      includeSystem: false,
      incremental: true,
      autoResync: true
    });

    Then you can start the replication with the last log tick provided by the master (from the arangodump output):

    require("@arangodb/replication").applier.start(37276350)

    {
      "state" : {
        "running" : true,
        "lastAppliedContinuousTick" : null,
        "lastProcessedContinuousTick" : null,
        "lastAvailableContinuousTick" : null,
        "safeResumeTick" : null,
        "progress" : {
          "time" : "2017-12-06T13:26:04Z",
          "message" : "applier initially created for database '_system'",
          "failedConnects" : 0
        },
        "totalRequests" : 0,
        "totalFailedConnects" : 0,
        "totalEvents" : 0,
        "totalOperationsExcluded" : 0,
        "lastError" : {
          "errorNum" : 0
        },
        "time" : "2017-12-06T13:33:25Z"
      },
      "server" : {
        "version" : "3.3.devel",
        "serverId" : "176090204017635"
      },
      "endpoint" : "tcp://master.domain.org:8529",
      "database" : "_system"
    }

    After the replication has been started with the command above, you can use the applier.state command to check how far the last applied tick on the slave is from the last available tick on the master.
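    For instance, run inside arangosh connected to the slave, a sketch of such a check could look like the following (the field names match the state output shown above; the surrounding print statements are illustrative):

    ```javascript
    // Inspect the replication applier state on the slave (inside arangosh).
    var replication = require("@arangodb/replication");
    var state = replication.applier.state().state;

    // lastAppliedContinuousTick shows how far the slave has caught up;
    // compare it against the last tick currently available on the master.
    print("running: " + state.running);
    print("last applied tick: " + state.lastAppliedContinuousTick);
    print("last available tick: " + state.lastAvailableContinuousTick);
    ```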