Launching ArangoDB’s standalone Agency

    What is a fault-tolerant state machine in the first place?

    In many service deployments consisting of arbitrary components distributed over multiple machines one is faced with the challenge of creating a dependable centralized knowledge base or configuration. Implementation of such a service turns out to be one of the most fundamental problems in information engineering. While it may seem as if the realization of such a service is easily conceivable, dependability formulates a paradox on computer networks per se. On the one hand, one needs a distributed system to avoid a single point of failure. On the other hand, one has to establish consensus among the computers involved.

    Consensus is the keyword here and its realization on a network proves to be far from trivial. Many papers and conference proceedings have discussed and evaluated this key challenge. Two algorithms, historically far apart, have become widely popular, namely Paxos and its derivatives and Raft. Discussing them and their differences, although highly enjoyable, must remain far beyond the scope of this document. Find the references to the main publications at the bottom of this page.

    At ArangoDB, we decided to implement Raft as it is arguably the easier to understand and thus implement. In simple terms, Raft guarantees that a linear stream of transactions, is replicated in realtime among a group of machines through an elected leader, who in turn must have access to and project leadership upon an overall majority of participating instances. In ArangoDB we like to call the entirety of the components of the replicated transaction log, that is the machines and the ArangoDB instances, which constitute the replicated log, the Agency.

    The Agency must consists of an odd number of Agents in order to be able to establish an overall majority and some means for the Agents to be able to find one another at startup.

    The most obvious way would be to inform all Agents of the addresses and ports of the rest. This however, is more information than needed. For example, it would suffice, if all Agents would know the address and port of the next Agent in a cyclic fashion. Another straightforward solution would be to inform all Agents of the address and port of say the first Agent.

    Clearly all cases, which would form disjunct subsets of Agents would break or in the least impair the functionality of the Agency. From there on the Agents will gossip the missing information about their peers.

    The below commands start up a 3-host Agency on one physical/logical box with ports 8531, 8541 and 8551 for demonstration purposes. The address of the first instance, port 8531, is known to the other two. After at most 2 rounds of gossipping, the last 2 Agents will have a complete picture of their surroundings and persist it for the next restart.

    The parameter is the key ingredient for the second and third instances to find the first instance and thus form a complete Agency. Please refer to the shell-script scripts/ on GitHub or in the source directory.

    The Agency should be up and running within a couple of seconds, during which the instances have gossiped their way into knowing the other Agents and elected a leader. The public API can be checked for the state of the configuration:

    1. curl -s localhost:8531/_api/agency/config
    1. {
    2. "term": 1,
    3. "leaderId": "AGNT-cec78b63-f098-4b4e-a157-a7bebf7947ba",
    4. "commitIndex": 1,
    5. "lastCompactionAt": 0,
    6. "nextCompactionAfter": 1000,
    7. "lastAcked": {
    8. "AGNT-cec78b63-f098-4b4e-a157-a7bebf7947ba": {
    9. "lastAckedTime": 0,
    10. "lastAckedIndex": 1
    11. },
    12. "AGNT-5c8d92ed-3fb5-4886-8990-742ddb4482fa": {
    13. "lastAckedTime": 0.167,
    14. "lastAckedIndex": 1,
    15. "lastAppend": 15.173
    16. },
    17. "AGNT-f6e79b6f-d55f-4ae5-a5e2-4c2d6272b0b8": {
    18. "lastAckedTime": 0.167,
    19. "lastAckedIndex": 1,
    20. "lastAppend": 15.173
    21. }
    22. },
    23. "configuration": {
    24. "pool": {
    25. "AGNT-cec78b63-f098-4b4e-a157-a7bebf7947ba": "tcp://localhost:8531",
    26. "AGNT-5c8d92ed-3fb5-4886-8990-742ddb4482fa": "tcp://localhost:8541"
    27. },
    28. "active": [
    29. "AGNT-5c8d92ed-3fb5-4886-8990-742ddb4482fa",
    30. "AGNT-cec78b63-f098-4b4e-a157-a7bebf7947ba"
    31. ],
    32. "id": "AGNT-cec78b63-f098-4b4e-a157-a7bebf7947ba",
    33. "agency size": 3,
    34. "pool size": 3,
    35. "endpoint": "tcp://localhost:8531",
    36. "min ping": 1,
    37. "max ping": 5,
    38. "timeoutMult": 1,
    39. "supervision": false,
    40. "supervision frequency": 1,
    41. "compaction step size": 1000,
    42. "compaction keep size": 50000,
    43. "supervision grace period": 10,
    44. "version": 4,
    45. "startup": "origin"
    46. },
    47. "engine": "rocksdb",
    48. "version": "3.4.3"
    49. }

    To highlight some details in the above output look for "term" and "leaderId". Both are key information about the current state of the Raft algorithm. You may have noted that the first election term has established a random leader for the Agency, who is in charge of replication of the state machine and for all external read and write requests until such time that the process gets isolated from the other two subsequently losing its leadership.

    Generally, all read and write accesses are transactions moreover any read and write access may consist of multiple such transactions formulated as arrays of arrays in JSON documents.

    An Agency started from scratch will deal with the simplest query as follows:

    1. [{}]

    The above request for an empty key value store will return with an empty document. The inner array brackets will aggregate a result from multiple sources in the key-value-store while the outer array will deliver multiple such aggregated results. Also note the -L curl flag, which allows the request to follow redirects to the current leader.

    1. {
    2. "baz": 12,
    3. "corge": {
    4. "pi": 3.14159265359
    5. },
    6. "bar": "Hello World"
    7. },
    8. "qux": {
    9. "quux": "Hello World"
    10. }
    11. }

    The following array of read transactions will yield:

    1. [
    2. {
    3. "baz": 12,
    4. "foo": {
    5. "bar": "Hello World"
    6. }
    7. },
    8. {
    9. "qux": {
    10. "quux": "Hello World"
    11. }
    12. }
    13. ]

    Note that the result is an array of two results for the first and second read transactions from above accordingly. Also note that the results from the first read transaction are aggregated into

    1. {
    2. "baz": 12,
    3. "foo": {
    4. "bar": "Hello World"
    5. }
    6. }

    The aggregation is performed on 2 levels:

    • /foo/bar is eliminated as a subset of /foo
    • The results from /foo and /bar are joinedThe word transaction means here that it is guaranteed that all aggregations happen in quasi-realtime and that no write access could have happened in between.

    Btw, the same transaction on the virgin key-value store would produce [{},{}]

    The write API must unfortunately be a little more complex. Multiple roads lead to Rome:


      are equivalent for example and will create and fill an array at . Here, again, the outermost array is the container for the transaction arrays.