SmartJoins

SmartJoins are only available in the,also available as managed service.

SmartJoins allow to execute co-located join operations among identicallysharded collections.

ArangoSearch Views are eligible for SmartJoins, provided that their underlyingcollections are eligible too (introduced in v3.6.0).

When doing joins in an ArangoDB cluster, data has to be exchanged between different servers.Joins between different collections in a cluster normally require roundtrips between the shards of these collections for fetching the data. Requests are routed through an extraCoordinator hop.

For example, with two collections c1 and c2 with 4 shards each, the Coordinator will initially contact the 4 shards of c1. In order to perform the join, the DB-Server nodes which manage the actual data of c1 need to pull the data from the other collection, c2. This causes extra roundtrips via the Coordinator, which will then pull the data for c2 from the responsible shards:

This is the general query execution, and it makes sense if there is no furtherinformation available about how the data is actually distributed to the individualshards. It works in case c1 and c2 have a different amount of shards, or use different shard keys or strategies. However, it comes with the additional cost of having to do 4 x 4 requests to perform the join.

Sharding two collections identically using distributeShardsLike

In the specific case that the two collections have the same number of shards, the data of the two collections can be co-located on the same server for the same shard key values. In this case the extra hop via the Coordinator will not be necessary.

The query optimizer will remove the extra hop for the join in case it can provethat data for the two collections is co-located.

Here is an example setup for this, using arangosh:

arangosh> db._create("c1", {numberOfShards: 4, shardKeys: ["_key"]});
arangosh> db._create("c2", {shardKeys: ["_key"], distributeShardsLike: "c1"});

Now the collections c1 and c2 will not only have the same shard keys, but theywill also locate their data for the same shard keys values on the same server.

Let’s check how the data actually gets distributed now. We first confirm that thetwo collections have 4 shards each, which in this example are evenly distributedacross two servers:

arangosh> db.c1.shards(true)
{ 
  "s2011661" : [ 
    "PRMR-64d19f43-3aa0-4abb-81f6-4b9966d32175" 
  ], 
  "s2011662" : [ 
    "PRMR-5f30caa0-4c93-4fdd-98f3-a2130c1447df" 
  ], 
  "s2011663" : [ 
    "PRMR-64d19f43-3aa0-4abb-81f6-4b9966d32175" 
  ], 
  "s2011664" : [ 
    "PRMR-5f30caa0-4c93-4fdd-98f3-a2130c1447df" 
  ] 
}
arangosh> db.c2.shards(true)
{ 
  "s2011666" : [ 
    "PRMR-64d19f43-3aa0-4abb-81f6-4b9966d32175" 
  ], 
  "s2011667" : [ 
  ], 
  "s2011668" : [ 
    "PRMR-64d19f43-3aa0-4abb-81f6-4b9966d32175" 
  ], 
  "s2011669" : [ 
  ] 
}

Because we have told both collections that distribute their data alike, theirshards will now also be populated alike:

We can see that shard 1 of c1 (“s2011664”) has the same number of documents asshard 1 of c2 (“s20116692), that shard 2 of c1 (“s2011661”) has the samenumber of documents as shard2 of c2 (“s2011666”) etc.Additionally, we can see from the shard-to-server distribution above that thecorresponding shards from c1 and c2 always reside on the same node.This is a precondition for running joins locally, and thanks to the effects ofdistributeShardsLike it is now satisfied!

With the two collections in place like this, an AQL query that uses a FILTER conditionthat refers from the shard key of the one collection to the shard key of the other collectionand compares the two shard key values by equality is eligible for the queryoptimizer’s “smart-joins” optimization:

  arangosh> db._explain("FOR doc1 IN c1 FOR doc2 IN c2 FILTER doc1._key == doc2._key RETURN doc1");
  Query String:
   FOR doc1 IN c1 FOR doc2 IN c2 FILTER doc1._key == doc2._key RETURN doc1
  Execution plan:
   Id   NodeType                  Site  Est.   Comment
    1   SingletonNode             DBS      1   * ROOT
    3   EnumerateCollectionNode   DBS      0     - FOR doc2 IN c2   /* full collection scan, 4 shard(s) */
    7   IndexNode                 DBS      0       - FOR doc1 IN c1   /* primary index scan, 4 shard(s) */
   10   RemoteNode                COOR     0         - REMOTE
   11   GatherNode                COOR     0         - GATHER 
    6   ReturnNode                COOR     0         - RETURN doc1

As can be seen above, the extra hop via the Coordinator is gone here, which will meanless cluster-internal traffic and a faster response time.

SmartJoins will also work if the shard key of the second collection is not _key,and even for non-unique shard key values, e.g.:

arangosh> db._create("c1", {numberOfShards: 4, shardKeys: ["_key"]});
arangosh> db._create("c2", {shardKeys: ["parent"], distributeShardsLike: "c1"});
arangosh> db.c2.ensureIndex({ type: "hash", fields: ["parent"] });
arangosh> for (i = 0; i < 100; ++i) { 
  db.c1.insert({ _key: "test" + i }); 
  for (j = 0; j < 10; ++j) {
    db.c2.insert({ parent: "test" + i });
  }
}
arangosh> db._explain("FOR doc1 IN c1 FOR doc2 IN c2 FILTER doc1._key == doc2.parent RETURN doc1");
Query String:
 FOR doc1 IN c1 FOR doc2 IN c2 FILTER doc1._key == doc2.parent RETURN doc1
Execution plan:
 Id   NodeType                  Site  Est.   Comment
  1   SingletonNode             DBS      1   * ROOT
  3   EnumerateCollectionNode   DBS   2000     - FOR doc2 IN c2   /* full collection scan, 4 shard(s) */
  7   IndexNode                 DBS   2000       - FOR doc1 IN c1   /* primary index scan, 4 shard(s) */
 10   RemoteNode                COOR  2000         - REMOTE
 11   GatherNode                COOR  2000         - GATHER 
  6   ReturnNode                COOR  2000         - RETURN doc1

SmartJoins using smartJoinAttribute

In case the join on the second collection must be performed on a non-shard keyattribute, there is the option to specify a smartJoinAttribute for the collection.Note that for this case, setting distributeShardsLike is still required here, and thatonly a single shardKeys attribute can be used.The single attribute name specified in the shardKeys attribute for the collection must endwith a colon character then.

This smartJoinAttribute must be populated for all documents in the collection,and must always contain a string value. The value of the _key attribute for eachdocument must consist of the value of the smartJoinAttribute, a colon characterand then some other user-defined key component.

The setup thus becomes:

Failure to populate the smartJoinAttribute with a string or not at all will leadto a document being rejected on insert, update or replace. Similarly, failure toprefix a document’s _key attribute value with the value of the _smartJoinAttribute_will also lead to the document being rejected:

arangosh> db.c2.insert({ parent: 123 });
JavaScript exception in file './js/client/modules/@arangodb/arangosh.js' at 99,7: ArangoError 4008: smart join attribute not given or invalid
arangosh> db.c2.insert({ _key: "123:test1", parent: "124" });
JavaScript exception in file './js/client/modules/@arangodb/arangosh.js' at 99,7: ArangoError 4007: shard key value must be prefixed with the value of the smart join attribute

The join can now be performed via the collection’s smartJoinAttribute:

arangosh> db._explain("FOR doc1 IN c1 FOR doc2 IN c2 FILTER doc1._key == doc2.parent RETURN doc1")
Query String:
 FOR doc1 IN c1 FOR doc2 IN c2 FILTER doc1._key == doc2.parent RETURN doc1
Execution plan:
 Id   NodeType                  Site  Est.   Comment
  1   SingletonNode             DBS      1   * ROOT
  3   EnumerateCollectionNode   DBS    101     - FOR doc2 IN c2   /* full collection scan, 4 shard(s) */
  7   IndexNode                 DBS    101       - FOR doc1 IN c1   /* primary index scan, 4 shard(s) */
 10   RemoteNode                COOR   101         - REMOTE
 11   GatherNode                COOR   101         - GATHER

If a FILTER condition is used on one of the shard keys, the optimizer will also tryto restrict the queries to just the required shards:

Limitations

The SmartJoins optimization is currently triggered only for data selection queries,but not for any data-manipulation operations such as INSERT, UPDATE, REPLACE, REMOVEor UPSERT, neither traversals, subqueries or views.

It will only be applied when joining two collections with an identical sharding setup. This requires the second collection to be created with its distributeShardsLike attribute pointing to the first collection.

Finally, the SmartJoins optimization requires that the collections are joined on theirshard key attributes (or smartJoinAttribute) using an equality comparison.