6.3. Cost based optimizations
The order in which joins are executed in a query can have a significant impact on the query’s performance. The aspect of join ordering that has the largest impact on performance is the size of the data being processed and transferred over the network. If a join that produces a lot of data is performed early in the execution, then subsequent stages will need to process large amounts of data for longer than necessary, increasing the time and resources needed for the query.
With cost based join enumeration, Presto uses cdoc:/optimizer/statistics provided by connectors to estimate the costs for different join orders and automatically pick the join order with the lowest computed costs.
The join enumeration strategy is governed by the session property, with the optimizer.join-reordering-strategy
configuration property providing the default value.
ELIMINATE_CROSS_JOINS
(default) - eliminate unnecessary cross joins- - purely syntactic join order
If using AUTOMATIC
and statistics are not available, or if for any other reason a cost could not be computed, the ELIMINATE_CROSS_JOINS
strategy is used instead.
Presto uses a hash based join algorithm. That implies that for each join operator a hash table must be created from one join input (called build side). The other input (probe side) is then iterated and for each row the hash table is queried to find matching rows.
There are two types of join distributions:
- Partitioned: each node participating in the query builds a hash table from only a fraction of the data
With cost based join distribution selection, Presto automatically chooses whether to use a partitioned or broadcast join. With both cost based join enumeration and cost based join distribution, Presto automatically chooses which side is the probe and which is the build.
The join distribution type is governed by the join_distribution_type
session property, with the configuration property providing the default value.
The valid values are:
AUTOMATIC
- join distribution type is determined automatically for each joinBROADCAST
- broadcast join distribution is used for all joinsPARTITIONED
(default) - partitioned join distribution is used for all join