6.20. HyperLogLog Functions

    Presto implements HyperLogLog data sketches as a set of 32-bit buckets whichstore a maximum hash. They can be stored sparsely (as a map from bucket IDto bucket), or densely (as a contiguous memory block). The HyperLogLog datastructure starts as the sparse representation, switching to dense when it ismore efficient. The P4HyperLogLog structure is initialized densely andremains dense for its lifetime.

    HyperLogLog implicitly casts to ,while one can explicitly cast HyperLogLog to P4HyperLogLog:

    Data sketches can be serialized to and deserialized from varbinary. Thisallows them to be stored for later use. Combined with the ability to mergemultiple sketches, this allows one to calculate approx_distinct() of theelements of a partition of a query, then for the entirety of a query with verylittle cost.

    For example, calculating the HyperLogLog for daily unique users will allowweekly or monthly unique users to be calculated incrementally by combining thedailies. This is similar to computing weekly revenue by summing daily revenue.Uses of with GROUPING SETS can be converted to useHyperLogLog. Examples:

    1. CREATE TABLE visit_summaries (
    2. visit_date date,
    3. hll varbinary
    4. );
    5.  
    6. SELECT visit_date, cast(approx_set(user_id) AS varbinary)
    7. FROM user_visits
    8. GROUP BY visit_date;
    9.  
    10. SELECT cardinality(merge(cast(hll AS HyperLogLog))) AS weekly_unique_users
    11. FROM visit_summaries
    12. WHERE visit_date >= current_date - interval '7' day;
    • approxset(_x) → HyperLogLog
    • approxset(_x, e) → HyperLogLog

    • Returns the HyperLogLog sketch of the input data set of x, witha maximum standard error of e. The current implementation of thisfunction requires that e be in the range of [0.0040625, 0.26000].This data sketch underlies approx_distinct() and can be stored andused later by calling cardinality().

    • cardinality(hll) → bigint

    • This will perform on the data summarized by thehll HyperLogLog data sketch.

    • Returns an empty HyperLogLog.The value of the maximum standard error is defaulted to 0.01625.

    • emptyapprox_set(_e) → HyperLogLog

    • Returns an empty HyperLogLog with a maximum standard error of e.The current implementation of this function requires that e be inthe range of [0.0040625, 0.26000].

    • Returns the HyperLogLog of the aggregate union of the individual HyperLogLog structures.