Syntax

The query in the format SELECT * FROM table GROUP BY k1, k2, ... returns all columns listed in GROUP BY, i.e., is equivalent to SELECT DISTINCT k1, k2, ... FROM table.

An asterisk can also be used as an argument for the COUNT aggregate function. COUNT(*) means “the count of rows in the group”.

Note

Aggregate functions ignore NULL in their arguments, except for COUNT.

YQL also provides aggregation factories implemented by the functions AGGREGATION_FACTORY and .

Examples

  1. SELECT key, COUNT(*) FROM my_table
  2. GROUP BY key;

GROUP BY - 图2

  1. SELECT double_key, COUNT(*) FROM my_table
  2. GROUP BY key + key AS double_key;

GROUP BY - 图4

Warning

Specifying a name for a column or expression in GROUP BY .. AS foo it is an extension on top of YQL. Such a name becomes visible in WHERE despite the fact that filtering by WHERE is executed before the grouping. For example, if the T table includes two columns, foo and bar, then the query SELECT foo FROM T WHERE foo > 0 GROUP BY bar AS foo would actually filter data by the bar column from the source table.

GROUP BY … SessionWindow()

YQL supports grouping by session. To standard expressions in GROUP BY, you can add a special SessionWindow function:

  1. SELECT
  2. user,
  3. session_start,
  4. SessionStart() AS same_session_start, -- It's same as session_start
  5. COUNT(*) AS session_size,
  6. SUM(value) AS sum_over_session,
  7. FROM my_table
  8. GROUP BY user, SessionWindow(<time_expr>, <timeout_expr>) AS session_start

The following happens in this case:

  1. The input table is partitioned by the grouping keys specified in GROUP BY, ignoring SessionWindow (in this case, it’s based on user).
    If GROUP BY includes nothing more than SessionWindow, then the input table gets into one partition.
  2. Each partition is split into disjoint subsets of rows (sessions).
    For this, the partition is sorted in the ascending order of the time_expr expression.
    The session limits are drawn between neighboring items of the partition, that differ in their time_expr values by more than timeout_expr.
  3. The sessions obtained in this way are the final partitions on which aggregate functions are calculated.

The SessionWindow() key column (in the example, it’s session_start) has the value “the minimum time_expr in the session”.
If GROUP BY includes SessionWindow(), you can use a special aggregate function
SessionStart.

An extended version of SessionWindow with four arguments is also supported:

SessionWindow(<order_expr>, <init_lambda>, <update_lambda>, <calculate_lambda>)

Where:

  • <order_expr>: An expression used to sort the source partition
  • <init_lambda>: A lambda function to initialize the state of session calculation. It has the signature (TableRow())->State. It’s called once for the first (following the sorting order) element of the source partition
  • : A lambda function to update the status of session calculation and define the session limits. It has the signature (TableRow(), State)->Tuple<Bool, State>. It’s called for every item of the source partition, except the first one. The new value of state is calculated based on the current row of the table and the previous state. If the first item in the return tuple is True, then a new session starts from the current row. The key of the new session is obtained by applying <calculate_lambda> to the second item in the tuple.
  • <calculate_lambda>: A lambda function for calculating the session key (the “value” of SessionWindow() that is also accessible via SessionStart()). The function has the signature (TableRow(), State)->SessionKey. It’s called for the first item in the partition (after <init_lambda>) and those items for which <update_lambda> has returned True in the first item in the tuple. Please note that to start a new session, you should make sure that <calculate_lambda> has returned a value different from the previous session key. Sessions having the same keys are not merged. For example, if <calculate_lambda> returns the sequence 0, 1, 0, 1, then there will be four different sessions.

Using the extended version of SessionWindow, you can, for example, do the following: divide a partition into sessions, as in the SessionWindow use case with two arguments, but with the maximum session length limited by a certain constant:

Example

  1. $max_len = 1000; is the maximum session length.
  2. $init = ($row) -> (AsTuple($row.ts, $row.ts)); is the session status: tuple from 1) value of the temporary column ts in the session's first line and 2) in the current line
  3. $update = ($row, $state) -> {
  4. $is_end_session = $row.ts - $state.0 > $max_len OR $row.ts - $state.1 > $timeout;
  5. $new_state = AsTuple(IF($is_end_session, $row.ts, $state.0), $row.ts);
  6. return AsTuple($is_end_session, $new_state);
  7. };
  8. $calculate = ($row, $state) -> ($row.ts);
  9. SELECT
  10. user,
  11. session_start,
  12. SessionStart() AS same_session_start, -- It's same as session_start
  13. COUNT(*) AS session_size,
  14. SUM(value) AS sum_over_session,
  15. FROM my_table
  16. GROUP BY user, SessionWindow(ts, $init, $update, $calculate) AS session_start

You can use SessionWindow in GROUP BY only once.

The results of calculating the aggregate function as subtotals for the groups and overall totals over individual columns or whole table.

Syntax

GROUP BY - 图7

  • ROLLUP groups the column values in the order they are listed in the arguments (strictly from left to right), generates subtotals for each group and the overall total.
  • CUBE groups the values for every possible combination of columns, generates the subtotals for each group and the overall total.
  • GROUPING SETS sets the groups for subtotals.

You can combine ROLLUP, CUBE and GROUPING SETS, separating them by commas.

The values of columns not used in calculations are replaced with NULL in the subtotal. In the overall total, the values of all columns are replaced by NULL. GROUPING: A function that allows you to distinguish the source NULL values from the NULL values added while calculating subtotals and overall totals.

  • 0: If NULL is used for the original empty value.
  • 1: If NULL is added for a subtotal or overall total.

Example

  1. SELECT
  2. column1,
  3. column3,
  4. CASE GROUPING(
  5. column1,
  6. column2,
  7. column3,
  8. )
  9. WHEN 1 THEN "Subtotal: column1 and column2"
  10. WHEN 3 THEN "Subtotal: column1"
  11. WHEN 6 THEN "Subtotal: column3"
  12. WHEN 7 THEN "Grand total"
  13. ELSE "Individual group"
  14. END AS subtotal,
  15. COUNT(*) AS rows_count
  16. FROM my_table
  17. GROUP BY
  18. ROLLUP(
  19. column1,
  20. column2,
  21. column3
  22. ),
  23. GROUPING SETS(
  24. (column2, column3),
  25. (column3)
  26. -- if you add here (column2) as well, then together
  27. -- the ROLLUP and GROUPING SETS would produce a result
  28. -- similar to CUBE
  29. )
  30. ;

DISTINCT

Applying aggregate functions only to distinct values of the column.

Note

Applying DISTINCT to calculated values is not currently implemented. For this purpose, you can use a or the expression GROUP BY ... AS ....

Example

  1. SELECT
  2. key,
  3. COUNT (DISTINCT value) AS count -- top 3 keys by the number of unique values
  4. FROM my_table
  5. GROUP BY key
  6. ORDER BY count DESC
  7. LIMIT 3;

GROUP BY - 图9

You can also use DISTINCT to fetch distinct rows using SELECT DISTINCT.

Improves aggregation efficiency if the query author knows in advance that none of aggregation keys finds large amounts of data (i.e., with the order of magnitude exceeding a gigabyte or a million of rows). If this assumption fails to materialize, then the operation may fail with Out of Memory error or start running much slower compared to the non-COMPACT version.

Unlike the usual GROUP BY, the Map-side combiner stage and additional Reduce are disabled for each field with aggregation.

Example:

HAVING

Filtering a SELECT based on the calculation results of . The syntax is similar to WHERE.

Example

  1. SELECT
  2. key
  3. FROM my_table
  4. HAVING COUNT(value) > 100;