GROUP BY Clause
While aggregate functions are defined as
group_expression
Specifies the criteria based on which the rows are grouped together. The grouping of rows is performed based on result values of the grouping expressions. A grouping expression may be a column name like
GROUP BY a
, a column position likeGROUP BY 0
, or an expression likeGROUP BY a + b
.grouping_set
A grouping set is specified by zero or more comma-separated expressions in parentheses. When the grouping set has only one element, parentheses can be omitted. For example,
GROUPING SETS ((a), (b))
is the same asGROUPING SETS (a, b)
.Syntax:
{ ( [ expression [ , ... ] ] ) | expression }
-
Groups the rows for each grouping set specified after GROUPING SETS. For example,
GROUP BY GROUPING SETS ((warehouse), (product))
is semantically equivalent to union of results ofGROUP BY warehouse
andGROUP BY product
. This clause is a shorthand for aUNION ALL
where each leg of theUNION ALL
operator performs aggregation of each grouping set specified in theGROUPING SETS
clause. Similarly,GROUP BY GROUPING SETS ((warehouse, product), (product), ())
is semantically equivalent to the union of results ofGROUP BY warehouse, product
,GROUP BY product
and global aggregate.Note: For Hive compatibility Spark allows
GROUP BY ... GROUPING SETS (...)
. The GROUP BY expressions are usually ignored, but if it contains extra expressions than the GROUPING SETS expressions, the extra expressions will be included in the grouping expressions and the value is always null. For example, , the output of columnc
is always null. CUBE
CUBE
clause is used to perform aggregations based on combination of grouping columns specified in theGROUP BY
clause.CUBE
is a shorthand forGROUPING SETS
. For example,GROUP BY warehouse, product WITH CUBE
orGROUP BY CUBE(warehouse, product)
is equivalent toGROUP BY GROUPING SETS((warehouse, product), (warehouse), (product), ())
.GROUP BY CUBE(warehouse, product, (warehouse, location))
is equivalent toGROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse, location), (product, warehouse, location), (warehouse), (product), (warehouse, product), ())
. The N elements of aCUBE
specification results in 2^NGROUPING SETS
.-
A GROUP BY clause can include multiple
group_expression
s and multiple s.GROUPING SETS
can also have nestedCUBE|ROLLUP|GROUPING SETS
clauses, e.g.GROUPING SETS(ROLLUP(warehouse, location), CUBE(warehouse, location))
,GROUPING SETS(warehouse, GROUPING SETS(location, GROUPING SETS(ROLLUP(warehouse, location), CUBE(warehouse, location))))
.CUBE|ROLLUP
is just a syntax sugar forGROUPING SETS
, please refer to the sections above for how to translateCUBE|ROLLUP
toGROUPING SETS
.group_expression
can be treated as a single-groupGROUPING SETS
under this context. For multipleGROUPING SETS
in theGROUP BY
clause, we generate a singleGROUPING SETS
by doing a cross-product of the originalGROUPING SETS
s. For nestedGROUPING SETS
in theGROUPING SETS
clause, we simply take its grouping sets and strip it. For example,GROUP BY warehouse, GROUPING SETS((product), ()), GROUPING SETS((location, size), (location), (size), ())
andGROUP BY warehouse, ROLLUP(product), CUBE(location, size)
is equivalent toGROUP BY GROUPING SETS( (warehouse, product, location, size), (warehouse, product, location), (warehouse, product, size), (warehouse, product), (warehouse, location, size), (warehouse, location), (warehouse, size), (warehouse))
.GROUP BY GROUPING SETS(GROUPING SETS(warehouse), GROUPING SETS((warehouse, product)))
is equivalent toGROUP BY GROUPING SETS((warehouse), (warehouse, product))
. aggregate_name
Specifies an aggregate function name (MIN, MAX, COUNT, SUM, AVG, etc.).
DISTINCT
Removes duplicates in input rows before they are passed to aggregate functions.
-
Filters the input rows for which the
boolean_expression
in the clause evaluates to true are passed to the aggregate function; other rows are discarded.