or why 2+2 is not always 4
WITH ROLLUP
is a GROUP BY
modifier that causes extra rows to be added to
the result set produced by the query. These rows represent higher-level (or super-aggregate)
summary operations.
A shallow examination of the effects of WITH ROLLUP
on a query like:
SELECT user_id, COUNT(product) AS nb_products, SUM(amount) AS total FROM cart GROUP BY user_id
can lead to the erroneous conclusion that the WITH ROLLUP
modifier is just a courtesy
of the database implementors to the front-end developers and the extra rows added by it
could be computed on the client-side as well. Big mistake!
A deeper investigation on the GROUP BY
aggregate functions reveals that there are
functions whose values generated by WITH ROLLUP
cannot be computed on the client side
using only the values returned by the GROUP BY
query. These are the statistical functions
(STDDEV_POP()
, VAR_POP()
a.s.o.) that are not used by many applications
but, surprisingly, also the more commonly used SUM()
and COUNT()
,
when they are combined with DISTINCT
.
Case Study
Let’s create a small MySQL table in the database and put some rows in it:
1 | CREATE TABLE `logins` ( |
It is a naïve log table that stores the customer login dates and the browsers they are using.
Let’s count the number of logins for each day:
1 | SELECT `date`, COUNT(`user_id`) AS nb |
The result is straightforward:
+------------+----+
| date | nb |
+------------+----+
| 2015-04-01 | 2 |
| 2015-04-02 | 3 |
| NULL | 5 |
+------------+----+
3 rows in set (0.01 sec)
The WITH ROLLUP
clause produced a nice total and doesn’t provide any surprise: 2 + 3 = 5
,
any first-grade child can assure you about the correctness of this result.
Where DISTINCT
comes into play
Now, let’s take a look to the raw table data and we’ll discover that on 2015-04-02
the first user logged in twice using two different browsers. Let’s change the query
to count the unique users for each day, i.e. count the first user only once for 2015-04-02
:
1 | SELECT `date`, COUNT(DISTINCT `user_id`) AS nb |
+------------+----+
| date | nb |
+------------+----+
| 2015-04-01 | 2 |
| 2015-04-02 | 2 |
| NULL | 3 |
+------------+----+
3 rows in set (0.00 sec)
Surprise, surprise! We have 2
unique visitors for each of the two days and a big total of 3
which is not quite the calculation one would expect.
Sure, going back to the raw table data we can see there are only 3
unique values for user_id
in the table and that makes it impossible to return 4
or more as the result of COUNT(DISTINCT)
.
This is easy to see at a glance on our example table but not that easy on a production table that
contains thousands or millions of records.
When 2+2 equals 3
In order to find out and understand why MySQL returns this result let’s
analyze how GROUP BY
, WITH ROLLUP
and DISTINCT
work.
Let’s find out what rows are selected in each of the queries above. This can be done
easily by removing (commenting out) the GROUP BY
clause (together with its
WITH ROLLUP
modifier) and removing the aggregate function COUNT()
. I also added the
PK
field to easily identify the rows in our table and ORDER BY
to easily identify
the groups.
All values (without DISTINCT
)
GROUP BY
1 | SELECT `id`, `date`, `user_id` |
+----+------------+---------+
| id | date | user_id |
+----+------------+---------+
| 1 | 2015-04-01 | 1 |
| 4 | 2015-04-01 | 2 |
+----+------------+---------+
| 2 | 2015-04-02 | 1 |
| 3 | 2015-04-02 | 1 |
| 5 | 2015-04-02 | 3 |
+----+------------+---------+
5 rows in set (0.00 sec)
No surprise here, the query returns all the rows from the table. We can see there
are 2 entries having date = '2015-04-01'
and 3 entries with date = '2015-04-02'
.
Grouping by date
, rows 1 and 4 are the first group, rows 2, 3 and 5 are the second
group.
WITH ROLLUP
The WITH ROLLUP
modifier computes an extra row that is some kind of super-group.
It aggregates all the rows returned by the query above, creating a single group, behaving
like all the rows had the same value in the date
column. There are 5 rows in the set,
this is why the extra row generated by WITH ROLLUP
has value 5
in the nb
column.
We can simulate the computation of the extra row by running a modified version of the
query (without the WITH ROLLUP
modifier) that selects NULL
for the date
column;
this makes the GROUP BY
clause create a single big group.
1 | SELECT NULL AS `null_date`, COUNT(`user_id`) AS nb |
We cannot use date
as the name of the selected expression because it will conflict
with the table column date
and GROUP BY `date`
will use the table column instead.
+-----------+----+
| null_date | nb |
+-----------+----+
| NULL | 5 |
+-----------+----+
1 row in set (0.00 sec)
Asking for DISTINCT
values
GROUP BY
Let’s try now to figure out what rows are used to compute the result set when the
original query contains COUNT(DISTINCT `user_id`)
.
For that we need to add the DISTINCT
modifier to the SELECT
clause and drop out
the PK
column because it guarantees all the rows are distinct and defeats the purpose
of our query (we are interested only in distict combinations of date
and user_id
).
1 | SELECT DISTINCT `date`, `user_id` |
The rows selected for grouping are:
+------------+---------+
| date | user_id |
+------------+---------+
| 2015-04-01 | 1 |
| 2015-04-01 | 2 |
+------------+---------+
| 2015-04-02 | 1 |
| 2015-04-02 | 3 |
+------------+---------+
4 rows in set (0.00 sec)
The PK column is not there but we can guess the missing row has id
= 3 or id
= 2
(it does not really matter). These two rows have identical values on the date
and
user_id
fields. The DISTINCT
modifier guarantees only one of them is selected
for further processing by the query.
There are 2 groups, each of them having exactly 2 rows. This is consistent with the
result returned by the GROUP BY
query.
WITH ROLLUP
If we replace the values of the date
column with NULL
s in the result set above
we will see there are only 3 distinct rows (out of 4).
Let’s use the same trick as above to simulate what WITH ROLLUP
does.
The modified query is:
1 | SELECT NULL AS `null_date`, COUNT(DISTINCT `user_id`) AS nb |
The result:
+-----------+----+
| null_date | nb |
+-----------+----+
| NULL | 3 |
+-----------+----+
1 row in set (0.00 sec)
Again, the result matches the value returned by MySQL.
Conclusion
In order to compute the extra rows added by the WITH ROLLUP
modifier, MySQL does not use
the rows already computed for the GROUP BY
query without rollup. In some cases they can be
used but many times they cannot. This makes the correct computation of the summary rows impossible
on the client side (without getting the raw rows, which defeats the entire purpose of using
GROUP BY
in the first place).