or why 2+2 is not always 4
WITH ROLLUP is a GROUP BY modifier that causes extra rows to be added to
the result set produced by the query. These rows represent higher-level (or super-aggregate)
summary operations.
A shallow examination of the effects of WITH ROLLUP on a query like:
SELECT user_id, COUNT(product) AS nb_products, SUM(amount) AS total FROM cart GROUP BY user_id
can lead to the erroneous conclusion that the WITH ROLLUP modifier is just a courtesy
of the database implementors to the front-end developers and the extra rows added by it
could be computed on the client-side as well. Big mistake!
A deeper investigation on the GROUP BY aggregate functions reveals that there are
functions whose values generated by WITH ROLLUP cannot be computed on the client side
using only the values returned by the GROUP BY query. These are the statistical functions
(STDDEV_POP(), VAR_POP() a.s.o.) that are not used by many applications
but, surprisingly, also the more commonly used SUM() and COUNT(),
when they are combined with DISTINCT.
Case Study
Let’s create a small MySQL table in the database and put some rows in it:
1 | CREATE TABLE `logins` ( |
It is a naïve log table that stores the customer login dates and the browsers they are using.
Let’s count the number of logins for each day:
1 | SELECT `date`, COUNT(`user_id`) AS nb |
The result is straightforward:
+------------+----+
| date | nb |
+------------+----+
| 2015-04-01 | 2 |
| 2015-04-02 | 3 |
| NULL | 5 |
+------------+----+
3 rows in set (0.01 sec)
The WITH ROLLUP clause produced a nice total and doesn’t provide any surprise: 2 + 3 = 5,
any first-grade child can assure you about the correctness of this result.
Where DISTINCT comes into play
Now, let’s take a look to the raw table data and we’ll discover that on 2015-04-02
the first user logged in twice using two different browsers. Let’s change the query
to count the unique users for each day, i.e. count the first user only once for 2015-04-02:
1 | SELECT `date`, COUNT(DISTINCT `user_id`) AS nb |
+------------+----+
| date | nb |
+------------+----+
| 2015-04-01 | 2 |
| 2015-04-02 | 2 |
| NULL | 3 |
+------------+----+
3 rows in set (0.00 sec)
Surprise, surprise! We have 2 unique visitors for each of the two days and a big total of 3
which is not quite the calculation one would expect.
Sure, going back to the raw table data we can see there are only 3 unique values for user_id
in the table and that makes it impossible to return 4 or more as the result of COUNT(DISTINCT).
This is easy to see at a glance on our example table but not that easy on a production table that
contains thousands or millions of records.
When 2+2 equals 3
In order to find out and understand why MySQL returns this result let’s
analyze how GROUP BY, WITH ROLLUP and DISTINCT work.
Let’s find out what rows are selected in each of the queries above. This can be done
easily by removing (commenting out) the GROUP BY clause (together with its
WITH ROLLUP modifier) and removing the aggregate function COUNT(). I also added the
PK field to easily identify the rows in our table and ORDER BY to easily identify
the groups.
All values (without DISTINCT)
GROUP BY
1 | SELECT `id`, `date`, `user_id` |
+----+------------+---------+
| id | date | user_id |
+----+------------+---------+
| 1 | 2015-04-01 | 1 |
| 4 | 2015-04-01 | 2 |
+----+------------+---------+
| 2 | 2015-04-02 | 1 |
| 3 | 2015-04-02 | 1 |
| 5 | 2015-04-02 | 3 |
+----+------------+---------+
5 rows in set (0.00 sec)
No surprise here, the query returns all the rows from the table. We can see there
are 2 entries having date = '2015-04-01' and 3 entries with date = '2015-04-02'.
Grouping by date, rows 1 and 4 are the first group, rows 2, 3 and 5 are the second
group.
WITH ROLLUP
The WITH ROLLUP modifier computes an extra row that is some kind of super-group.
It aggregates all the rows returned by the query above, creating a single group, behaving
like all the rows had the same value in the date column. There are 5 rows in the set,
this is why the extra row generated by WITH ROLLUP has value 5 in the nb column.
We can simulate the computation of the extra row by running a modified version of the
query (without the WITH ROLLUP modifier) that selects NULL for the date column;
this makes the GROUP BY clause create a single big group.
1 | SELECT NULL AS `null_date`, COUNT(`user_id`) AS nb |
We cannot use date as the name of the selected expression because it will conflict
with the table column date and GROUP BY `date` will use the table column instead.
+-----------+----+
| null_date | nb |
+-----------+----+
| NULL | 5 |
+-----------+----+
1 row in set (0.00 sec)
Asking for DISTINCT values
GROUP BY
Let’s try now to figure out what rows are used to compute the result set when the
original query contains COUNT(DISTINCT `user_id`).
For that we need to add the DISTINCT modifier to the SELECT clause and drop out
the PK column because it guarantees all the rows are distinct and defeats the purpose
of our query (we are interested only in distict combinations of date and user_id).
1 | SELECT DISTINCT `date`, `user_id` |
The rows selected for grouping are:
+------------+---------+
| date | user_id |
+------------+---------+
| 2015-04-01 | 1 |
| 2015-04-01 | 2 |
+------------+---------+
| 2015-04-02 | 1 |
| 2015-04-02 | 3 |
+------------+---------+
4 rows in set (0.00 sec)
The PK column is not there but we can guess the missing row has id = 3 or id = 2
(it does not really matter). These two rows have identical values on the date and
user_id fields. The DISTINCT modifier guarantees only one of them is selected
for further processing by the query.
There are 2 groups, each of them having exactly 2 rows. This is consistent with the
result returned by the GROUP BY query.
WITH ROLLUP
If we replace the values of the date column with NULLs in the result set above
we will see there are only 3 distinct rows (out of 4).
Let’s use the same trick as above to simulate what WITH ROLLUP does.
The modified query is:
1 | SELECT NULL AS `null_date`, COUNT(DISTINCT `user_id`) AS nb |
The result:
+-----------+----+
| null_date | nb |
+-----------+----+
| NULL | 3 |
+-----------+----+
1 row in set (0.00 sec)
Again, the result matches the value returned by MySQL.
Conclusion
In order to compute the extra rows added by the WITH ROLLUP modifier, MySQL does not use
the rows already computed for the GROUP BY query without rollup. In some cases they can be
used but many times they cannot. This makes the correct computation of the summary rows impossible
on the client side (without getting the raw rows, which defeats the entire purpose of using
GROUP BY in the first place).