axiac@web

DISTINCT vs. GROUP BY WITH ROLLUP

or why 2+2 is not always 4

WITH ROLLUP is a GROUP BY modifier that causes extra rows to be added to the result set produced by the query. These rows represent higher-level (or super-aggregate) summary operations.

A shallow examination of the effects of WITH ROLLUP on a query like:

SELECT user_id, COUNT(product) AS nb_products, SUM(amount) AS total FROM cart GROUP BY user_id

can lead to the erroneous conclusion that the WITH ROLLUP modifier is just a courtesy of the database implementors to the front-end developers and the extra rows added by it could be computed on the client-side as well. Big mistake!

A deeper investigation on the GROUP BY aggregate functions reveals that there are functions whose values generated by WITH ROLLUP cannot be computed on the client side using only the values returned by the GROUP BY query. These are the statistical functions (STDDEV_POP(), VAR_POP() a.s.o.) that are not used by many applications but, surprisingly, also the more commonly used SUM() and COUNT(), when they are combined with DISTINCT.

Case Study

Let’s create a small MySQL table in the database and put some rows in it:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
CREATE TABLE `logins` (
`id` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
`user_id` INT(11) NOT NULL,
`date` DATE NOT NULL,
`browser` VARCHAR(255) DEFAULT NULL,
PRIMARY KEY (`id`)
);

INSERT INTO `logins` (`user_id`, `date`, `browser`)
VALUES
(1, '2015-04-01', 'Firefox'),
(1, '2015-04-02', 'Firefox'),
(1, '2015-04-02', 'Chrome'),
(2, '2015-04-01', 'Chrome'),
(3, '2015-04-02', 'Chrome');

It is a naïve log table that stores the customer login dates and the browsers they are using.

Let’s count the number of logins for each day:

1
2
3
4
SELECT `date`, COUNT(`user_id`) AS nb
FROM `logins`
GROUP BY `date`
WITH ROLLUP;

The result is straightforward:

+------------+----+
| date       | nb |
+------------+----+
| 2015-04-01 |  2 |
| 2015-04-02 |  3 |
| NULL       |  5 |
+------------+----+
3 rows in set (0.01 sec)

The WITH ROLLUP clause produced a nice total and doesn’t provide any surprise: 2 + 3 = 5, any first-grade child can assure you about the correctness of this result.

Where DISTINCT comes into play

Now, let’s take a look to the raw table data and we’ll discover that on 2015-04-02 the first user logged in twice using two different browsers. Let’s change the query to count the unique users for each day, i.e. count the first user only once for 2015-04-02:

1
2
3
4
SELECT `date`, COUNT(DISTINCT `user_id`) AS nb
FROM `logins`
GROUP BY `date`
WITH ROLLUP;
+------------+----+
| date       | nb |
+------------+----+
| 2015-04-01 |  2 |
| 2015-04-02 |  2 |
| NULL       |  3 |
+------------+----+
3 rows in set (0.00 sec)

Surprise, surprise! We have 2 unique visitors for each of the two days and a big total of 3 which is not quite the calculation one would expect.

Sure, going back to the raw table data we can see there are only 3 unique values for user_id in the table and that makes it impossible to return 4 or more as the result of COUNT(DISTINCT). This is easy to see at a glance on our example table but not that easy on a production table that contains thousands or millions of records.

When 2+2 equals 3

In order to find out and understand why MySQL returns this result let’s analyze how GROUP BY, WITH ROLLUP and DISTINCT work.

Let’s find out what rows are selected in each of the queries above. This can be done easily by removing (commenting out) the GROUP BY clause (together with its WITH ROLLUP modifier) and removing the aggregate function COUNT(). I also added the PK field to easily identify the rows in our table and ORDER BY to easily identify the groups.

All values (without DISTINCT)

GROUP BY

1
2
3
SELECT `id`, `date`, `user_id`
FROM `logins`
ORDER BY `date`;
+----+------------+---------+
| id | date       | user_id |
+----+------------+---------+
|  1 | 2015-04-01 |       1 |
|  4 | 2015-04-01 |       2 |
+----+------------+---------+
|  2 | 2015-04-02 |       1 |
|  3 | 2015-04-02 |       1 |
|  5 | 2015-04-02 |       3 |
+----+------------+---------+
5 rows in set (0.00 sec)

No surprise here, the query returns all the rows from the table. We can see there are 2 entries having date = '2015-04-01' and 3 entries with date = '2015-04-02'. Grouping by date, rows 1 and 4 are the first group, rows 2, 3 and 5 are the second group.

WITH ROLLUP

The WITH ROLLUP modifier computes an extra row that is some kind of super-group. It aggregates all the rows returned by the query above, creating a single group, behaving like all the rows had the same value in the date column. There are 5 rows in the set, this is why the extra row generated by WITH ROLLUP has value 5 in the nb column.

We can simulate the computation of the extra row by running a modified version of the query (without the WITH ROLLUP modifier) that selects NULL for the date column; this makes the GROUP BY clause create a single big group.

1
2
3
SELECT NULL AS `null_date`, COUNT(`user_id`) AS nb
FROM `logins`
GROUP BY `null_date`;

We cannot use date as the name of the selected expression because it will conflict with the table column date and GROUP BY `date` will use the table column instead.

+-----------+----+
| null_date | nb |
+-----------+----+
|      NULL |  5 |
+-----------+----+
1 row in set (0.00 sec)

Asking for DISTINCT values

GROUP BY

Let’s try now to figure out what rows are used to compute the result set when the original query contains COUNT(DISTINCT `user_id`).

For that we need to add the DISTINCT modifier to the SELECT clause and drop out the PK column because it guarantees all the rows are distinct and defeats the purpose of our query (we are interested only in distict combinations of date and user_id).

1
2
3
SELECT DISTINCT `date`, `user_id`
FROM `logins`
ORDER BY `date`;

The rows selected for grouping are:

+------------+---------+
| date       | user_id |
+------------+---------+
| 2015-04-01 |       1 |
| 2015-04-01 |       2 |
+------------+---------+
| 2015-04-02 |       1 |
| 2015-04-02 |       3 |
+------------+---------+
4 rows in set (0.00 sec)

The PK column is not there but we can guess the missing row has id = 3 or id = 2 (it does not really matter). These two rows have identical values on the date and user_id fields. The DISTINCT modifier guarantees only one of them is selected for further processing by the query.

There are 2 groups, each of them having exactly 2 rows. This is consistent with the result returned by the GROUP BY query.

WITH ROLLUP

If we replace the values of the date column with NULLs in the result set above we will see there are only 3 distinct rows (out of 4).

Let’s use the same trick as above to simulate what WITH ROLLUP does. The modified query is:

1
2
3
SELECT NULL AS `null_date`, COUNT(DISTINCT `user_id`) AS nb
FROM `logins`
GROUP BY `null_date`;

The result:

+-----------+----+
| null_date | nb |
+-----------+----+
|      NULL |  3 |
+-----------+----+
1 row in set (0.00 sec)

Again, the result matches the value returned by MySQL.

Conclusion

In order to compute the extra rows added by the WITH ROLLUP modifier, MySQL does not use the rows already computed for the GROUP BY query without rollup. In some cases they can be used but many times they cannot. This makes the correct computation of the summary rows impossible on the client side (without getting the raw rows, which defeats the entire purpose of using GROUP BY in the first place).

Comments