Cumulative sum over a set of rows in MySQL
Solution 1
UPDATE
MySQL 8.0 introduces "window functions", functionality equivalent to SQL Server "window functions" (with partitioning and ordering provided by Transact-SQL OVER syntax), and Oracle "analytic functions".
MySQL Reference Manual 12.21 Window Functions https://dev.mysql.com/doc/refman/8.0/en/window-functions.html
The answer provided here is an approach for MySQL versions prior to 8.0.
ORIGINAL ANSWER
MySQL doesn't provide the type of analytic function you would use to get a running "cumulative sum", like the analytic functions available in other DBMSs (such as Oracle or SQL Server).
But, it is possible to emulate some analytic functions, using MySQL.
There are (at least) two workable approaches:
One is to use a correlated subquery to get the subtotal. This approach can be expensive on large sets, and complicated if the predicates on the outer query are complicated. It really depends on how complicated that "multiple joins on multiple tables" is. (Unfortunately, MySQL does not support CTEs either.)
The other approach is to make use of MySQL user variables, to do some control break processing. The "trick" here is to get the results from your query sorted (using an ORDER BY) and then wrap your query in another query.
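As a rough illustration of the first approach, here is a minimal sketch of a correlated subquery over the sample rows from the question. It is not the asker's real query: the table name `t` is hypothetical and stands in for the result of the complex query, and SQLite (through Python's standard `sqlite3` module) stands in for MySQL only so the sketch is self-contained. The SQL itself uses nothing SQLite-specific.

```python
import sqlite3

# Sample rows from the question: (id, day, hr, amount).
ROWS = [
    (1, 1, 1, 10), (1, 1, 2, 25), (1, 1, 3, 30),
    (1, 2, 1, 10), (1, 2, 2, 40), (1, 2, 2, 30),
    (2, 1, 1, 10), (2, 1, 2, 15), (2, 1, 3, 30),
    (2, 2, 1, 10), (2, 2, 2, 20), (2, 2, 2, 30),
]

conn = sqlite3.connect(":memory:")
# ASSUMPTION: "t" is a hypothetical table holding the output of the complex query.
conn.execute("CREATE TABLE t (id INTEGER, day INTEGER, hr INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", ROWS)

# For every outer row, sum the amounts of all rows in the same (id, day)
# group whose hr is less than or equal to the outer row's hr.
CORRELATED_SQL = """
SELECT o.id, o.day, o.hr, o.amount,
       (SELECT SUM(s.amount)
          FROM t s
         WHERE s.id = o.id
           AND s.day = o.day
           AND s.hr <= o.hr) AS cumulative_total
  FROM t o
 ORDER BY o.id, o.day, o.hr
"""

correlated_result = conn.execute(CORRELATED_SQL).fetchall()
for row in correlated_result:
    print(row)
```

Note that the two rows with id=1, day=2, hr=2 both receive a total of 80 here, because the subquery has no way to tell the tied rows apart. Getting 50 and then 80, as the question specifies, would need some additional column to break the tie.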
I'll give an example of the latter approach.
Because of the order in which MySQL performs operations, the cumulative_total column needs to be computed before the values of id and day from the current row are saved into user variables. It's just easiest to put this column first.
The inline view aliased as i (in the query below) is just there to initialize the user variables, just in case these are already set in the session. If those already have values assigned, we want to ignore their current values, and the easiest way to do that is to initialize them.
Your original query gets wrapped in parentheses, and is given an alias, c in the example below. The only change to your original query is the addition of an ORDER BY clause, so we can be sure that we process the rows from the query in sequence.
The outer select checks whether the id and day values from the current row "match" the previous row. If they do, we add the amount from the current row to the cumulative subtotal. If they don't match, then we reset the cumulative subtotal to zero, and add the amount from the current row (or, more simply, just assign the amount from the current row).
After we have done the computation of the cumulative total, we save the id and day values from the current row into user variables, so they are available when we process the next row.
For example:
SELECT IF(@prev_id = c.id AND @prev_day = c.day
         ,@cumtotal := @cumtotal + c.amount
         ,@cumtotal := c.amount) AS cumulative_total
     , @prev_id  := c.id  AS `id`
     , @prev_day := c.day AS `day`
     , c.hr
     , c.amount AS `amount`
  FROM ( SELECT @prev_id  := NULL
              , @prev_day := NULL
              , @cumtotal := 0
       ) i
  JOIN (
         -- the original query, with an ORDER BY clause added
         SELECT id, day, hr, amount
           FROM ( /* multiple joins on multiple tables */ ) a
           LEFT JOIN
                ( /* unions on multiple tables */ ) b
             ON a.id = b.id
          ORDER BY 1,2,3
       ) c
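To make the control-break logic easier to follow, here is a minimal sketch of the same row-by-row processing written in Python rather than MySQL. This is only an illustration of the idea, not something the query runs: the function name `running_totals` is hypothetical, and the local variables simply play the roles of the `@prev_id`, `@prev_day` and `@cumtotal` user variables. The input is the sample data from the question, already sorted by id, day, hr.

```python
def running_totals(rows):
    """Yield (cumulative_total, id, day, hr, amount) for rows sorted by id, day, hr.

    Hypothetical helper: mirrors what the outer SELECT does for each row.
    """
    prev_id = prev_day = None  # like initializing @prev_id and @prev_day to NULL
    cumtotal = 0               # like initializing @cumtotal to 0
    for id_, day, hr, amount in rows:
        if prev_id == id_ and prev_day == day:
            # Same (id, day) group as the previous row: keep accumulating.
            cumtotal += amount
        else:
            # Control break: a new group starts, so restart from this amount.
            cumtotal = amount
        yield (cumtotal, id_, day, hr, amount)
        # Only after the total is computed do we remember the current keys.
        prev_id, prev_day = id_, day


# Sample rows from the question: (id, day, hr, amount), in sorted order.
ROWS = [
    (1, 1, 1, 10), (1, 1, 2, 25), (1, 1, 3, 30),
    (1, 2, 1, 10), (1, 2, 2, 40), (1, 2, 2, 30),
    (2, 1, 1, 10), (2, 1, 2, 15), (2, 1, 3, 30),
    (2, 2, 1, 10), (2, 2, 2, 20), (2, 2, 2, 30),
]

control_break_result = list(running_totals(ROWS))
for row in control_break_result:
    print(row)
```

Because the total is updated one row at a time, the two rows with id=1, day=2, hr=2 get 50 and then 80, in whatever order the sorted input presents them.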
If it's necessary to return the columns in a different order, with cumulative total as the last column, then one option is to wrap that whole statement in a set of parens, and use that query as an inline view:
SELECT d.id
     , d.day
     , d.hr
     , d.amount
     , d.cumulative_total
  FROM (
         -- query from above
       ) d
Solution 2
If you're on MySQL 8 or later, you should use window functions for this. Your query would read:
SELECT
  id, day, hr, amount,
  SUM(amount) OVER (PARTITION BY id, day ORDER BY hr) AS `cumulative total`
FROM t
Where t is your table b left joined to a. Some notes:
- The PARTITION BY clause guarantees that you get a cumulative sum per id and day, so each day, we start summing afresh
- The ORDER BY clause defines by what ordering the cumulation should happen
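The two notes above can be tried out with a minimal sketch on the sample rows from the question. Several things here are assumptions made for illustration: the table `t` is a plain table rather than the real join, the `seq` column is a hypothetical tiebreaker that does not exist in the question, and SQLite (3.25 or later, through Python's `sqlite3` module) stands in for MySQL 8 so the sketch is self-contained. The window function syntax shown is standard SQL.

```python
import sqlite3

# Sample rows from the question: (id, day, hr, amount).
ROWS = [
    (1, 1, 1, 10), (1, 1, 2, 25), (1, 1, 3, 30),
    (1, 2, 1, 10), (1, 2, 2, 40), (1, 2, 2, 30),
    (2, 1, 1, 10), (2, 1, 2, 15), (2, 1, 3, 30),
    (2, 2, 1, 10), (2, 2, 2, 20), (2, 2, 2, 30),
]

conn = sqlite3.connect(":memory:")
# ASSUMPTION: "t" is hypothetical, and "seq" is an invented tiebreaker column
# that just records insertion order (1..12).
conn.execute(
    "CREATE TABLE t (seq INTEGER PRIMARY KEY, id INTEGER, day INTEGER,"
    " hr INTEGER, amount INTEGER)"
)
conn.executemany("INSERT INTO t (id, day, hr, amount) VALUES (?, ?, ?, ?)", ROWS)

# total_by_hr:     the query from this answer (ORDER BY hr only).
# total_by_hr_seq: same, but the ordering is made unique with the tiebreaker.
WINDOW_SQL = """
SELECT id, day, hr, amount,
       SUM(amount) OVER (PARTITION BY id, day ORDER BY hr)      AS total_by_hr,
       SUM(amount) OVER (PARTITION BY id, day ORDER BY hr, seq) AS total_by_hr_seq
  FROM t
 ORDER BY seq
"""

window_result = conn.execute(WINDOW_SQL).fetchall()
for row in window_result:
    print(row)
```

With ORDER BY hr alone, rows that tie on hr are peers and share one total (80 and 80 for id=1, day=2, hr=2), since the default frame is a RANGE frame. If the per-row totals from the question (50, then 80) are needed, the ordering has to be made unique, which is what the hypothetical `seq` column does in this sketch.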
user1051577
Updated on January 10, 2020
Comments
-
user1051577 over 4 years
I have a complex query(containing multiple joins, unions) that returns a set of rows containing id, day, hr, amount. The output of the query looks like this:
id  day  hr  amount
1   1    1   10
1   1    2   25
1   1    3   30
1   2    1   10
1   2    2   40
1   2    2   30
2   1    1   10
2   1    2   15
2   1    3   30
2   2    1   10
2   2    2   20
2   2    2   30
I need to find cumulative total for each id, for every hour of the day. The output should be like this:
id  day  hr  amount  cumulative total
1   1    1   10      10
1   1    2   25      35
1   1    3   30      65
1   2    1   10      10
1   2    2   40      50
1   2    2   30      80
2   1    1   10      10
2   1    2   15      25
2   1    3   30      55
2   2    1   10      10
2   2    2   20      30
2   2    2   30      60
My initial query that produces the first output looks like this:
select id, day, hr, amount from ( //multiple joins on multiple tables)a left join (//unions on multiple tables)b on a.id=b.id;
What's the SQL query to get the cumulative sum as described in the second output? SET should not be used in the solution.
Thanks.
-
SQL.injection almost 11 years
you can write a simple query... look at the answer below, it is a simple query that does the trick.
-
spencer7593 almost 11 years
In the simple query, the OP original query would need to be specified twice, in place of "foo". (If any change is made to the original query, it will need to be modified in two places.) The output from the simple query does not meet the specification, at least in the case of the fifth row in the OP example, since there are two rows id=1 day=2 hr=2. According to the spec, the subtotal of the fifth row should not include the amount from the sixth row.
-
spencer7593 almost 11 years
This does not return the specified resultset if there are duplicates of (id, day, hr), as in the fifth and sixth rows in the OP example. In the OP case, foo is not a simple table, but rather a query that involves multiple tables. This query will need to be specified twice (in place of foo), which means MySQL will materialize that query twice. Unfortunately, MySQL does not yet support Common Table Expressions (CTE) which would be one way to avoid duplicating the subquery.
-
SQL.injection almost 11 years
very sharp indeed :) didn't notice that detail in the initial data
-
spencer7593 almost 11 years
If we did have a guarantee of uniqueness (on the columns we need for the join), then the semi-join approach used by the query in this answer would return the specified result.
-
SQL.injection almost 11 years
yeah, I know. I didn't notice the duplicates when I did my initial inspection of the problem...
-
spencer7593 almost 11 years
N.B. The answers provided in the question that was marked as a "duplicate" will not produce the resultset you specified. Those will produce a running total for the entire set, and not for each group. Those queries also do not handle the duplicate instances of the grouping key.
-
Lukasz Szozda over 5 years
It is really nice to see more and more RDBMSes support such constructs. The next big thing I would like to see implemented is MATCH_RECOGNIZE. By the way, your blog is really great (10 SQL Tricks That You Didn’t Think Were Possible inspired me :)
-
Lukas Eder over 5 years
@LukaszSzozda: Thanks for your nice words. I'm not sure if MATCH_RECOGNIZE will be a priority on MySQL anytime soon, but you never know!
-
Thomas Lobker about 5 years
This approach is very fast as well. I've had some trouble getting a cumulative sum on about 15 columns in a couple of million rows. I've tried many different approaches. The solution from @lukas-eder is nice, but it takes longer and longer as the number of rows (to sum) increases. This solution from @spencer7593 doesn't really care how large your set is and it will stay fast. 25000 rows in 0.19 sec vs 4.43 sec
-
haneulkim about 4 years
How come we don't get a cumulative sum if we don't add an ORDER BY clause? Even though I don't give an order, shouldn't it add each row to a new sum?
-
Lukas Eder about 4 years
@Ambleu: If you use ORDER BY, then RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is implicit (cumulative sum semantics). If you don't use ORDER BY, then RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING is implicit (aggregation of entire partition semantics). This is actually quite a useful default behaviour.
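The difference between the two implicit frames described in the last comment can be seen in a small sketch. It assumes a hypothetical three-row table `t` holding one day of the sample data, and uses SQLite (3.25 or later, through Python's `sqlite3` module) as a stand-in for MySQL 8.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# ASSUMPTION: "t" is a hypothetical table with the rows for id=1, day=1.
conn.execute("CREATE TABLE t (hr INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 10), (2, 25), (3, 30)])

# with_order_by:    frame ends at the current row, giving a running total.
# without_order_by: frame covers the whole partition, giving the grand total.
FRAME_SQL = """
SELECT hr, amount,
       SUM(amount) OVER (ORDER BY hr) AS with_order_by,
       SUM(amount) OVER ()            AS without_order_by
  FROM t
 ORDER BY hr
"""

frame_result = conn.execute(FRAME_SQL).fetchall()
for row in frame_result:
    print(row)
```

With ORDER BY the totals grow row by row (10, 35, 65). Without it, every row shows the partition total of 65.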