How do I calculate a running SUM on a SQLite query?

16,189

Solution 1

You can do it by joining the table with itself (performing a so-called Cartesian or cross join). See the following example.

SELECT a.name, a.gdppc, SUM(b.gdppc)
FROM gdppc AS a, gdppc AS b WHERE b.gdppc <= a.gdppc 
GROUP BY b.id ORDER BY a.gdppc;

Given a table containing countries and their per capita GDP it will give you a running total of the GDP figure.

Democratic Republic of Congo|329.645|329.645
Zimbabwe|370.465|700.11
Liberia|385.417|1085.527
Burundi|399.657|1485.184
Eritrea|678.954|2164.138
Niger|711.877|2876.015
Central African Republic|743.945|3619.96
Sierra Leone|781.594|4401.554
Togo|833.803|5235.357
Malawi|867.063|6102.42
Mozambique|932.511|7034.931
...

Note that this can be a very resource-intensive operation, because if a table has N elements it will create a temporary table with N*N elements. I would not perform it on a large table.

Solution 2

As of SQLite 3.25.0, since 2018-09-15, window functions and their keyword OVER are supported. The answer to your question is now easy:

SELECT Country, Gdp, SUM(Gdp) OVER (ROWS UNBOUNDED PRECEDING)
FROM CountryGdp;

This is the minimal query that does what you request, but it doesn't define any ordering so here is a more proper way of doing it.

SELECT
    Country,
    Gdp,
    SUM(Gdp) OVER (
        ORDER BY Country -- Window ordering (not necessarily the same as result ordering!)
        ROWS BETWEEN -- Window for the SUM includes these rows:
            UNBOUNDED PRECEDING -- all rows before current one in window ordering
            AND CURRENT ROW -- up to and including current row.
        ) AS RunningTotal
FROM CountryGdp
ORDER BY Country;

In any way, the query should run in O(N) time.

Solution 3

Cross join solutions like Diomidis Spinellis suggested take O(N^2) time. A recursive CTE can work faster, if you can stomach the convoluted code.

This produces the same output as his.

WITH RECURSIVE running(id, name, gdppc, rt) AS (
    SELECT row1._rowid_, row1.name, row1.gdppc, COALESCE(row1.gdppc,0)
    FROM gdppc AS row1
    WHERE row1._rowid_ = (
        SELECT a._rowid_
        FROM gdppc AS a
        ORDER BY a.gdppc, a.name, a._rowid_
        LIMIT 1)
    UNION ALL
    SELECT row_n._rowid_, row_n.name, row_n.gdppc, COALESCE(row_n.gdppc,0)+running.rt
    FROM gdppc AS row_n INNER JOIN running
    ON row_n._rowid_ = (
        SELECT a._rowid_
        FROM gdppc AS a
        WHERE (a.gdppc, a.name, a._rowid_) > (running.gdppc, running.name, running.id)
        ORDER BY a.gdppc, a.name, a._rowid_
        LIMIT 1))
SELECT running.name, running.gdppc, running.rt
FROM running;

Ordering and comparisons take care of duplicates, COALESCE is there to ignore NULLs.

If you have a good index, this should be O(N log N). Since SQLite doesn't support cursors, an O(N) solution probably doesn't exist without relying on an external application.

Share:
16,189

Related videos on Youtube

Hugo
Author by

Hugo

Updated on April 15, 2020

Comments

  • Hugo
    Hugo over 3 years

    How do I get a column that is the sum of all the values before of another column?

    • OMG Ponies
      OMG Ponies over 12 years
      Please provide example data and expected output.
  • Hugo
    Hugo over 12 years
    Thank you. In your example, what is the TABLE name?
  • Andriy M
    Andriy M over 12 years
    @Diomidis Spinellis: Actually it's not a cross join, but rather a triangular join.
  • Hugo
    Hugo over 12 years
    I am sorry but I didn´t understand. In the FROM statement where should I write my TABLE name?
  • Diomidis Spinellis
    Diomidis Spinellis over 12 years
    @Hugo gdppc is the table name
  • Diomidis Spinellis
    Diomidis Spinellis over 12 years
    @Andriy M Right! The WHERE clause makes the cross join a triangular one.
  • Hugo
    Hugo over 12 years
    So, "a" and "b" are the fields?
  • Andriy M
    Andriy M over 12 years
    @Hugo: a and b are the aliases of each of the two instances of gdppc. But the gdppc table seems also to have a column named the same, gdppc. That is probably the source of your confusion.
  • Fermat's Little Student
    Fermat's Little Student over 6 years
    great solution!
  • relatively_random
    relatively_random almost 6 years
    I believe it's possible to get N log N with a CTE. See my answer.
  • jayarjo
    jayarjo over 4 years
    And what if you pre-select some subset of the table and do the cross-join on it, will it be a better from performance viewpoint?
  • jayarjo
    jayarjo over 4 years
    Ouch... I wonder one thing only, won't it be better to simply retrieve the whole set and then do the grouping and calculation in the actual programming code? Why bother with this incomprehensible syntax.
  • relatively_random
    relatively_random almost 4 years
    Years since this best answer was accepted, SQLite finally has a better solution. See my answer below: stackoverflow.com/a/58339386/4247453
  • Spaceship222
    Spaceship222 over 3 years
    what is id here?
  • Matt Joiner
    Matt Joiner about 3 years
    Does this really run in O(N)? In my experience it's still O(N^2)
  • relatively_random
    relatively_random about 3 years
    Back when I wrote the answer, I think I checked the query plan opcode output. It used coroutines to get O(N) runtime. I doubt things have changed for the worse with the newer versions, but I guess it's possible. Also, who knows what output it can produce for slightly changed ordering, indices or whatever. SQL doesn't really promise any performance characteristics AFAIK.
  • relatively_random
    relatively_random about 3 years
    In other words: good point, and you should check for yourself if it matters for your use case.
  • Matt Joiner
    Matt Joiner about 3 years
    I couldn't improve on this: Windowing functions will unnecessarily recompute large aggregate values, but recursive CTE can avoid this. Thank you for this answer.
  • Srdjan
    Srdjan almost 3 years
    this works only if all gdppc are different. if all values are equal, every row sum would be equalt to end sum. to correct this you should replace WHERE b.gdppc <= a.gdppc with WHERE b.id <= a.id
  • ANKIT
    ANKIT over 2 years
    Not recommended if data size is large. I tested this method on a table with 5000 entries and the app was slow to respond . Any one knows how to do this in room database.
  • Diomidis Spinellis
    Diomidis Spinellis about 2 years
    @ANKIT Try the other answers, which are based on SQLite functionality that was added in the years that passed since I answered the question. Measure before adopting a specific solution, and let us know.
  • ANKIT
    ANKIT about 2 years
    I tried to use OVER clause as suggested by relatively_unknown but window functions are not supported in room database and I got compile errors
  • Davor Josipovic
    Davor Josipovic over 1 year
    The reason why it is O(N) is because of how window functions are implemented. See here. Sqlite will never compute each partition from scratch, but will use xStep and xInverse to "roll over" the output. Simple and efficient.
  • relatively_random
    relatively_random over 1 year
    @DavorJosipovic This just means it's capable of doing it in O(N). Unfortunately, it doesn't guarantee it. Imagine if the window ordering was by some other random column (e.g. Population). The query would need to build a temporary table and look up the values for every row, which would be O(N log N) in total. In this example, none of that is needed, of course, but nobody ever guarantees that the query planner will be smart enough to properly optimize any specific query you give it. Though I would be surprised if the query was O(N^2) like Matt Joiner said.