How to Efficiently use SQL to Retrieve Data on Half Hour Intervals?

12,501

Solution 1

You can try this single query as well, it should return a result set with the totals in 30 minute groupings:

SELECT date, MIN(time) as time, SUM(subtotal) as total
FROM `Receipts`
WHERE `date` = '2012-07-30'
GROUP BY hour(time), floor(minute(time)/30)

To run this efficiently, add a composite index on the date and time columns.

You should get back a result set like:

+---------------------+--------------------+
| time                | total              |
+---------------------+--------------------+
| 2012-07-30 00:00:00 |        0.000000000 |
| 2012-07-30 00:30:00 |        0.000000000 |
| 2012-07-30 01:00:00 |        0.000000000 |
| 2012-07-30 01:30:00 |        0.000000000 |
| 2012-07-30 02:00:00 |        0.000000000 |
| 2012-07-30 02:30:00 |        0.000000000 |
| 2012-07-30 03:00:00 |        0.000000000 |
| 2012-07-30 03:30:00 |        0.000000000 |
| 2012-07-30 04:00:00 |        0.000000000 |
| 2012-07-30 04:30:00 |        0.000000000 |
| 2012-07-30 05:00:00 |        0.000000000 |
| ...
+---------------------+--------------------+

Solution 2

First, I would use a single DATETIME column, but using a DATE and TIME column will work.

You can do all the work in one pass using a single query:

select date,
       hour(`time`) hour_num, 
       IF(MINUTE(`time`) < 30, 0, 1) interval_num, 
       min(`time`) interval_begin,
       max(`time`) interval_end,
       sum(subtotal) sum_subtotal
 from receipts
where date='2012-07-31'
group by date, hour_num, interval_num;

Solution 3

UPDATE:

Since you aren't concerned with any "missing" rows, I'm also going to assume (probably wrongly) that you aren't concerned that the query might possibly return rows for periods that are not from 7AM to 12AM. This query will return your specified result set:

SELECT (HOUR(r.time)-7)*2+(MINUTE(r.time) DIV 30) AS i 
     , SUM(r.subtotal) AS sum_subtotal
  FROM Receipts r
 GROUP BY i
 ORDER BY i

This returns the period index (i) derived from an expression referencing the time column. For best performance of this query, you probably want to have a "covering" index available, for example:

ON Receipts(`time`,`subtotal`)

If you are going to include an equality predicate on the date column (which does not appear in your solution, but which does appear in the solution of the "selected" answer, then it would be good to have that column as a leading index in the "covering" index.

ON Receipts(`date`,`time`,`subtotal`)

If you want to ensure that you are not returning any rows for periods before 7AM, then you could simply add a HAVING i >= 0 clause to the query. (Rows for periods before 7AM would generate a negative number for i.)

SELECT (HOUR(r.time)-7)*2+(MINUTE(r.time) DIV 30) AS i 
     , SUM(r.subtotal) AS sum_subtotal
  FROM Receipts r
 GROUP BY i
HAVING i >= 0
 ORDER BY i

PREVIOUSLY:

I've assumed that you want a result set similar to the one you are currently returning, but in one fell swoop. This query will return the same 33 rows you are currently retrieving, but with an extra column identifying the period (0 - 33). This is as close to your current solution that I could get:

SELECT t.i
     , IFNULL(SUM(r.subtotal),0) AS sum_subtotal
  FROM (SELECT (d1.i + d2.i + d4.i + d8.i + d16.i + d32.i) AS i
             , ADDTIME('07:00:00',SEC_TO_TIME((d1.i+d2.i+d4.i+d8.i+d16.i+d32.i)*1800)) AS b_time
             , ADDTIME('07:30:00',SEC_TO_TIME((d1.i+d2.i+d4.i+d8.i+d16.i+d32.i)*1800)) AS e_time
          FROM (SELECT 0 i UNION ALL SELECT 1) d1 CROSS
          JOIN (SELECT 0 i UNION ALL SELECT 2) d2 CROSS
          JOIN (SELECT 0 i UNION ALL SELECT 4) d4 CROSS
          JOIN (SELECT 0 i UNION ALL SELECT 8) d8 CROSS
          JOIN (SELECT 0 i UNION ALL SELECT 16) d16 CROSS
          JOIN (SELECT 0 i UNION ALL SELECT 32) d32
        HAVING i <= 33
       ) t
  LEFT
  JOIN Receipts r ON r.time >= t.b_time AND r.time < t.e_time
 GROUP BY t.i
 ORDER BY t.i

Some important notes:

It looks like your current solution may be "missing" rows from Receipts whenever the the seconds is exactly equal to '59' or '00'.

It also looks like you aren't concerned with the date component, you are just getting a single value for all dates. (I may have misread that.) If so, the separation of the DATE and TIME columns helps with this, because you can reference the bare TIME column in your query.

It's easy to add a WHERE clause on the date column. e.g. to get the subtotal rollups for just a single day e.g. add a WHERE clause before the GROUP BY.

WHERE r.date = '2011-09-10'

A covering index ON Receipts(time,subtotal) (if you don't already have a covering index) may help with performance. (If you include an equality predicate on the date column (as in the WHERE clause above, the most suitable covering index would likely be ON Receipts(date,time,subtotal).

I've made an assumption that the time column is of datatype TIME. (If it isn't, then a small adjustment to the query (in the inline view aliased as t) is probably called for, to have the datatype of the (derived) b_time and e_time columns match the datatype of the time column in Receipts.

Some of proposed solutions in other answers are not guaranteed to return 33 rows, when there are no rows in Receipts within a given time period. "Missing rows" may not be an issue for you, but it is a frequent issue with timeseries and timeperiod data.

I've made the assumption that you would prefer to have a guarantee of 33 rows returned. The query above returns a subtotal of zero when no rows are found matching a time period. (I note that your current solution will return a NULL in that case. I've gone and wrapped that SUM aggregate in an IFNULL function, so that it will return a 0 when the SUM is NULL.)

So, the inline query aliased as t is an ugly mess, but it works fast. What it's doing is generating 33 rows, with distinct integer values 0 thru 33. At the same time, it derives a "begin time" and an "end time" that will be used to "match" each period to the time column on the Receipts table.

We take care not to wrap the time column from the Receipts table in any functions, but reference just the bare column. And we want to ensure we don't have any implicit conversion going on (which is why we want the datatypes of b_time and e__time to match. The ADDTIME and SEC_TO_TIME functions both return TIME datatype. (We can't get around doing the matching and the GROUP BY operations.)

The "end time" value for that last period is returned as "24:00:00", and we verify that this is a valid time for matching by running this test:

SELECT MAKETIME(23,59,59) < MAKETIME(24,0,0)

which is successful (returns a 1) so we're good there.

The derived columns (t.b_time and t.e_time) could be included in the resultset as well, but they aren't needed to create your array, and it's (likely) more efficient if you don't include them.


And one final note: for optimal performance, it may be beneficial to load the inline view aliased as t into an actual table (a temporary table would be fine.), and then you could reference the table in place of the inline view. The advantage of doing that is that you could create an index on that table.

Share:
12,501
PontusTrade
Author by

PontusTrade

Updated on July 04, 2022

Comments

  • PontusTrade
    PontusTrade almost 2 years

    Problem - Retrieve sum of subtotals on a half hour interval efficiently

    I am using MySQL and I have a table containing subtotals with different times. I want to retrieve the sum of these sales on a half hour interval from 7 am through 12 am. My current solution (below) works but takes 13 seconds to query about 150,000 records. I intend to have several million records in the future and my current method is too slow.

    How I can make this more efficient or if possible replace the PHP component with pure SQL? Also, would it help your solution to be even more efficient if I used Unix timestamps instead of having a date and time column?

    Table Name - Receipts

    subtotal    date        time      sale_id
    --------------------------------------------
       6        09/10/2011  07:20:33     1
       5        09/10/2011  07:28:22     2
       3        09/10/2011  07:40:00     3
       5        09/10/2011  08:05:00     4
       8        09/10/2011  08:44:00     5
    ...............
      10        09/10/2011  18:40:00     6
       5        09/10/2011  23:05:00     7
    

    Desired Result

    An array like this:

    • Half hour 1 ::: (7:00 to 7:30) => Sum of Subtotal is 11
    • Half hour 2 ::: (7:30 to 8:00) => Sum of Subtotal is 3
    • Half hour 3 ::: (8:00 to 8:30) => Sum of Subtotal is 5
    • Half hour 4 ::: (8:30 to 9:00) => Sum of Subtotal is 8

    Current Method

    The current way uses a for loop which starts at 7 am and increments 1800 seconds, equivalent to a half hour. As a result, this makes about 34 queries to the database.

    for($n = strtotime("07:00:00"), $e = strtotime("23:59:59"); $n <= $e; $n += 1800) {  
    
        $timeA = date("H:i:s", $n);
        $timeB = date("H:i:s", $n+1799);
    
        $query = $mySQL-> query ("SELECT SUM(subtotal)
                                  FROM Receipts WHERE time > '$timeA' 
                                  AND time < '$timeB'");
    
        while ($row = $query-> fetch_object()) {
            $sum[] = $row;
        }
    }
    

    Current Output

    Output is just an array where:

    • [0] represents 7 am to 7:30 am
    • [1] represents 7:30 am to 8:00 am
    • [33] represents 11:30 pm to 11:59:59 pm.

      array ("0" => 10000, "1" => 20000, .............. "33" => 5000);