How to Efficiently use SQL to Retrieve Data on Half Hour Intervals?
Solution 1
You can try this single query as well, it should return a result set with the totals in 30 minute groupings:
SELECT date, MIN(time) as time, SUM(subtotal) as total
FROM `Receipts`
WHERE `date` = '2012-07-30'
GROUP BY hour(time), floor(minute(time)/30)
To run this efficiently, add a composite index on the date and time columns.
You should get back a result set like:
+---------------------+--------------------+
| time | total |
+---------------------+--------------------+
| 2012-07-30 00:00:00 | 0.000000000 |
| 2012-07-30 00:30:00 | 0.000000000 |
| 2012-07-30 01:00:00 | 0.000000000 |
| 2012-07-30 01:30:00 | 0.000000000 |
| 2012-07-30 02:00:00 | 0.000000000 |
| 2012-07-30 02:30:00 | 0.000000000 |
| 2012-07-30 03:00:00 | 0.000000000 |
| 2012-07-30 03:30:00 | 0.000000000 |
| 2012-07-30 04:00:00 | 0.000000000 |
| 2012-07-30 04:30:00 | 0.000000000 |
| 2012-07-30 05:00:00 | 0.000000000 |
| ...
+---------------------+--------------------+
Solution 2
First, I would use a single DATETIME column, but using a DATE and TIME column will work.
You can do all the work in one pass using a single query:
select date,
hour(`time`) hour_num,
IF(MINUTE(`time`) < 30, 0, 1) interval_num,
min(`time`) interval_begin,
max(`time`) interval_end,
sum(subtotal) sum_subtotal
from receipts
where date='2012-07-31'
group by date, hour_num, interval_num;
Solution 3
UPDATE:
Since you aren't concerned with any "missing" rows, I'm also going to assume (probably wrongly) that you aren't concerned that the query might possibly return rows for periods that are not from 7AM to 12AM. This query will return your specified result set:
SELECT (HOUR(r.time)-7)*2+(MINUTE(r.time) DIV 30) AS i
, SUM(r.subtotal) AS sum_subtotal
FROM Receipts r
GROUP BY i
ORDER BY i
This returns the period index (i) derived from an expression referencing the time
column. For best performance of this query, you probably want to have a "covering" index available, for example:
ON Receipts(`time`,`subtotal`)
If you are going to include an equality predicate on the date
column (which does not appear in your solution, but which does appear in the solution of the "selected" answer, then it would be good to have that column as a leading index in the "covering" index.
ON Receipts(`date`,`time`,`subtotal`)
If you want to ensure that you are not returning any rows for periods before 7AM, then you could simply add a HAVING i >= 0
clause to the query. (Rows for periods before 7AM would generate a negative number for i.)
SELECT (HOUR(r.time)-7)*2+(MINUTE(r.time) DIV 30) AS i
, SUM(r.subtotal) AS sum_subtotal
FROM Receipts r
GROUP BY i
HAVING i >= 0
ORDER BY i
PREVIOUSLY:
I've assumed that you want a result set similar to the one you are currently returning, but in one fell swoop. This query will return the same 33 rows you are currently retrieving, but with an extra column identifying the period (0 - 33). This is as close to your current solution that I could get:
SELECT t.i
, IFNULL(SUM(r.subtotal),0) AS sum_subtotal
FROM (SELECT (d1.i + d2.i + d4.i + d8.i + d16.i + d32.i) AS i
, ADDTIME('07:00:00',SEC_TO_TIME((d1.i+d2.i+d4.i+d8.i+d16.i+d32.i)*1800)) AS b_time
, ADDTIME('07:30:00',SEC_TO_TIME((d1.i+d2.i+d4.i+d8.i+d16.i+d32.i)*1800)) AS e_time
FROM (SELECT 0 i UNION ALL SELECT 1) d1 CROSS
JOIN (SELECT 0 i UNION ALL SELECT 2) d2 CROSS
JOIN (SELECT 0 i UNION ALL SELECT 4) d4 CROSS
JOIN (SELECT 0 i UNION ALL SELECT 8) d8 CROSS
JOIN (SELECT 0 i UNION ALL SELECT 16) d16 CROSS
JOIN (SELECT 0 i UNION ALL SELECT 32) d32
HAVING i <= 33
) t
LEFT
JOIN Receipts r ON r.time >= t.b_time AND r.time < t.e_time
GROUP BY t.i
ORDER BY t.i
Some important notes:
It looks like your current solution may be "missing" rows from Receipts whenever the the seconds is exactly equal to '59' or '00'.
It also looks like you aren't concerned with the date component, you are just getting a single value for all dates. (I may have misread that.) If so, the separation of the DATE and TIME columns helps with this, because you can reference the bare TIME column in your query.
It's easy to add a WHERE clause on the date
column. e.g. to get the subtotal rollups for just a single day e.g. add a WHERE clause before the GROUP BY
.
WHERE r.date = '2011-09-10'
A covering index ON Receipts(time,subtotal)
(if you don't already have a covering index) may help with performance. (If you include an equality predicate on the date column (as in the WHERE clause above, the most suitable covering index would likely be ON Receipts(date,time,subtotal)
.
I've made an assumption that the time
column is of datatype TIME. (If it isn't, then a small adjustment to the query (in the inline view aliased as t
) is probably called for, to have the datatype of the (derived) b_time and e_time columns match the datatype of the time
column in Receipts.
Some of proposed solutions in other answers are not guaranteed to return 33 rows, when there are no rows in Receipts within a given time period. "Missing rows" may not be an issue for you, but it is a frequent issue with timeseries and timeperiod data.
I've made the assumption that you would prefer to have a guarantee of 33 rows returned. The query above returns a subtotal of zero when no rows are found matching a time period. (I note that your current solution will return a NULL in that case. I've gone and wrapped that SUM aggregate in an IFNULL function, so that it will return a 0 when the SUM is NULL.)
So, the inline query aliased as t
is an ugly mess, but it works fast. What it's doing is generating 33 rows, with distinct integer values 0 thru 33. At the same time, it derives a "begin time" and an "end time" that will be used to "match" each period to the time
column on the Receipts
table.
We take care not to wrap the time
column from the Receipts table in any functions, but reference just the bare column. And we want to ensure we don't have any implicit conversion going on (which is why we want the datatypes of b_time and e__time to match. The ADDTIME
and SEC_TO_TIME
functions both return TIME
datatype. (We can't get around doing the matching and the GROUP BY operations.)
The "end time" value for that last period is returned as "24:00:00", and we verify that this is a valid time for matching by running this test:
SELECT MAKETIME(23,59,59) < MAKETIME(24,0,0)
which is successful (returns a 1) so we're good there.
The derived columns (t.b_time
and t.e_time
) could be included in the resultset as well, but they aren't needed to create your array, and it's (likely) more efficient if you don't include them.
And one final note: for optimal performance, it may be beneficial to load the inline view aliased as t
into an actual table (a temporary table would be fine.), and then you could reference the table in place of the inline view. The advantage of doing that is that you could create an index on that table.
PontusTrade
Updated on July 04, 2022Comments
-
PontusTrade almost 2 years
Problem - Retrieve sum of subtotals on a half hour interval efficiently
I am using MySQL and I have a table containing subtotals with different times. I want to retrieve the sum of these sales on a half hour interval from 7 am through 12 am. My current solution (below) works but takes 13 seconds to query about 150,000 records. I intend to have several million records in the future and my current method is too slow.
How I can make this more efficient or if possible replace the PHP component with pure SQL? Also, would it help your solution to be even more efficient if I used Unix timestamps instead of having a date and time column?
Table Name - Receipts
subtotal date time sale_id -------------------------------------------- 6 09/10/2011 07:20:33 1 5 09/10/2011 07:28:22 2 3 09/10/2011 07:40:00 3 5 09/10/2011 08:05:00 4 8 09/10/2011 08:44:00 5 ............... 10 09/10/2011 18:40:00 6 5 09/10/2011 23:05:00 7
Desired Result
An array like this:
- Half hour 1 ::: (7:00 to 7:30) => Sum of Subtotal is 11
- Half hour 2 ::: (7:30 to 8:00) => Sum of Subtotal is 3
- Half hour 3 ::: (8:00 to 8:30) => Sum of Subtotal is 5
- Half hour 4 ::: (8:30 to 9:00) => Sum of Subtotal is 8
Current Method
The current way uses a for loop which starts at 7 am and increments 1800 seconds, equivalent to a half hour. As a result, this makes about 34 queries to the database.
for($n = strtotime("07:00:00"), $e = strtotime("23:59:59"); $n <= $e; $n += 1800) { $timeA = date("H:i:s", $n); $timeB = date("H:i:s", $n+1799); $query = $mySQL-> query ("SELECT SUM(subtotal) FROM Receipts WHERE time > '$timeA' AND time < '$timeB'"); while ($row = $query-> fetch_object()) { $sum[] = $row; } }
Current Output
Output is just an array where:
- [0] represents 7 am to 7:30 am
- [1] represents 7:30 am to 8:00 am
[33] represents 11:30 pm to 11:59:59 pm.
array ("0" => 10000, "1" => 20000, .............. "33" => 5000);