Optimizing multiple joins

Solution 1

Building on Craig Young's suggestions, here is the amended query, which runs in ~1.8 seconds for the data set I'm working on. That is a slight improvement on the original ~2.0s and a huge improvement on Craig's version, which took ~22s.

SELECT
    p.period,
    /* The pivot technique... */
    SUM(CASE envelope_command WHEN 1 THEN body_size ELSE 0 END) AS Outbound,
    SUM(CASE envelope_command WHEN 2 THEN body_size ELSE 0 END) AS Inbound
FROM
(
    /* Get days range */
    SELECT date '2009-10-01' + day AS period
    FROM generate_series(0, date '2009-10-31' - date '2009-10-01') AS day
) p
    /* Join message information */
    LEFT OUTER JOIN
    (
        SELECT b.body_size, b.body_time::date, e.envelope_command
        FROM body AS b 
            INNER JOIN envelope e ON e.message_id = b.message_id 
        WHERE
            e.envelope_command IN (2, 1)
            AND b.body_time::date BETWEEN (date '2009-10-01') AND (date '2009-10-31')
    ) d ON d.body_time = p.period
GROUP BY p.period
ORDER BY p.period

Solution 2

There are always 2 things to consider when optimising queries:

  • What indexes can be used (you may need to create indexes)
  • How the query is written (you may need to change the query so that the query optimiser can find appropriate indexes and does not have to re-read data redundantly)
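
On the first point, the only index mentioned anywhere in this discussion is a plain btree on body.body_time (see the comments below); the wider covering indexes suggested later in the comments would look roughly like the sketch that follows. The names and column orders of the new indexes are assumptions to test against the execution plan, not a given.

/* Already in place according to the comments below */
CREATE INDEX body_time_idx ON body USING btree (body_time);

/* Hypothetical covering variants suggested in the comments; create them,
   re-run the query, and keep whichever ones the planner actually uses */
CREATE INDEX body_time_msg_size_idx ON body (body_time, message_id, body_size);
CREATE INDEX envelope_command_idx ON envelope (envelope_command);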

A few observations:

  • You are performing date manipulations before you join your dates. As a general rule this will prevent a query optimiser from using an index even if it exists. You should try to write your expressions in such a way that indexed columns exist unaltered on one side of the expression (a small before/after sketch follows this list).

  • Your subqueries are filtering to the same date range as generate_series. This is a duplication, and it limits the optimiser's ability to choose the most efficient plan. I suspect it may have been added to improve performance because the optimiser was unable to use an index on the date column (body_time)?

  • NOTE: We would actually very much like to use an index on Body.body_time

  • ORDER BY within the subqueries is at best redundant. At worst it could force the query optimiser to sort the result set before joining, which is not necessarily good for the query plan. Rather, apply ordering only once, right at the end, for final display.

  • Use of LEFT JOIN in your subqueries is inappropriate. Assuming you're using ANSI conventions for NULL behaviour (and you should be), any outer joins to envelope would return rows with envelope_command = NULL, and those rows would then be excluded anyway by the envelope_command = 1 (or = 2) condition, so the join behaves as an INNER JOIN.

  • Subqueries o and i are almost identical save for the envelope_command value. This forces the optimiser to scan the same underlying tables twice. You can use a pivot table technique to join to the data once and split the values into 2 columns.
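
To make the first observation concrete, here is a minimal before/after sketch against the body table from the question (a standalone probe for one arbitrary day, not the full report). The cast in the first form is what generally stops a plain index on body_time from being used; the half-open range in the second leaves the column untouched:

/* Non-sargable: the cast on body_time hides the column, so a plain
   index on body_time generally cannot be used for this predicate */
SELECT SUM(b.body_size)
FROM body AS b
WHERE b.body_time::date = date '2009-10-01';

/* Sargable: body_time appears unaltered on one side of each comparison,
   so an index on body_time remains usable */
SELECT SUM(b.body_size)
FROM body AS b
WHERE b.body_time >= date '2009-10-01'
  AND b.body_time < date '2009-10-01' + INTERVAL '1 DAY';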

Try the following which uses the pivot technique:

SELECT  p.period,
        /*The pivot technique in action...*/
        SUM(
        CASE WHEN envelope_command = 1 THEN body_size
        ELSE 0
        END) AS Outbound,
        SUM(
        CASE WHEN envelope_command = 2 THEN body_size
        ELSE 0
        END) AS Inbound
FROM    (
        SELECT  date '2009-10-01' + s.day AS period
        FROM    generate_series(0, date '2009-10-31' - date '2009-10-01') AS s(day)
        ) AS p 
        /*The left JOIN is justified to ensure ALL generated dates are returned
          Also: it joins to a subquery, else the JOIN to envelope _could_ exclude some generated dates*/
        LEFT OUTER JOIN (
        SELECT  b.body_size,
                b.body_time,
                e.envelope_command
        FROM    body AS b 
                INNER JOIN envelope e 
                  ON e.message_id = b.message_id 
        WHERE   envelope_command IN (1, 2)
        ) d
          /*The expressions below allow the optimiser to use an index on body_time if
            the statistics indicate it would be beneficial*/
          ON d.body_time >= p.period
         AND d.body_time < p.period + INTERVAL '1 DAY'
GROUP BY p.period
ORDER BY p.period

EDIT: Added filter suggested by Tom H.
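
Whether the optimiser actually chooses an index on body_time is something only the execution plan can confirm. One quick, low-cost check (the index name body_time_idx is taken from the comments below) is to run the date-range predicate on its own under EXPLAIN ANALYZE and see what kind of scan the planner picks:

/* Standalone check: an index-based scan on body_time_idx (Index Scan or
   Bitmap Index Scan) suggests the half-open range is index-friendly;
   a Seq Scan may mean the planner judged a whole month too wide for the
   index to pay off, or that statistics are stale (try ANALYZE body). */
EXPLAIN ANALYZE
SELECT count(*)
FROM body AS b
WHERE b.body_time >= date '2009-10-01'
  AND b.body_time < date '2009-10-31' + INTERVAL '1 DAY';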

Solution 3

I uninstalled my PostgreSQL server a couple of days ago, so you'll likely have to play around with this, but hopefully it's a good start for you.

The keys are:

  1. You shouldn't need the subqueries - just do the direct joins and aggregate
  2. You should be able to use INNER JOINs, which are typically more performant than OUTER JOINs

If nothing else, I think that the query below is a bit clearer.

I used a calendar table in my query, but you can replace that with the generate_series as you were using it.

Also, depending on indexing, it might be better to compare body_time with >= and < rather than pulling out the date part and comparing. I don't know enough about PostgreSQL to know how it works behind the scenes, so I would try both approaches to see which the server can optimize better. In pseudo-code you would be doing: body_time >= date (time=midnight) AND body_time < date + 1 (time=midnight).
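
Putting those last two points together, one possible shape for the substitution is sketched below: the generate_series derived table from the question stands in for the Calendar table, and the per-day match is written as the half-open range just described. Table and column names are taken from the question; the inbound/outbound split is left out to keep the sketch focused, and it has not been run against a live server:

/* Sketch only: generate_series in place of the Calendar table, with the
   date match written so body_time is left unaltered */
SELECT
    CAL.calendar_date AS period,
    SUM(B.body_size) AS total_bytes
FROM
(
    SELECT date '2009-10-01' + s.day AS calendar_date
    FROM generate_series(0, date '2009-10-31' - date '2009-10-01') AS s(day)
) AS CAL
INNER JOIN Body B ON
    B.body_time >= CAL.calendar_date AND
    B.body_time < CAL.calendar_date + INTERVAL '1 DAY'
GROUP BY
    CAL.calendar_date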

SELECT
    CAL.calendar_date AS period,
    SUM(OB.body_size) AS outbound,
    SUM(IB.body_size) AS inbound
FROM
    Calendar CAL
INNER JOIN Body OB ON
    OB.body_time::date = CAL.calendar_date
INNER JOIN Envelope OE ON
    OE.message_id = OB.message_id AND
    OE.envelope_command = 1
INNER JOIN Body IB ON
    IB.body_time::date = CAL.calendar_date
INNER JOIN Envelope IE ON
    IE.message_id = IB.message_id AND
    IE.envelope_command = 2
GROUP BY
    CAL.calendar_date

Comments

  • salathe
    salathe almost 4 years

    I'm trying to figure out a way to speed up a particularly cumbersome query which aggregates some data by date across a couple of tables. The full (ugly) query is below along with an EXPLAIN ANALYZE to show just how horrible it is.

    If anyone could take a peek and see if they can spot any major issues (which is likely, I'm not a Postgres guy) that would be superb.

    So here goes. The query is:

    SELECT 
     to_char(p.period, 'DD/MM/YY') as period,
     coalesce(o.value, 0) AS outbound,
     coalesce(i.value, 0) AS inbound
    FROM (
     SELECT
      date '2009-10-01' + s.day 
      AS period 
      FROM generate_series(0, date '2009-10-31' - date '2009-10-01') AS s(day)
    ) AS p 
    LEFT OUTER JOIN(
     SELECT
      SUM(b.body_size) AS value, 
      b.body_time::date AS period 
     FROM body AS b 
     LEFT JOIN 
      envelope e ON e.message_id = b.message_id 
     WHERE 
      e.envelope_command = 1 
      AND b.body_time BETWEEN '2009-10-01' 
      AND (date '2009-10-31' + INTERVAL '1 DAY') 
     GROUP BY period 
     ORDER BY period
    ) AS o ON p.period = o.period
    LEFT OUTER JOIN( 
     SELECT 
      SUM(b.body_size) AS value, 
      b.body_time::date AS period 
     FROM body AS b 
     LEFT JOIN 
      envelope e ON e.message_id = b.message_id 
     WHERE 
      e.envelope_command = 2 
      AND b.body_time BETWEEN '2009-10-01' 
      AND (date '2009-10-31' + INTERVAL '1 DAY') 
     GROUP BY period 
     ORDER BY period
    ) AS i ON p.period = i.period 
    

    The EXPLAIN ANALYZE output can be found on explain.depesz.com

    Any comments or questions are appreciated.

    Cheers

  • Disillusioned
    Disillusioned over 14 years
    In general INNER JOINS should be preferred, but occasionally you want all rows from one set even if there are no matches in a joined set. It seems to me OP wanted ALL generated dates, and used the COALESCE function to ensure that Inbound and Outbound would display 0 if there were no matching rows. Hence your joins to Body should be LEFT OUTER to meet that requirement. Unfortunately then, (if I'm not mistaken) the inner joins to Envelope could still eliminate some date rows. The subqueries serve to reduce the confusion around joining. However, OP did err in using LEFT JOINS to Envelope.
  • Tom H
    Tom H over 14 years
    Whoops, you're right there Craig. Do you know if PostgreSQL supports nested JOINs? My guess (untested) is that nested joins would perform better than the subqueries.
  • Tom H
    Tom H over 14 years
    Good use of CASE to limit the joins. You could add a further constraint on envelope, though: envelope_command IN (1, 2)
  • Disillusioned
    Disillusioned over 14 years
    Aah yes, quite correct; that would improve performance because it would eliminate rows that would otherwise be introduced into the aggregation with zero values in both columns. (I.e. would eliminate rows that do not affect the result at all)
  • salathe
    salathe over 14 years
    Using this suggested query, execution time goes up from the original ~2 seconds to ~23 seconds! I'll update when I have more information.
  • salathe
    salathe over 14 years
    Adding an extra condition after envelope_command IN (1, 2) of AND b.body_time BETWEEN '2009-10-01' AND (date '2009-10-31' + INTERVAL '1 DAY') takes the time down to ~6.5 seconds. Conclusion: still more work to be done.
  • salathe
    salathe over 14 years
    Building on (or making a mess of) the above answer, the currently most performant solution uses the pivot technique to save on double-joining and casting to date. It is too long to post in a comment so will be in a separate answer.
  • Disillusioned
    Disillusioned over 14 years
    Do you have an index on Body.Body_time? Is it clustered or not? What other columns are included in the index? What is the order of those columns in the index?
  • salathe
    salathe over 14 years
    There is an index on body.body_time, with no other columns included in it. The index is simply CREATE INDEX body_time_idx ON body USING btree (body_time);
  • Disillusioned
    Disillusioned over 14 years
    As I pointed out in my answer, optimisation must be approached from two angles: choosing indexes, and writing your query in such a way that the optimiser will consider your index. NB: you should not actually 'tell' the optimiser to use a specific index using query hints; you should let the optimiser make the decision based on table statistics. You'll want to create indexes, run the query, and examine the execution plan to see what the optimiser did. The most important information in the execution plan is: what indexes did the optimiser choose? (continued...)
  • Disillusioned
    Disillusioned over 14 years
    (... continued) If a particular version of your query doesn't use an index you expected, you need to try to figure out why. Similarly, even if the plan does use the indexes you expect but still takes 'too long', you again need to figure out why. This brings me to the second piece of information you'll look for in your execution plan: how much data is processed at each step, and what kind of processing is done? You can try the following indexes: Body (body_time, message_id, body_size) or (message_id, body_time, body_size). Also, if it's selective enough, try Envelope (envelope_command).
  • salathe
    salathe over 14 years
    Thanks Craig for your continued input. I'm going to mark this solved for now as you've been a great help and there are other bigger fish to fry. Cheers.