Slow query on "UNION ALL" view

sql performance postgresql indexing union-all

Solution 1

This seems to be a case of pilot error. The "v" query plan selects from at least 5 different tables.

Are you sure you are connected to the right database? Maybe there are some funky search_path settings? Maybe t1 and t2 are actually views (possibly in a different schema)? Maybe you are somehow selecting from the wrong view?

Edited after clarification:

You are relying on a fairly new feature called "join removal": http://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.0#Join_Removal

http://rhaas.blogspot.com/2010/06/why-join-removal-is-cool.html

It appears that the feature does not kick in when UNION ALL is involved. You probably have to rewrite the view so that it uses only the two tables that are actually required.

Another edit: you appear to be using an aggregate (like "select count(*) from v" rather than "select * from v"), which can get a vastly different plan when join removal applies. I guess we won't get very far unless you post the actual queries, the view and table definitions, and the plans used.
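To make the join-removal point concrete, here is a minimal sketch. The tables `sensors`, `readings` and `readings_archive` and both views are hypothetical stand-ins, not the asker's schema. It assumes PostgreSQL 9.0 or later, and the actual plans depend on version, statistics and data, so compare the EXPLAIN output yourself rather than taking the comments as guaranteed.

```sql
-- Hypothetical schema for illustration only.
CREATE TABLE sensors (
    id    integer PRIMARY KEY,   -- uniqueness on the join key is what makes removal possible
    label text
);

CREATE TABLE readings (
    id        integer PRIMARY KEY,
    sensor_id integer,
    ts        timestamp NOT NULL
);
CREATE INDEX readings_ts_idx ON readings (ts);

CREATE TABLE readings_archive (LIKE readings INCLUDING ALL);

-- A view that LEFT JOINs a lookup table just to expose one extra column.
CREATE VIEW readings_labeled AS
SELECT r.id, r.ts, s.label
FROM readings r
LEFT JOIN sensors s ON s.id = r.sensor_id;

-- No column of sensors is needed here, so the planner is allowed to drop the join.
EXPLAIN
SELECT count(*) FROM readings_labeled
WHERE ts >= '2012-01-01' AND ts < '2012-02-01';

-- label is needed here, so the join has to stay.
EXPLAIN
SELECT ts, label FROM readings_labeled
WHERE ts >= '2012-01-01' AND ts < '2012-02-01';

-- One way to sidestep the issue under UNION ALL: a narrow view over the
-- base tables only, so there is no join left to remove.
CREATE VIEW readings_times AS
SELECT ts FROM readings
UNION ALL
SELECT ts FROM readings_archive;
```

If the first EXPLAIN shows only a scan on `readings` while the same query against a UNION ALL view still shows the joins, that would match the behaviour described above.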

Solution 2

I believe your query is being executed similarly to:

(
   ( SELECT time, etc. FROM t1 // #1... )
   UNION ALL
   ( SELECT time, etc. FROM t2 // #2... )
)
WHERE time >= ... AND time < ...

which the optimizer is having difficulty optimizing, i.e. it is doing the UNION ALL first and applying the WHERE clause afterwards, whereas you want the WHERE clause applied to each branch before the UNION ALL.

Couldn't you put your WHERE clause in the CREATE VIEW?

CREATE VIEW v AS
( SELECT time, etc. FROM t1  WHERE time >= ... AND time < ... )
UNION ALL
( SELECT time, etc. FROM t2  WHERE time >= ... AND time < ... )

Alternatively, if the view cannot contain the WHERE clause, perhaps you can keep the two separate views and do the UNION ALL with the WHERE clause at the point where you need it:

CREATE VIEW v1 AS
SELECT time, etc. FROM t1 // #1...

CREATE VIEW v2 AS
SELECT time, etc. FROM t2 // #2...

( SELECT * FROM v1 WHERE time >= ... AND time < ... )
UNION ALL
( SELECT * FROM v2 WHERE time >= ... AND time < ... )
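Whether the WHERE clause really is applied after the UNION ALL can be checked from the plan instead of guessed. The sketch below uses hypothetical stand-ins (`t1_demo`, `t2_demo`, `v_demo`, column `ts`) because the real definitions were not posted; the date literals are arbitrary.

```sql
-- Hypothetical stand-ins for t1, t2 and v.
CREATE TABLE t1_demo (ts timestamp NOT NULL, payload text);
CREATE TABLE t2_demo (ts timestamp NOT NULL, payload text);
CREATE INDEX t1_demo_ts_idx ON t1_demo (ts);
CREATE INDEX t2_demo_ts_idx ON t2_demo (ts);

CREATE VIEW v_demo AS
SELECT ts, payload FROM t1_demo
UNION ALL
SELECT ts, payload FROM t2_demo;

-- EXPLAIN (ANALYZE, BUFFERS) runs the query and reports where the time goes.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM v_demo
WHERE ts >= '2012-01-01' AND ts < '2012-02-01';

-- What to look for in the output:
--   * pushed down: each child of the Append node carries its own
--     Index Cond or Filter on ts;
--   * not pushed down: a single Filter on ts sits above the Append.
```

If each branch already shows its own condition on the time column, the slowdown lies elsewhere in the plan and moving the WHERE clause by hand is unlikely to help.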

Solution 3

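I do not know Postgres, but some RDBMSs handle a pair of comparison operators worse than BETWEEN when indexes are involved, so I would try BETWEEN. Keep in mind that BETWEEN includes both endpoints, so the upper bound behaves like <= rather than <.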

SELECT ... FROM v WHERE time BETWEEN ... AND ...
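Before swapping the operators, it is worth seeing how the two forms differ. The PostgreSQL documentation describes `a BETWEEN x AND y` as equivalent to `a >= x AND a <= y`, so on that system a different plan should not be expected, and the result can differ from the half-open range in the question. A small self-contained sketch with made-up dates:

```sql
-- Compare BETWEEN with the half-open range used in the question.
-- The three dates are arbitrary sample values.
SELECT d,
       d BETWEEN DATE '2012-01-01' AND DATE '2012-02-01'  AS in_between,
       d >= DATE '2012-01-01' AND d < DATE '2012-02-01'   AS in_half_open
FROM (VALUES (DATE '2011-12-31'),
             (DATE '2012-01-01'),
             (DATE '2012-02-01')) AS s(d)
ORDER BY d;
-- Only the row for 2012-02-01 differs: BETWEEN includes the upper bound,
-- the half-open range does not.
```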

Solution 4

One possibility would be to issue new SQL dynamically on each call instead of creating a view, and to put the WHERE clause into each SELECT of the union query:

SELECT time, etc. FROM t1
    WHERE time >= ... AND time < ...
UNION ALL
SELECT time, etc. FROM t2
    WHERE time >= ... AND time < ...

EDIT:

Can you use a parametrized function?

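CREATE OR REPLACE FUNCTION CallMyView(from_time date, to_time date)
RETURNS TABLE(d date, etc.)
AS $$
    BEGIN
        -- The parameters are named from_time/to_time so that they
        -- cannot be confused with the tables t1 and t2.
        RETURN QUERY
            SELECT time, etc. FROM t1
                WHERE time >= from_time AND time < to_time
            UNION ALL
            SELECT time, etc. FROM t2
                WHERE time >= from_time AND time < to_time;
    END;
$$ LANGUAGE plpgsql;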

Call

SELECT * FROM CallMyView(..., ...);
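For comparison, the same idea can be written as a plain SQL-language function, which avoids PL/pgSQL entirely. This is only a sketch under assumptions: `t1_demo`, `t2_demo`, the column names and the function name `v_between` are hypothetical, and the parameters are referenced positionally as `$1` and `$2` so the body also works on versions that do not allow parameter names inside SQL functions.

```sql
-- Hypothetical tables standing in for t1 and t2.
CREATE TABLE IF NOT EXISTS t1_demo (ts timestamp NOT NULL, payload text);
CREATE TABLE IF NOT EXISTS t2_demo (ts timestamp NOT NULL, payload text);

-- $1 is the inclusive lower bound, $2 the exclusive upper bound.
CREATE OR REPLACE FUNCTION v_between(timestamp, timestamp)
RETURNS TABLE (ts timestamp, payload text)
LANGUAGE sql STABLE
AS $$
    SELECT a.ts, a.payload FROM t1_demo a
        WHERE a.ts >= $1 AND a.ts < $2
    UNION ALL
    SELECT b.ts, b.payload FROM t2_demo b
        WHERE b.ts >= $1 AND b.ts < $2;
$$;

-- Usage with arbitrary sample bounds:
SELECT * FROM v_between('2012-01-01', '2012-02-01');
```

The column references are qualified with the aliases `a` and `b` so they cannot be mistaken for the output columns declared in RETURNS TABLE.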

Solution 5

Combine the two tables. Add a column to indicate original table. If necessary, replace the original table names with views that select just the relevant part. Problem solved!

Looking into the superclass/subclass db design pattern could be of use to you.
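A minimal sketch of that combined-table layout follows. All names (`events_all`, `source`, `ts`, `payload`, `t1_compat`, `t2_compat`) are hypothetical, since the real table definitions were not posted, and migrating existing data is left out.

```sql
-- One table holding the rows of both original tables.
CREATE TABLE events_all (
    source  text      NOT NULL CHECK (source IN ('t1', 't2')),  -- which table the row came from
    ts      timestamp NOT NULL,
    payload text
);

-- A single index now serves the time-range query for both sources.
CREATE INDEX events_all_ts_idx ON events_all (ts);

-- Compatibility views that expose just the relevant part under separate names.
CREATE VIEW t1_compat AS
SELECT ts, payload FROM events_all WHERE source = 't1';

CREATE VIEW t2_compat AS
SELECT ts, payload FROM events_all WHERE source = 't2';

-- The former UNION ALL query becomes a plain range scan (sample bounds).
SELECT source, ts, payload
FROM events_all
WHERE ts >= '2012-01-01' AND ts < '2012-02-01';
```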

Author by

Mladen Jablanović

Updated on May 07, 2020

Comments

Mladen Jablanović almost 4 years

I have a DB view which basically consists of two SELECT queries with UNION ALL, like this:

CREATE VIEW v AS
SELECT time, etc. FROM t1 // #1...
UNION ALL
SELECT time, etc. FROM t2 // #2...

The problem is that selects of the form

SELECT ... FROM v WHERE time >= ... AND time < ...

perform really, really slowly on it.

Both SELECT #1 and #2 are decently fast, properly indexed and so on: when I create views v1 and v2 like:

CREATE VIEW v1 AS
SELECT time, etc. FROM t1 // #1...

CREATE VIEW v2 AS
SELECT time, etc. FROM t2 // #2...

the same SELECT, with the same WHERE condition as above, works fine on each of them individually.

Any ideas about where might be the problem and how to solve it?

(Just to mention, it's one of the recent Postgres versions.)

Edit: Adding anonymized query plans (thanks to @filiprem for the link to an awesome tool):

v1:

Aggregate  (cost=9825.510..9825.520 rows=1 width=53) (actual time=59.995..59.995 rows=1 loops=1)
  ->  Index Scan using delta on echo alpha  (cost=0.000..9815.880 rows=3850 width=53) (actual time=0.039..53.418 rows=33122 loops=1)
          Index Cond: (("juliet" >= 'seven'::uniform bravo_victor oscar whiskey) AND ("juliet" <= 'november'::uniform bravo_victor oscar whiskey))
          Filter: ((NOT victor) AND ((bravo_sierra five NULL) OR ((bravo_sierra)::golf <> 'india'::golf)))

v2:

Aggregate  (cost=15.470..15.480 rows=1 width=33) (actual time=0.231..0.231 rows=1 loops=1)
  ->  Index Scan using yankee on six charlie  (cost=0.000..15.220 rows=99 width=33) (actual time=0.035..0.186 rows=140 loops=1)
          Index Cond: (("juliet" >= 'seven'::uniform bravo oscar whiskey) AND ("juliet" <= 'november'::uniform bravo oscar whiskey))
          Filter: (NOT victor)

v:

Aggregate  (cost=47181.850..47181.860 rows=1 width=0) (actual time=37317.291..37317.291 rows=1 loops=1)
  ->  Append  (cost=42.170..47132.480 rows=3949 width=97) (actual time=1.277..37304.453 rows=33262 loops=1)
        ->  Nested Loop Left Join  (cost=42.170..47052.250 rows=3850 width=99) (actual time=1.275..37288.465 rows=33122 loops=1)
              ->  Hash Left Join  (cost=42.170..9910.990 rows=3850 width=115) (actual time=1.123..117.797 rows=33122 loops=1)
                      Hash Cond: ((alpha_seven.two)::golf = (quebec_three.two)::golf)
                    ->  Index Scan using delta on echo alpha_seven  (cost=0.000..9815.880 rows=3850 width=132) (actual time=0.038..77.866 rows=33122 loops=1)
                            Index Cond: (("juliet" >= 'seven'::uniform bravo_victor oscar whiskey_two) AND ("juliet" <= 'november'::uniform bravo_victor oscar whiskey_two))
                            Filter: ((NOT victor) AND ((bravo_sierra five NULL) OR ((bravo_sierra)::golf <> 'india'::golf)))
                    ->  Hash  (cost=30.410..30.410 rows=941 width=49) (actual time=1.068..1.068 rows=941 loops=1)
                            Buckets: 1024  Batches: 1  Memory Usage: 75kB
                          ->  Seq Scan on alpha_india quebec_three  (cost=0.000..30.410 rows=941 width=49) (actual time=0.010..0.486 rows=941 loops=1)
              ->  Index Scan using mike on hotel quebec_sierra  (cost=0.000..9.630 rows=1 width=24) (actual time=1.112..1.119 rows=1 loops=33122)
                      Index Cond: ((alpha_seven.zulu)::golf = (quebec_sierra.zulu)::golf)
        ->  Subquery Scan on "*SELECT* 2"  (cost=34.080..41.730 rows=99 width=38) (actual time=1.081..1.951 rows=140 loops=1)
              ->  Merge Right Join  (cost=34.080..40.740 rows=99 width=38) (actual time=1.080..1.872 rows=140 loops=1)
                      Merge Cond: ((quebec_three.two)::golf = (charlie.two)::golf)
                    ->  Index Scan using whiskey_golf on alpha_india quebec_three  (cost=0.000..174.220 rows=941 width=49) (actual time=0.017..0.122 rows=105 loops=1)
                    ->  Sort  (cost=18.500..18.750 rows=99 width=55) (actual time=0.915..0.952 rows=140 loops=1)
                            Sort Key: charlie.two
                            Sort Method:  quicksort  Memory: 44kB
                          ->  Index Scan using yankee on six charlie  (cost=0.000..15.220 rows=99 width=55) (actual time=0.022..0.175 rows=140 loops=1)
                                  Index Cond: (("juliet" >= 'seven'::uniform bravo_victor oscar whiskey_two) AND ("juliet" <= 'november'::uniform bravo_victor oscar whiskey_two))
                                  Filter: (NOT victor)

juliet is the anonymized name of the time column.

stian.net about 12 years

column "time" in your view is not indexed. You will have to manually index that column in your view. Take a look at the execution plan
Admin about 12 years

Will queries against this view always be constrained by time?
Mladen Jablanović about 12 years

@stian.net: Not sure what you suggest. I can't add indexes on view columns, and both underlying tables are properly indexed on time field(s).
Mladen Jablanović about 12 years

@MarkBannister: Yes. I would like to avoid creating materialized view or whatever it is called, if that was going to be a suggestion. :)
filiprem about 12 years

Can you tell us what "really slow" means? What are the times for queries #1, #2, and #3? You could just show the output of EXPLAIN (ANALYZE, BUFFERS) for all of the queries.
Mladen Jablanović about 12 years

@filiprem: v1 - 60ms, v2 - 0.2ms, v - 37317ms. I am not sure whether I am allowed to disclose actual table names and fields, I might replace them with generic names and paste here later today.
filiprem about 12 years

Mladen, explain output formatter and anonymizer -> explain.depesz.com
Admin about 12 years

@MladenJablanović: I was thinking more in terms of adding hints to the view (hints are normally deprecated in views, but can be valid where the view is only to be accessed by a specific path), but then I discovered that PostgreSQL doesn't use hints - see here: stackoverflow.com/questions/309786/… and here: wiki.postgresql.org/wiki/OptimizerHintsDiscussion for related discussions, including alternatives to hints.
DRapp about 12 years

On the 7th day with no apparent answer. Could you actually post the two actual queries too, not just the query plan itself...
Walter Mitty about 12 years

What happens if your view is on UNION DISTINCT instead of UNION ALL? Does it perform faster? Does it give wrong results?

Mladen Jablanović about 12 years

I doubt that this is the case here: the execution plan for v shows that both subqueries are constrained by the time field (juliet), so I'm pretty sure no huge temp table is created with the time constraint applied to it afterwards.
Mladen Jablanović about 12 years

v does indeed query more than 2 different tables, as v1 and v2 each query more than 2 different tables too (for evaluating various columns). It just seems that these columns are not evaluated when pulling from v1 and v2 individually, but are when querying v.
Mladen Jablanović about 12 years

This doesn't work for me, I definitely need "one view to rule them all". :)
Mladen Jablanović about 12 years

Thanks, that seems to be leading in the right direction. I will try to provide more info on what these queries looked like.
bjan about 12 years

"I'm pretty sure that there is no huge temp table": you might be correct, but only a DBA or someone who knows exactly what is going on behind the scenes can confirm this. Let the answer come, the one that explains exactly why query v is taking so long.
gpeche about 12 years

Oracle >= 8i with the cost-based optimizer (default) does NOT usually do what you say. It will do it if the optimizer thinks it is the best / only option, but that does not happen very often.
maniek about 12 years

You might be in luck, look at this commit from 4 days ago: git.postgresql.org/gitweb/… - it looks relevant. If you are on 9.1, wait until 9.1.3 is released, then upgrade.
Erwin Brandstetter about 12 years

@maniek: I actually doubt his luck. Bug #6416 concerns indexes on expressions and there was no mention of that in the question.
bjan about 12 years

@gpeche I clearly mentioned that I could not post a comment, so I posted it as an answer, which means it was not an exact answer. I wrote "Oracle would work", where "would" means not ALWAYS, and you said the same with "does NOT usually". I posted my answer (comment) on Jan 30, a delay of 3 days for a question with a 200 bounty! That is why I tried to give @Mladen a clue (I think no one would offer a bounty of 200 unless they were in a critical situation), but now it looks like I should not have. I should post an answer only if I am 100% sure. Let's wait for an exact answer.
Mladen Jablanović about 12 years

@maniek: You were right, for the execution plans pasted above I was using COUNT(*). I won't have time in the next couple of days to extract the actual queries we are using from the application itself, but I will do it eventually. Is "join removal" the reason why not all of the tables from the JOINs are used when querying v1 and v2, but all of them are used when querying v?
maniek about 12 years

@MladenJablanović: Join removal is a feature that can, in some limited cases, figure out that joining to a table is not necessary. For example, consider the query: select t1.id from t1 left join t2 on t2.t1_id=t1.id . When there is a unique index on t2.t1_id, the join to t2 is redundant, and the planner can figure that out. But the cases where the planner can do so are limited, and my reasoning is that it can't prove that the joins are redundant in the more complex "v" view. This is a case that could be improved in later Postgres versions.
bjan about 12 years

@MladenJablanović: Part 1: Compared to v1 and v2, the explain plan of v includes additional cost for Sort, Merge Right Join, Subquery Scan, Hash, Hash Left Join, Nested Loop Left Join and Append, so it is obvious that this query is slower because of these additional costs. Now the question is why such an execution plan is generated when these steps are not required for the given query. It might be a heuristic of the optimizer, or it might be there for the 80:20 rule, or anything else ...
bjan about 12 years

@MladenJablanović: Part 2: This is not clear in the 9.1 docs. I could not find it at first glance in "The design and implementation of the POSTGRES query optimizer" or "The PostgreSQL Optimizer Exposed" either. I hope you can find some clue in these sources and get the exact answer to your question yourself :)
Ondřej Bouda over 9 years

I didn't believe it would help until I actually tried it. Now, when the WHERE condition touches the same column from the same table, the query is much faster. Thanks for this tip!