Query with a UNION sub-query takes a very long time

13,884

Solution 1

It turns out the problem was due to one of the indexes ... tblFavourites contained two foreign keys to the primary key (uid) in tblUser:

userId
otherUserId

both columns had the same definition and same indexes, but I discovered that swapping userId for otherUserId in the original query solved the problem.

I ran:

ALTER INDEX ALL ON tblFavourites REBUILD

... and the problem went away. The query now executes almost instantly.

I don't know too much about what goes on behind the scenes in Sql Server/Azure ... but I can only imagine that it was a damaged index or something? I update statistics frequently, but that had no effect.

Thanks!

---- UPDATE

The above was not fully correct. It did fix the problem for around 20 minutes, then it returned. I have been in touch with Microsoft support for several days and it seems the problem is to do with the tempDB. They are working on a solution at their end.

Solution 2

UNION generate unique values, so the DBMS engine makes sorts. You can use safely UNION ALL in this case.

Solution 3

UNION is really doing a DISTINCT on all fields in the combined data set. It filters out dupes in the final results.

Is Uid indexed? If not it may take a long time as the query engine:

  • Generates the first result set
  • Generates the second result set
  • Filters out all the dupes (which is half the records) in a hash table

If duplicates aren't a concern (and using IN means they won't be) then use UNION ALL which removes the expensive Sort/Filter step.

Solution 4

UNION's are usually implemented via temporary in-memory tables. You're essentially copying your tblUser two times into memory, WITH NO INDEX. Then every row in tblFavourites incur a complete table scan over 200,000 rows - that's 200Kx200K=40 billion double-row scans (because the query engine must get the uid from both table rows)

If your tblUser has an index on uid (which is definitely true because all tables in SQL Azure must have a clustered index), then each row in tblFavourites incurs a very fast index lookup, resulting in only 200Kxlog(100K) =200Kx17 = 200K row scans, each with 17 b-tree index comparisons (which is much faster than reading the uid from a row on a data page), so it should equate to roughly 200Kx(3-4) or 1 million double-row scans. I believe newer versions of SQL server may also build a temp hash table containing just the uid's, so essentially it gets down to 200K row scans (assuming hash table lookup to be trivial).

You should also generate your query plan to check.

Essentially, the non-UNION query runs around 500,000 times faster if tblUser has an index (should be on SQL Azure).

Share:
13,884
Steven Elliott
Author by

Steven Elliott

Updated on June 08, 2022

Comments

  • Steven Elliott
    Steven Elliott almost 2 years

    I've been having an odd problem with some queries that depend on a sub query. They run lightning fast, until I use a UNION statement in the sub query. Then they run endlessly, I've given after 10 minutes. The scenario I'm describing now isn't the original one I started with, but I think it cuts out a lot of possible problems yet yields the same problem. So even though it's a pointless query, bear with me!

    I have a table:

    tblUser - 100,000 rows
    tblFavourites - 200,000 rows
    

    If I execute:

    SELECT COUNT(*) 
    FROM tblFavourites 
    WHERE userID NOT IN (SELECT uid FROM tblUser);  
    

    … then it runs in under a second. However, if I modify it so that the sub query has a UNION, it will run for at least 10 minutes (before I give up!)

    SELECT COUNT(*) 
    FROM tblFavourites 
    WHERE userID NOT IN (SELECT uid FROM tblUser UNION SELECT uid FROM tblUser);  
    

    A pointless change, but it should yield the same result and I don't see why it should take any longer?

    Putting the sub-query into a view and calling that instead has the same effect.

    Any ideas why this would be? I'm using SQL Azure.


    Problem solved. See my answer below.


  • Steven Elliott
    Steven Elliott almost 13 years
    Thanks for the quick response. I changed it to UNION ALL and it was still running more than 5 minutes later. However, I downloaded the database to my local SQL Server 2008 and the problem has disappeared. The query with the UNION runs almost as quickly as without. I then ran the query on a separate SQL Azure DB and had the same problem. So it looks like a SQL Azure issue. I'll contact Azure support. Thanks!
  • Stephen Chung
    Stephen Chung almost 13 years
    How can you have a foreign key from tblFavourites.userID to tblUser.uid, and have userID NOT IN (SELECT uid FROM tblUser) return true unless userID is null? In that case, testing WHERE userID IS NULL is a much faster way to query.