Optimising a SELECT query that runs slow on Oracle which runs quickly on SQL Server

76,253

Solution 1

Often this type of problem goes away if you analyze the tables involved (so Oracle has a better idea of the distribution of the data)

ANALYZE TABLE tasks COMPUTE STATISTICS;

Solution 2

The "IN" - clause is known in Oracle to be pretty slow. In fact, the internal query optimizer in Oracle cannot handle statements with "IN" pretty good. try using "EXISTS":

SELECT orderID FROM tasks WHERE orderID NOT EXISTS 
    (SELECT DISTINCT orderID FROM tasks WHERE
         engineer1 IS NOT NULL AND engineer2 IS NOT NULL)`print("code sample");`

Caution: Please check if the query builds the same data results.

Edith says: ooops, the query is not well formed, but the general idea is correct. Oracle has to fulfill a full table scan for the second (inner) query, build the results and then compare them to the first (outer) query, that's why it's slowing down. Try

SELECT orderID AS oid FROM tasks WHERE NOT EXISTS 
    (SELECT DISTINCT orderID AS oid2 FROM tasks WHERE
         engineer1 IS NOT NULL AND engineer2 IS NOT NULL and oid=oid2)

or something similiar ;-)

Solution 3

I would try using joins instead

SELECT 
    t.orderID 
FROM 
    tasks  t
    LEFT JOIN tasks t1
        ON t.orderID =  t1.orderID
        AND t1.engineer1 IS NOT NULL 
        AND t1.engineer2 IS NOT NULL
WHERE
    t1.orderID IS NULL 

also your original query would probably be easier to understand if it was specified as:

SELECT orderID FROM orders WHERE orderID NOT IN 
(SELECT DISTINCT orderID FROM tasks WHERE
 engineer1 IS NOT NULL AND engineer2 IS NOT NULL)

(assuming you have orders table with all the orders listed)

which can be then rewritten using joins as:

SELECT 
    o.orderID 
FROM 
    orders o
    LEFT JOIN tasks t
        ON o.orderID =  t.orderID
        AND t.engineer1 IS NOT NULL 
        AND t.engineer2 IS NOT NULL
WHERE
    t.orderID IS NULL 

Solution 4

I agree with TZQTZIO, I don't get your query.

If we assume the query did make sense then you might want to try using EXISTS as some suggest and avoid IN. IN is not always bad and there are likely cases which one could show it actually performs better than EXISTS.

The question title is not very helpful. I could set this query up in one Oracle database and make it run slow and make it run fast in another. There are many factors that determine how the database resolves the query, object statistics, SYS schema statistics, and parameters, as well as server performance. Sqlserver vs. Oracle isn't the problem here.

For those interested in query tuning and performance and want to learn more some of the google terms to search are "oak table oracle" and "oracle jonathan lewis".

Solution 5

Some questions:

  • How many rows are there in tasks?
  • What indexes are defined on it?
  • Has the table been analyzed recently?

Another way to write the same query would be:

select orderid from tasks
minus
select orderid from tasks
where engineer1 IS NOT NULL AND engineer2 IS NOT NULL

However, I would rather expect the query to involve an "orders" table:

select orderid from ORDERS
minus
select orderid from tasks
where engineer1 IS NOT NULL AND engineer2 IS NOT NULL

or

select orderid from ORDERS
where orderid not in
( select orderid from tasks
  where engineer1 IS NOT NULL AND engineer2 IS NOT NULL
)

or

select orderid from ORDERS
where not exists
( select null from tasks
  where tasks.orderid = orders.orderid
  and   engineer1 IS NOT NULL OR engineer2 IS NOT NULL
)
Share:
76,253
RickL
Author by

RickL

Updated on July 05, 2022

Comments

  • RickL
    RickL almost 2 years

    I'm trying to run the following SQL statement in Oracle, and it takes ages to run:

    SELECT orderID FROM tasks WHERE orderID NOT IN 
    (SELECT DISTINCT orderID FROM tasks WHERE
     engineer1 IS NOT NULL AND engineer2 IS NOT NULL)
    

    If I run just the sub-part that is in the IN clause, that runs very quickly in Oracle, i.e.

    SELECT DISTINCT orderID FROM tasks WHERE
    engineer1 IS NOT NULL AND engineer2 IS NOT NULL
    

    Why does the whole statement take such a long time in Oracle? In SQL Server the whole statement runs quickly.

    Alternatively is there a simpler/different/better SQL statement I should use?

    Some more details about the problem:

    • Each order is made of many tasks
    • Each order will be allocated (one or more of its task will have engineer1 and engineer2 set) or the order can be unallocated (all its task have null values for the engineer fields)
    • I am trying to find all the orderIDs that are unallocated.

    Just in case it makes any difference, there are ~120k rows in the table, and 3 tasks per order, so ~40k different orders.

    Responses to answers:

    • I would prefer a SQL statement that works in both SQL Server and Oracle.
    • The tasks only has an index on the orderID and taskID.
    • I tried the NOT EXISTS version of the statement but it ran for over 3 minutes before I cancelled it. Perhaps need a JOIN version of the statement?
    • There is an "orders" table as well with the orderID column. But I was trying to simplify the question by not including it in the original SQL statement.

    I guess that in the original SQL statement the sub-query is run every time for each row in the first part of the SQL statement - even though it is static and should only need to be run once?

    Executing

    ANALYZE TABLE tasks COMPUTE STATISTICS;
    

    made my original SQL statement execute much faster.

    Although I'm still curious why I have to do this, and if/when I would need to run it again?

    The statistics give Oracle's cost-based optimzer information that it needs to determine the efficiency of different execution plans: for example, the number of rowsin a table, the average width of rows, highest and lowest values per column, number of distinct values per column, clustering factor of indexes etc.

    In a small database you can just setup a job to gather statistics every night and leave it alone. In fact, this is the default under 10g. For larger implementations you usually have to weigh the stability of the execution plans against the way that the data changes, which is a tricky balance.

    Oracle also has a feature called "dynamic sampling" that is used to sample tables to determine relevant statistics at execution time. It's much more often used with data warehouses where the overhead of the sampling it outweighed by the potential performance increase for a long-running query.

  • Maurício
    Maurício over 15 years
    No it's not. I made the same mistake :-) Each order has mutliple tasks, and if one of those tasks has an engineer assigned, the order counts as "allocated"
  • Mac
    Mac over 15 years
    I came to the same query (see my answer below), except for : * the subquery has no reason to select DISTINCT orderIDs. * drop the "orderID" between WHERE and NOT EXISTS (syntax error). * drop the 'print("code sample"), obviously ;-)
  • Georgi
    Georgi over 15 years
    I thought of that, too, but I think this query is not the fastest, too.
  • RickL
    RickL over 15 years
    I tried this but it was still running over a minute later when I cancelled it.
  • RickL
    RickL over 15 years
    When I try the second query it gets an error? ORA-00904: "OID2": invalid identifier
  • Georgi
    Georgi over 15 years
    Use the "AS oid2" I inserted, it was not in the query before.
  • Georgi
    Georgi over 15 years
    @Mac: Oooops, 'print("code sample")', hmmm... my only excuse is that I'm working on a new laptop and didn't see it :) - thanks for your hints!
  • Dave Costa
    Dave Costa over 15 years
    Your second proposal is NOT the same thing, since there can be multiple tasks per orderID, some assigned and some not. He only wants orderIDs for which NONE of the tasks are assigned.
  • Dave Costa
    Dave Costa over 15 years
    Your last proposal would need to have a correlated subquery (e.g. add "AND tasks.orderID = orders.orderID"). You can't simply change NOT IN to NOT EXISTS without modifying the subquery.
  • RickL
    RickL over 15 years
    Hmm, no, it still gives the same error... tried in SQL Plus and Toad Freeware.
  • Peter Dutton
    Peter Dutton over 15 years
    Yes, be careful, the minus statement uses alot of memory
  • Georgi
    Georgi over 15 years
    Uh, I don't have a Oracle on my laptop at the moment, but give this a try: Completely skip the "AS oid2" and end the query with "oid=orderID)" - if you cant fix it, eliminate it ;-)
  • David Aldridge
    David Aldridge over 15 years
    "The "IN" - clause is known in Oracle to be pretty slow. In fact, the internal query optimizer in Oracle cannot handle statements with "IN" pretty good. try using "EXISTS":" Nonense. Where's the justification for this?
  • RickL
    RickL over 15 years
    Thanks, I tried this and it is the correct syntax, but it was still running for over 3 minutes when I cancelled it.
  • RickL
    RickL over 15 years
    This doesn't work because it should only return orders where all tasks engineer fields are not null, but this will return orders where some tasks are engineer noll, and some tasks are engineer not null. Did you mean for the WHERE clause to be different?
  • RickL
    RickL over 15 years
    AJ gave the correct syntax for the NOT EXISTS version of the SQL statement, but it is still too slow (over 3 minutes).
  • kristof
    kristof over 15 years
    It think it returns the same result as the query that you specified: SELECT orderID FROM tasks WHERE orderID NOT IN (SELECT DISTINCT orderID FROM tasks WHERE engineer1 IS NOT NULL AND engineer2 IS NOT NULL) which I believe returns:All orders that do not have any tasks with both engineers assigned
  • kristof
    kristof over 15 years
    But perhaps I am missing something, if one assigned engineer is enough not to list the order then change the left join to read as: LEFT JOIN tasks t1 ON t.orderID = t1.orderID AND (t1.engineer1 IS NOT NULL OR t1.engineer2 IS NOT NULL) But that would be different from your original query.
  • David Aldridge
    David Aldridge over 15 years
    Nothing is necessary "bad". The difference between the benefits of different techniques almost always comes down to the value distributions of a particular data set and the presence or absence of indexes and constraints.
  • David Aldridge
    David Aldridge over 15 years
    This is an obsolete syntax for collecting statistics. DBMS_STATS is a more robust way. download.oracle.com/docs/cd/B19306_01/server.102/b14211/…
  • Ethan Post
    Ethan Post over 15 years
    I agree, this is obsolete. Please try to avoid using it. Use DBMS_STATS and make sure you also get your indexes, you can set cascade=>true when you gather stats for the table.
  • David Aldridge
    David Aldridge over 15 years
    Normally the implicit distinct would be unecessary overhead, but in this case it would probably be appropriate. Two scans of the table might be inefficient though.
  • Ishbir
    Ishbir over 15 years
    Typically, the engineer fields will be foreign keys, so the indexes should be there.
  • Jim Birchall
    Jim Birchall over 15 years
    @ΤΖΩΤΖΙΟΥ Thanks for the downvote, but did you read the question? "The tasks only has an index on the orderID and taskID"
  • Admin
    Admin over 15 years
    Dave's on a Dynamic Sampling Binge
  • Admin
    Admin over 15 years
    Georgi, where is your proof that IN is "slow" and the CBO can't handle it? If you're going to answer questions here, try to do so from a position of understanding, instead of one from hearsay and supposition
  • David Aldridge
    David Aldridge over 15 years
    I loves the dynamic sampling. I don't know it isn't spoken of more often.
  • E_the_Shy
    E_the_Shy over 14 years
    @Dave, only if your queries are big and infrequent. Otherwise you spend more time sampling that doing.
  • David Aldridge
    David Aldridge over 14 years
    Well it all bears measuring of course. DS impact depends on a lot of issues, and you can get information from a DS query that lasts a tiny fraction of a second that can take a query from 10 seconds down to a fraction of a second. For example dynamic sampling of a part of a query in which multiple predicates run against a small dimension table (public holidays on a Monday in a particular year? Women over 5'11''?) need not take long but can be very valuable. Long story short, it's just another tool to understand and appreciate for what it is.
  • usr
    usr over 14 years
    SQL Server allows only one NULL row in an index except you create a filtered index. In this case it does not include NULLs.
  • orbfish
    orbfish over 13 years
    Actually, IN is not so bad, it sometimes runs faster than EXISTS (probably due to caching, I've never tracked this down). NOT IN (or NOT EXISTS), however, can be notoriously slow.
  • Chad
    Chad almost 11 years
    Note, I tried Analyze today and it didn't do anything. DO NOT USE IT. Instead use DBMS_STATS instead. A query went from 5 minutes to 0.3 seconds with DBMS_STATS