NOT EXISTS vs NOT IN

12,129

As you have rightly said the two are different things. If the subquery of items to not be IN contains NULL no results will be returned because nothing equals NULL and nothing does not equal NULL (Not even NULL).

Assuming you are using the two to achieve the same result, there is no difference between the two as long as you handle NULL values in your IN statement. The optimiser is clever enough to know that with NULL values eliminated, or with non nullable columns the two are the same, so use the same ANTI SEMI JOIN.

Consider these two tables:

CREATE TABLE T (ID INT NOT NULL PRIMARY KEY);
CREATE TABLE T2 (ID INT NOT NULL PRIMARY KEY);

These two queries get exactly the same execution plan:

SELECT  *
FROM    T
WHERE   ID NOT IN (SELECT ID FROM T2);

SELECT  *
FROM    T
WHERE   NOT EXISTS (SELECT ID FROM T2 WHERE T.ID = T2.ID);

because the optimiser knows T2.ID is a non nullable column. With a third table:

CREATE TABLE T3 (ID INT);

where the ID column is neither indexed or nullable these two queries render very different execution plans:

SELECT  *
FROM    T
WHERE   ID NOT IN (SELECT ID FROM T3);

SELECT  *
FROM    T
WHERE   NOT EXISTS (SELECT ID FROM T3 WHERE T.ID = T3.ID);

and NOT EXISTS will be much more efficient. However these two again yield (essentially) the same execution plan:

SELECT  *
FROM    T
WHERE   ID NOT IN (SELECT ID FROM T3 WHERE T3.ID IS NOT NULL);

SELECT  *
FROM    T
WHERE   NOT EXISTS (SELECT ID FROM T3 WHERE T.ID = T3.ID);

All these queries and sample data are on SQL Fiddle

EDIT

To actually answer your question:

Case 1 will be the same performance with NOT IN or NOT EXISTS if tracked_session_id is a non nullable column in data.conversions, or you add WHERE tracked_Session_id IS NOT NULL inside the In statement. If the column is not nullable and you don't exclude null values the performance won't be the same, and assuming there are no nulls NOT EXISTS will perform better, if there are no nulls the result won't be the same so the performances are not comparable.

Case 2 actually surprised me with sample data, I had assumed that this would not be optimised into an ANTI SEMI JOIN, and had already written an answer saying as much, but just before saving the edit I thought I'd better check, and was surprised to see that this:

SELECT  *
FROM    T
WHERE   (   SELECT  COUNT(*) 
            FROM    T3
            WHERE   T.ID = T3.ID
        ) = 0;

Is optimised exactly the same as NOT EXISTS. So it appears the optimiser is even more clever than I thought, it will only generate a different plan if you want the count to be something other than 0.

SQL Fiddle for Case 2

Share:
12,129
managedheap84
Author by

managedheap84

Software Developer from the North East of England.

Updated on June 04, 2022

Comments

  • managedheap84
    managedheap84 almost 2 years

    I'm optimizing some SQL queries (this could be considered part 2 of a question i recently posted) and replacing some NOT IN with NOT EXISTS predicates

    Am I right in thinking that the main benefit to doing so is that with NOT EXISTS you get the benefit that the statement will terminate when a single match is found, but NOT IN with a counting subquery would would have to do a full table scan?

    It also seems that NOT IN would also require extra work if the data selected contained NULLs, is this correct?

    I need to make sure that the second statement better than the first (and functionally equivalent) in these two cases before I implement them in the proc:

    Case 1:

            --exclude sessions that were tracked as part of a conversion during the last response_time minutes
            -- AND session_id NOT IN (SELECT DISTINCT tracked_session_id    
            --                              FROM data.conversions WITH (NOLOCK)
            --                              WHERE client_id = @client_id
            --                              AND utc_date_completed >= DATEADD(minute, (-2) * cy.response_time, @date)
            --                              AND utc_date_completed <= @date     
            --                              AND utc_date_clicked <= @date)
    
            AND NOT EXISTS (SELECT 1
                                FROM data.conversions WITH (NOLOCK)
                                WHERE client_id = @client_id
                                AND utc_date_completed >= DATEADD(minute, (-2) * cy.response_time, @date)
                                AND utc_date_completed <= @date
                                AND utc_date_clicked <= @date
                                AND data.conversions.tracked_session_id = d.session_id
            )
    

    Case 2:

            -- NOT EXISTS vs full table scan with COUNT(dashboard_id)                                   
            -- AND (SELECT COUNT(dashboard_id)
            --          FROM data.dashboard_responses WITH(NOLOCK)
            --          WHERE session_id = d.session_id
            --          AND cycle_id = cy.id
            --          AND client_id = @client_id) = 0
    
            AND NOT EXISTS(SELECT 1
                                FROM data.dashboard_responses
                                WHERE session_id = d.session_id
                                AND cycle_id = cy.id
                                AND client_id = @client_id)
    

    Cheers

  • Chris Chilvers
    Chris Chilvers over 11 years
    In mssql 2005 they added an optimisation for count(*) = 0 and count(*) > 0 due to many people using that syntax for IF statements. Though exists is the better choice due to being technically correct. blogs.technet.com/b/wardpond/archive/2007/08/27/…