SQL Server subqueries - where in vs where exists

Subqueries (also called subselects) are queries within queries, and are supported in most SQL databases.

For example, in January 2012 in the comp.lang.clarion there was a thread (titled "Need help with SQL select") in which Rocky Phelps asked:

I have a Cust table and a Cards table - Each Cust has one or multiple cards. I want to SELECT <some fields> from Cust WHERE the largest Card Number for that customer is > 6999999. In other words, I don't want to select customers whose largest Card Number is under 7000000.

Here the filter is actually not on the customer record but on child data. Geoff Bomford suggested this code, which uses a subquery within the Where clause:

SELECT <some fields>
FROM Customer
WHERE CustomerID IN (
 Select DISTINCT CustomerID
 FROM cards
 WHERE CardNumber > 7000000
)

Jon Waterhouse offered this alternative syntax using where exists:

select < somefields>
  from customer c
  where exists (select 1 from cards a where a.customerid = c.customerid and cardnumber > 7000000)

suggesting that this version "gives the optimiser more choice over what to actually do depending on what it knows about the data. Dropping the distinct from Geoff's query would also likely be more efficient."

Dennis Evans replied:

Maybe, hard to predict without details on the indexes that are in use. But as a general rule of thumb in MS Sql server the IN operator in a sub-query is typically a better choice. In some cases the exists operator must be used, but often selects that use the exist operator can be written with the IN operator and some times using the ANY or ALL modifiers.

This may result in a repeating query because of the use of the customer id, can also happen when using the IN operator. If the inner query depends on a value from the outer query then performance will probably go south because the inner query will execute for each value.

The distinct could be removed, I suspect the optimizer will ignore the statement because it does not have real use in this case, but again without a bunch of details hard to be certain. The IN operator will produce a list of rows from the table first and then the outer query will use that list for the comparisons.

Note Dennis's comment about the inner query executing for each value. You might think that's automatically going to happen, but if the server thinks it will be faster to retrieve all or some of the Cards rows it'll do that instead.

There are varying opinions on "where in" vs "where exists", but as this StackOverflow post suggests there's a good chance the query optimizer will create the same execution plan in both cases.

I'm told that the thing to watch out for is the use of Not Exists in subqueries, as this will always result in the subquery executing for each row in the parent query. In this situation you're probably better off with an outer join.

Also keep the cache in mind - if subsequent reads come from the cache, you're not seeing the actual query performance. For testing you may want to disable the cache. Just don't ever do that in production.

Arnor reminded the group about the Database Engine Tuning Advisor.

For more information on subqueries check out the Subquery Fundamentals page on MSDN.