Null values in data can often cause unexpected results. Recently I came across a case where the field in a NOT IN subquery contained null values, and I didn't get the behaviour I was expecting. After spending some time with this, I have it worked out, and the behaviour does make sense. It's worth having a look at.
We commonly use a NOT IN subquery to retrieve records that do not have a related record in another table. The common example of this type of query is to retrieve all customers that do not have any orders. Here's a simple example of a Customers table and an Orders table:
Customers Orders
CustID FullName OrderID CustID OrderDate
1 John Doe 1 2 1/1/2009
2 Jane Doe 2 3 1/2/2009
3 Jack Smith 3 3 1/3/2009
4 Jane Smith
There are four customers, and two of these customers have placed orders. We can use IN and NOT IN subqueries like these:
select * from Customers where CustID in (select CustID from Orders)
2 Jane Doe
3 Jack Smithselect * from Customers where CustID not in (select CustID from Orders)
1 John Doe
4 Jane Smith
This is as we would expect. However, let's add a fourth record to the Orders table, with a null value for the CustID:
OrderID CustID OrderDate
4 null 1/4/2009
Now, when we run the IN query, the results are unchanged; we still get customers 2 and 3. However, when we run the NOT IN query:
select * from Customers where CustID not in (select CustID from Orders)
No records returnedWhy are there no records returned? Shouldn't we still be getting customers 1 and 4, since these CustIDs do not appear in the Orders table? Well, let's look at how this gets handled. The subquery generates a list of CustIDs, like this:
select * from Customers where CustID not in ( 2, 3, null )Logically, the NOT IN is treated as a series of not equals expressions, like this:
select * from Customers where ( CustID <> 2 and CustID <> 3 and CustID <> null )Now, we can consider how this evaluates for our customer records. For customer 1, the where clause becomes:
1 <> 2 and 1 <> 3 and 1 <> null= true and true and null= nullThis is why record 1 doesn't appear in the result set; for records to appear, the where clause must evaluate to true, not to null.
Now that I've gone through the logic on this, it's not really correct to say that customers 1 and 4 don't have any orders. The null CustID value in the Orders table means that the customer for that order is unknown, so we can't guarantee that this order doesn't belong to customer 1 or 4.