Wednesday, October 21, 2009

SQL Subqueries with Null Values

Null values in data can often cause unexpected results. Recently I came across a case where the field in a NOT IN subquery contained null values, and I didn't get the behaviour I was expecting. After spending some time with this, I have it worked out, and the behaviour does make sense. It's worth having a look at.

We commonly use a NOT IN subquery to retrieve records that do not have a related record in another table. The common example of this type of query is to retrieve all customers that do not have any orders. Here's a simple example of a Customers table and an Orders table:
Customers                Orders
CustID FullName OrderID CustID OrderDate
1 John Doe 1 2 1/1/2009
2 Jane Doe 2 3 1/2/2009
3 Jack Smith 3 3 1/3/2009
4 Jane Smith
There are four customers, and two of these customers have placed orders. We can use IN and NOT IN subqueries like these:

select * from Customers where CustID in (select CustID from Orders)
2 Jane Doe
3 Jack Smith


select * from Customers where CustID not in (select CustID from Orders)
1 John Doe
4 Jane Smith

This is as we would expect. However, let's add a fourth record to the Orders table, with a null value for the CustID:
OrderID   CustID    OrderDate
4 null 1/4/2009
Now, when we run the IN query, the results are unchanged; we still get customers 2 and 3. However, when we run the NOT IN query:

select * from Customers where CustID not in (select CustID from Orders)
No records returned


Why are there no records returned? Shouldn't we still be getting customers 1 and 4, since these CustIDs do not appear in the Orders table? Well, let's look at how this gets handled. The subquery generates a list of CustIDs, like this:

select * from Customers where CustID not in ( 2, 3, null )

Logically, the NOT IN is treated as a series of not equals expressions, like this:

select * from Customers where ( CustID <> 2 and CustID <> 3 and CustID <> null )

Now, we can consider how this evaluates for our customer records. For customer 1, the where clause becomes:

1 <> 2 and 1 <> 3 and 1 <> null
= true and true and null
= null

This is why record 1 doesn't appear in the result set; for records to appear, the where clause must evaluate to true, not to null.

Now that I've gone through the logic on this, it's not really correct to say that customers 1 and 4 don't have any orders. The null CustID value in the Orders table means that the customer for that order is unknown, so we can't guarantee that this order doesn't belong to customer 1 or 4.

No comments:

Post a Comment