The diagram below shows two data sets, with differences highlighted:
To find changed rows using T-SQL, we might write a query like this:
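This is the query, repeated here from the test script at the end of the post:

```sql
-- Incorrect results; does not account for NULLs
SELECT
    *
FROM @Set1 AS t
JOIN @Set2 AS s ON
    s.pk = t.pk
WHERE
    s.ival <> t.ival
    OR s.cval <> t.cval
    OR s.mval <> t.mval;
```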
The logic is clear: Join rows from the two sets together on the primary key column, and return rows where a change has occurred in one or more data columns.
Unfortunately, this query only finds one of the expected four rows:
The problem is that our query does not correctly handle NULLs.
The ‘not equal to’ operators <> and != do not evaluate to TRUE if one or both of the expressions concerned is NULL.
In this example, that statement is always true because we are comparing column references. In other circumstances, the behaviour might depend on the ANSI_NULLS setting. I am not going to go into that here, partly because new code should be written to assume SET ANSI_NULLS ON.
One obvious way to handle the NULLs in our example is to write all the conditions out in full:
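Here is that query, repeated from the test script at the end of the post:

```sql
-- Correct, but verbose and error-prone
SELECT
    *
FROM @Set1 AS t
JOIN @Set2 AS s ON
    s.pk = t.pk
WHERE
    s.ival <> t.ival
    OR (s.ival IS NULL AND t.ival IS NOT NULL)
    OR (s.ival IS NOT NULL AND t.ival IS NULL)
    OR s.cval <> t.cval
    OR (s.cval IS NULL AND t.cval IS NOT NULL)
    OR (s.cval IS NOT NULL AND t.cval IS NULL)
    OR s.mval <> t.mval
    OR (s.mval IS NULL AND t.mval IS NOT NULL)
    OR (s.mval IS NOT NULL AND t.mval IS NULL);
```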
That query produces the correct results (only the row with PK = 4 is identical in both sets):
Even with just three columns to check, the query is already getting quite long. It is also quite easy to miss one of the combinations or to misplace a parenthesis.
In an attempt to reduce the size of the query, we might be tempted to use ISNULL or COALESCE:
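The COALESCE version, repeated from the test script at the end of the post:

```sql
-- COALESCE: Correct results, but problematic
SELECT
    *
FROM @Set1 AS t
JOIN @Set2 AS s ON
    s.pk = t.pk
WHERE
    COALESCE(s.ival, -2147483648) <> COALESCE(t.ival, -2147483648)
    OR COALESCE(s.cval, '¥') <> COALESCE(t.cval, '¥')
    OR COALESCE(s.mval, $-922337203685477.5808) <> COALESCE(t.mval, $-922337203685477.5808);
```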
The idea there is to replace any NULLs in the data with a particular non-null value that allows the comparison to work as we expect.
This produces the correct result (with the test data) but there are a number of reasons to dislike this approach.
For one thing, we have to be careful to choose a special replacement value that can never appear in the real data, now or at any time in the future. Taking this approach is no different, in principle, from choosing not to store NULLs in the first place, and using the chosen special value instead.
Leaving the ‘to null or not to null’ debate to one side, another issue is that COALESCE returns a value typed according to the input expression that has the highest data type precedence. Many times, this will not matter, but it is possible to introduce subtle bugs this way. Using ISNULL instead would avoid this issue by ensuring that the data type of the first expression is used, but the problem of choosing appropriate special values remains.
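For comparison, the ISNULL form of the query (from the test script at the end of the post) is:

```sql
-- ISNULL: Correct results, but problematic
SELECT
    *
FROM @Set1 AS t
JOIN @Set2 AS s ON
    s.pk = t.pk
WHERE
    ISNULL(s.ival, -2147483648) <> ISNULL(t.ival, -2147483648)
    OR ISNULL(s.cval, '¥') <> ISNULL(t.cval, '¥')
    OR ISNULL(s.mval, $-922337203685477.5808) <> ISNULL(t.mval, $-922337203685477.5808);
```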
The final objection I have to this method is a bit more subjective: although the query looks simpler than before, COALESCE is just shorthand for a CASE expression.
Let’s compare the execution plans for the two queries:
That’s the plan for the query with all the conditions written out in full. The Filter operator contains this mess:
The COALESCE form execution plan looks like this:
Now, the complexity is split between two operators. The Compute Scalar contains the definitions shown below on the left, and the Clustered Index Seek contains the residual predicate shown on the right:
Using ISNULL instead makes some difference: the graphical plan is visually identical to that obtained by using COALESCE, but the defined values and residual predicate are somewhat more concise:
An Alternative Syntax
We are looking for rows that join based on the value of the PK column, but which contain a difference in at least one column.
Another way to state that is to say:
For each row that joins, the intersection of the two rows should be empty.
If the two rows are identical, the intersection of the two rows will be a single row. Conversely, if the two rows are different in any way, the intersection will be empty.
Writing that logic in T-SQL results in this query form:
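The query, repeated from the test script at the end of the post:

```sql
-- INTERSECT:
-- Correct results in a compact form
SELECT
    *
FROM @Set1 AS t
JOIN @Set2 AS s ON
    s.pk = t.pk
WHERE
    NOT EXISTS (SELECT s.* INTERSECT SELECT t.*);
```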
The query accounts for NULLs correctly, and produces the correct result. The query plan looks like this:
There are no surprises in the Clustered Index Scan, Clustered Index Seek, or Inner Join. In particular, none of these operators define any expressions or apply any residual predicates.
The seek is the one we would expect: it seeks to the row with the matching PK column value.
Looking at the Constant Scan reveals nothing, literally. This operator produces a single in-memory row with no columns. There really is nothing to see there, so we will move on to the last remaining operator: the Left Anti Semi Join.
If you were expecting to see complex logic similar to the CASE expressions seen earlier, prepare for a disappointment. The anti semi join contains the following:
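In sketch form (the notation is simplified; the exact column labels in your plan will differ), the predicate amounts to:

```sql
-- Anti semi join predicate (sketch): plain equality tests only
[s].pk = [t].pk
AND [s].ival = [t].ival
AND [s].cval = [t].cval
AND [s].mval = [t].mval
```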
Aside from a (redundant) check that the two PK columns match, this predicate just checks that all three data column values match, using an equality (=) comparison. (As a side note, we can avoid the redundant check on PK values by specifying just the data columns in the INTERSECT sub-query, rather than using the star * syntax.)
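For illustration (this variation is not in the test script), the sub-query with an explicit column list would look like this:

```sql
-- Avoids the redundant pk comparison in the anti semi join predicate
WHERE
    NOT EXISTS
    (
        SELECT s.ival, s.cval, s.mval
        INTERSECT
        SELECT t.ival, t.cval, t.mval
    );
```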
To see how this works, consider that a left anti semi join passes rows through where no row is found on the inner input.
In this case, a row (with no columns) is always provided by the Constant Scan, but the predicate shown above is applied to it before the anti semi join decides whether to pass the source row on or not.
If all the conditions evaluate to TRUE, the no-column row from the Constant Scan is retained. The anti semi join finds that row on its inner input, and the source row does not pass on to the query output.
The net effect is that if the two rows match in all columns, no row appears on the output. In all other cases (where at least one difference exists) the current row is returned by the execution plan.
This is the correct semantic, so the query returns the correct result.
NULL handling trickery
At this point, you might be wondering how this query plan manages to handle NULLs correctly.
Consider the rows in the source tables with PK = 4. Both rows are identical, but only if the NULLs present in the ival column compare equal.
The relevant part of the predicate shown above is:
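That is the ival comparison; in the same sketch notation as before:

```sql
[s].ival = [t].ival
```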
In the case where both columns contain NULL, we would expect this equality comparison to return UNKNOWN, not the TRUE needed to ensure that the anti semi join does not pass the source row to the output.
In other words, unless this equality comparison is doing something unusual, we would expect the query to incorrectly return a row for PK = 4, because the NULLs in the ival column should not compare equal.
The reason it works lies in the way INTERSECT handles NULLs. According to the documentation, when INTERSECT (or EXCEPT) compares rows, two NULL values are considered equal. That explains why the INTERSECT query produces correct results, but it does not say how the query plan achieves this.
Before we see the details of that, let’s look at what happens if we try to write the query using the same logic as the INTERSECT plan:
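That query, repeated from the test script at the end of the post:

```sql
-- NOT EXISTS:
-- Same query plan, but wrong results!
SELECT
    *
FROM @Set2 AS s
JOIN @Set1 AS t ON
    t.pk = s.pk
WHERE
    NOT EXISTS
    (
        SELECT 1
        WHERE
            t.pk = s.pk
            AND t.ival = s.ival
            AND t.cval = s.cval
            AND t.mval = s.mval
    );
```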
That query produces the exact same graphical query plan as the INTERSECT query:
Even the predicate on the Anti Semi Join is identical:
Despite the identical plan, this new query produces wrong results! It includes the row with PK = 4 in the output, due to the problem comparing the NULLs in those rows:
The Answer
Although the graphical query plan (and even the extra detail available in the SSMS Properties window) shows no difference between the INTERSECT and NOT EXISTS queries, there is a difference: one that implements the different comparison semantics involved.
In the INTERSECT form, the equality comparison must compare two NULLs as equal. In the NOT EXISTS form, we are using a regular = comparison, one that should return UNKNOWN when two NULLs are compared.
To see this difference, we have to look deeper into the execution plan than the graphical form or Properties window can take us.
Inspecting the XML behind the graphical plan, we see the following logic in both cases for the test on the PK column values:
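The fragment looks something like this (a sketch only; the exact attributes and nesting vary between product versions):

```xml
<Compare CompareOp="EQ">
  <ScalarOperator>
    <Identifier>
      <ColumnReference Table="@Set2" Alias="[s]" Column="pk" />
    </Identifier>
  </ScalarOperator>
  <ScalarOperator>
    <Identifier>
      <ColumnReference Table="@Set1" Alias="[t]" Column="pk" />
    </Identifier>
  </ScalarOperator>
</Compare>
```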
Notice the compare operation is EQ, a test for equality between the two column references. The EQ test does not return true for NULLs.
In the NOT EXISTS form of the query, the other columns are compared in exactly the same way, using EQ comparisons. For example, this is the test on the ival column:
Now look at the XML for the ival column comparison in the INTERSECT query:
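Again in sketch form (attribute details vary between versions):

```xml
<Compare CompareOp="IS">
  <ScalarOperator>
    <Identifier>
      <ColumnReference Table="@Set2" Alias="[s]" Column="ival" />
    </Identifier>
  </ScalarOperator>
  <ScalarOperator>
    <Identifier>
      <ColumnReference Table="@Set1" Alias="[t]" Column="ival" />
    </Identifier>
  </ScalarOperator>
</Compare>
```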
Now the compare operation is shown as IS instead of EQ.
This is the reason that NULLs compare equal in the INTERSECT test. It is using the comparison semantic familiar to T-SQL users from expressions like WHERE x IS NULL.
This is the SQL language IS DISTINCT FROM feature, implemented in the query processor but not yet available in the T-SQL language.
I would like to see the internal language of the query processor exposed as an alternative to T-SQL. The query processor’s query specification language is much richer and more expressive than T-SQL, and would not be bound by some of the odd behaviours maintained in T-SQL for backward compatibility. I live in hope, but I’m not holding my breath…
© Paul White
email: SQLkiwi@gmail.com
twitter: @SQL_Kiwi
Test script
USE Sandpit;
DECLARE @Set1 table
(
pk bigint PRIMARY KEY,
ival integer NULL,
cval char(1) NULL,
mval money NULL
);
DECLARE @Set2 TABLE
(
pk bigint PRIMARY KEY,
ival integer NULL,
cval char(1) NULL,
mval money NULL
);
INSERT @Set1
(pk, ival, cval, mval)
VALUES
(1, 1000, 'a', $1),
(2, NULL, 'b', $2),
(3, 3000, 'c', NULL),
(4, NULL, 'd', $4),
(5, 5000, 'e', $5);
INSERT @Set2
(pk, ival, cval, mval)
VALUES
(1, 1000, 'a', NULL),
(2, 2000, 'b', $2),
(3, NULL, 'c', $3),
(4, NULL, 'd', $4),
(5, 5999, 'z', $5);
-- Incorrect results, doesn't account for NULLs
SELECT
*
FROM @Set1 AS t
JOIN @Set2 AS s ON
s.pk = t.pk
WHERE
s.ival <> t.ival
OR s.cval <> t.cval
OR s.mval <> t.mval;
-- Correct, but verbose and error-prone
SELECT
*
FROM @Set1 AS t
JOIN @Set2 AS s ON
s.pk = t.pk
WHERE
s.ival <> t.ival
OR (s.ival IS NULL AND t.ival IS NOT NULL)
OR (s.ival IS NOT NULL AND t.ival IS NULL)
OR s.cval <> t.cval
OR (s.cval IS NULL AND t.cval IS NOT NULL)
OR (s.cval IS NOT NULL AND t.cval IS NULL)
OR s.mval <> t.mval
OR (s.mval IS NULL AND t.mval IS NOT NULL)
OR (s.mval IS NOT NULL AND t.mval IS NULL);
-- COALESCE: Correct results, but problematic
SELECT
*
FROM @Set1 AS t
JOIN @Set2 AS s ON
s.pk = t.pk
WHERE
COALESCE(s.ival, -2147483648) <> COALESCE(t.ival, -2147483648)
OR COALESCE(s.cval, '¥') <> COALESCE(t.cval, '¥')
OR COALESCE(s.mval, $-922337203685477.5808 ) <> COALESCE(t.mval, $-922337203685477.5808);
-- ISNULL: Correct results, but problematic
SELECT
*
FROM @Set1 AS t
JOIN @Set2 AS s ON
s.pk = t.pk
WHERE
ISNULL(s.ival, -2147483648) <> ISNULL(t.ival, -2147483648)
OR ISNULL(s.cval, '¥') <> ISNULL(t.cval, '¥')
OR ISNULL(s.mval, $-922337203685477.5808 ) <> ISNULL(t.mval, $-922337203685477.5808);
-- INTERSECT:
-- Correct results in a compact form
SELECT
*
FROM @Set1 AS t
JOIN @Set2 AS s ON
s.pk = t.pk
WHERE
NOT EXISTS (SELECT s.* INTERSECT SELECT t.*);
-- NOT EXISTS:
-- Same query plan, but wrong results!
SELECT
*
FROM @Set2 AS s
JOIN @Set1 AS t ON
t.pk = s.pk
WHERE
NOT EXISTS
(
SELECT 1
WHERE
t.pk = s.pk
AND t.ival = s.ival
AND t.cval = s.cval
AND t.mval = s.mval
);
Paul,
I ran across this almost by accident, after years of using HASHBYTES with the XML clause as a poor man's row change identifier. The internet is littered with struggles in this area. I can't believe I JUST ran across it. I'm surprised this isn't linked to more often.
Considering this was the 2011 time frame, is there anything new(er) you've written on the subject or seen that you tend to lean more towards today?
Thanks for taking the time out to post this.
--J
No, there's nothing newer. The information in this posting is still accurate, and I use it every day.
The post does get cited and linked to a lot (e.g. https://stackoverflow.com/a/19529391) and is gradually becoming "common knowledge", I think. Still, I will accept I don't do the best SEO :).
One day SQL Server might support `IS DISTINCT FROM`.
Looks like the OR NULL AND NULL syntax optimizes down these days also, tested on 2016
'IS DISTINCT FROM' is now available in the 2022 preview but, curiously, missing from all the Azure SQL offerings. This is highly unusual, since it's my understanding that the on-prem kernel is now considered the "cloud-scale tested" product of Azure functionality. Does anyone have an idea of the specifics of this unusual circumstance?
Dwainew,
Some features and fixes (preview or otherwise) appear in the box product first; some (most) appear on cloud versions first. There has been a preference to release cloud first in the past, but it's not an absolute rule.
I remember using COALESCE / ISNULL in such cases, while this trick is briefer in syntax and has better performance. Neat little gem to be added to one's knowledge base.
Thanks. Yes, and as I say in the article, using either of those requires choosing a special replacement value that can *never* appear in the real data, now or at any time *in the future*. That can be tricky to guarantee.
Hi,
For this problem, I use the EXCEPT clause. Is there anything fundamentally wrong with it? It seems the simplest solution to me, but I might be wrong:
SELECT * FROM @Set1 AS t EXCEPT SELECT * FROM @Set2 AS s
If you want the exact same results as in your query (i.e. columns from both tables), you could write a CTE:
WITH DIFF as (SELECT * FROM @Set1 AS t EXCEPT SELECT * FROM @Set2 AS s) --- DIFFERENT ROWS
SELECT t.*,s.* FROM @Set1 AS t JOIN @Set2 AS s ON s.pk = t.pk --- JUST THE COLUMNS YOU WANTED
JOIN DIFF on DIFF.pk = t.pk --- ONLY KEEP THE DIFFERENT ROWS
The execution plan to have the identical results seems longer (3 nested loops), but as there are very few rows, it is not really indicative of how it scales with larger datasets. The plan does the left-anti-join loop first.
Would you prefer one solution or the other?
Thanks
zenzei,
No, there's nothing wrong with EXCEPT; it's the same idea.
The CTE code you gave that produces the desired results seems a lot less efficient and harder to read than the original:
SELECT *
FROM @Set1 AS t
JOIN @Set2 AS s ON s.pk = t.pk
WHERE NOT EXISTS (SELECT s.* INTERSECT SELECT t.*);
You could write the WHERE clause as "EXISTS (SELECT s.* EXCEPT SELECT t.*);"
I usually find INTERSECT gives a slightly cleaner execution plan.
Hi Paul,
I was mainly comparing the short versions:
SELECT * FROM @Set1 AS t EXCEPT SELECT * FROM @Set2 AS s --seems quite clean and short...
versus
SELECT t.* --only keeping columns from Set1
FROM @Set1 AS t
JOIN @Set2 AS s ON s.pk = t.pk
WHERE NOT EXISTS (SELECT s.* INTERSECT SELECT t.*);
When comparing these two, in the short examples you provided, the except version seemed to be shorter to write and with a simpler plan.
But of course, life is all about trying with real volumes, so I tested the above scenario (keeping only columns from Set1) with larger tables.
Three queries to compare:
1 - Your query with WHERE...NOT EXISTS... INTERSECT...
2 - Same query with WHERE...EXISTS...EXCEPT
3 - Simple query only with EXCEPT
I then created two #set1, #set2 tables with 13M rows, with the following characteristics:
Test 1: HEAP tables
Test 2: clustered columnstore indexed
Results:
- Queries 1 and 2 run much faster than Query 3 (the simple EXCEPT): a couple of seconds vs 9 seconds (test 1) or 20 seconds (test 2). A huge difference.
- Query 1 is still faster than Query 2 (1 second vs 2 seconds). Queries 1 and 2 generate different plans; Query 2 has one more nested loop.
- A clustered columnstore makes Query 3 much slower (it needs a sort), while Queries 1 and 2 stay fast.
So, as I was secretly expecting, your plan is better and much faster.
Now, there is only one caveat: Queries 1 and 2 rely on having an *unique* PK (or a combination of columns that act as such that you can put in the join).
If there are duplicates or NULLs in the joining column(s), Queries 1 and 2 will report both rows as being different, while Query 3 (the one using only EXCEPT) will correctly report only the differences.
So Query 3 is more generic for these cases, but if you have PK the best version by far is the one you provided.
Hope this is useful, thanks for all your blogging and writing, I learnt a lot from them.
It seems that the IS operator is only exhibited in the plan if the comparison is between two columns, not between a column and a variable. I get seemingly identical plans for the three variations below. The result is nevertheless not the same for the first one as the latter two:
DROP TABLE IF EXISTS DemoTable
CREATE TABLE DemoTable(a int NOT NULL,
b int NOT NULL,
c int NULL,
d int NULL,
CONSTRAINT pk_DemoTable PRIMARY KEY (a, b),
INDEX c_d_ix NONCLUSTERED (c, d)
)
go
INSERT DemoTable (a, b, c, d)
SELECT object_id, column_id, iif(max_length = -1, NULL, abs(checksum(*) % 200)),
IIf(len(name) = 8, NULL, abs(checksum(*) % 100))
FROM sys.columns
go
CREATE OR ALTER PROCEDURE GetDataByCD @c int, @d int AS
SELECT a, b, c, d
FROM DemoTable
WHERE c = @c
AND d = @d
go
EXEC GetDataByCD NULL, NULL
go
CREATE OR ALTER PROCEDURE GetDataByCD @c int, @d int AS
SELECT a, b, c, d
FROM DemoTable
WHERE (c = @c OR c IS NULL AND @c IS NULL)
AND (d = @d OR d IS NULL AND @d IS NULL)
go
EXEC GetDataByCD NULL, NULL
go
CREATE OR ALTER PROCEDURE GetDataByCD @c int, @d int AS
SELECT a, b, c, d
FROM DemoTable
WHERE EXISTS (SELECT c, d INTERSECT SELECT @c, @d)
go
EXEC GetDataByCD NULL, NULL
Hi Erland,
It's not so much what the comparison is between (column, variable, parameter...); rather, it is the accuracy of the showplan representation of the internal plan.
The comparison is always correctly IS or EQ depending on the requested semantic.
In your case, it is the way the seek predicate comparison is rendered. Remove the nonclustered index to see EQ or IS in showplan. Alternatively, add OPTION (QUERYTRACEON 3604, QUERYTRACEON 8607) to the statement in the procedure to see ScaOp_Comp x_cmpIs or x_cmpEq in the optimizer output.
Cheers,
Paul