The diagram below shows two data sets, with differences highlighted:
To find changed rows using T-SQL, we might write a query like this:
The logic is clear: join the two sets on the primary key column, and return rows where one or more data columns have changed.
Unfortunately, this query only finds one of the expected four rows:
The problem is that our query does not correctly handle NULLs.
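As a rough sketch (the table and column names dbo.Source, dbo.Target, pk, col1, and col2 are assumed purely for illustration), a comparison written with plain <> predicates misses rows where either side is NULL, while a NOT EXISTS … INTERSECT rewrite treats NULLs on both sides as equal:

-- Plain inequality: any comparison involving NULL does not evaluate to TRUE,
-- so rows that differ only by a NULL are never returned
SELECT T.pk
FROM dbo.Target AS T
JOIN dbo.Source AS S
    ON S.pk = T.pk
WHERE
    S.col1 <> T.col1
    OR S.col2 <> T.col2;

-- NULL-safe alternative: INTERSECT treats NULLs as equal,
-- so 'changed' simply means the two rows do not intersect
SELECT T.pk
FROM dbo.Target AS T
JOIN dbo.Source AS S
    ON S.pk = T.pk
WHERE NOT EXISTS
(
    SELECT S.col1, S.col2
    INTERSECT
    SELECT T.col1, T.col2
);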
You might have noticed that January was a quiet blogging month for me.
Part of the reason was that I was working on an article for Simple Talk, looking at how parallel query execution really works. The first part is published today at:
It’s a curious thing about SQL that the SUM or AVG of no items (an empty set) is not zero; it’s NULL.
In this post, you’ll see how this means your SUM and AVG calculations might run at half speed, or worse. As usual though, this entry is not so much about the result, but the journey we take to get there.
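A minimal illustration of that behaviour:

-- SUM and AVG over an empty set return NULL; COUNT returns zero
SELECT
    SUM(V.n)   AS sum_result,    -- NULL
    AVG(V.n)   AS avg_result,    -- NULL
    COUNT(V.n) AS count_result   -- 0
FROM (VALUES (1), (2), (3)) AS V (n)
WHERE V.n > 100;   -- no rows qualify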
There is much more to query tuning than reducing logical reads and adding covering nonclustered indexes. Query tuning is not complete as soon as the query returns results quickly in the development or test environments.
In production, your query will compete for memory, CPU, locks, I/O, and other resources on the server. Today’s post looks at some tuning considerations that are often overlooked, and shows how deep internals knowledge can help you write better T-SQL.
Is it possible to see LOB (large object) logical reads from STATISTICS IO output on a table with no LOB columns?
I was asked this question today by someone who had spent a good fraction of their afternoon trying to work out why this was occurring — even going so far as to re-run DBCC CHECKDB to see if corruption was the cause.
The table in question wasn’t particularly pretty. It had grown somewhat organically over time, with new columns being added every so often as the need arose.
Nevertheless, it remained a simple structure with no LOB columns — no text or image, no xml, no max types — nothing aside from ordinary integer, money, varchar, and datetime types.
To add to the air of mystery, not every query that ran against the table would report LOB logical reads — just sometimes — but when it did, the query often took much longer to execute.
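One way this can happen (a hedged sketch, and not necessarily what was going on in the table above): variable-length columns pushed off-row into the ROW_OVERFLOW_DATA allocation unit are reported under the LOB counters by STATISTICS IO, even though the table declares no LOB types.

-- Two 5,000-byte varchar values cannot both fit within the 8,060-byte
-- in-row limit, so one value is stored off-row (row-overflow)
CREATE TABLE dbo.NoLobColumns
(
    id integer IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    a varchar(5000) NOT NULL,
    b varchar(5000) NOT NULL
);

INSERT dbo.NoLobColumns (a, b)
VALUES (REPLICATE('a', 5000), REPLICATE('b', 5000));

SET STATISTICS IO ON;
SELECT N.a, N.b FROM dbo.NoLobColumns AS N;   -- reports lob logical reads
SET STATISTICS IO OFF;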
A seek can contain one or more seek predicates, each of which can either identify (at most) one row in a unique index (a singleton lookup) or a range of values (a range scan).
When looking at an execution plan, we often need to look at the details of the seek operator in the Properties window to see how many operations it is performing, and what type of operation each one is.
As seen in When is a Seek not a Seek?, the first post of this mini-series, the number of hidden seeking operations can have an appreciable impact on performance.
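For example (a sketch, assuming a table dbo.Example with a unique clustered index on pk), an IN list can compile to a single seek operator that performs several singleton lookups, while a BETWEEN predicate compiles to a single range scan:

-- One seek operator, three singleton lookups
SELECT E.pk
FROM dbo.Example AS E
WHERE E.pk IN (1, 5, 9);

-- One seek operator, one range scan
SELECT E.pk
FROM dbo.Example AS E
WHERE E.pk BETWEEN 100 AND 200;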
You are probably most familiar with the terms ‘Seek’ and ‘Scan’ from the graphical plans produced by SQL Server Management Studio (SSMS), and you might look to the SSMS tool-tip descriptions to explain the difference between them:
Both descriptions mention scans and ranges (nothing about seeks), and the Index Seek description perhaps implies that it will not scan the entire index (which isn’t necessarily true). Not massively helpful.
The following script creates a single-column clustered table containing the integers from 1 to 1,000 inclusive.
IF OBJECT_ID(N'tempdb..#Test', N'U') IS NOT NULL
BEGIN
    DROP TABLE #Test
END;
GO
CREATE TABLE #Test
(
    id integer PRIMARY KEY CLUSTERED
);

INSERT #Test (id)
SELECT
    V.number
FROM master.dbo.spt_values AS V
WHERE
    V.[type] = N'P'
    AND V.number BETWEEN 1 AND 1000;
Let’s say we are given the following task:
Find the rows with values from 100 to 170, excluding any values that divide exactly by 10.
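One direct way to express that against the #Test table:

SELECT T.id
FROM #Test AS T
WHERE
    T.id BETWEEN 100 AND 170
    AND T.id % 10 <> 0;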
I saw a question asked recently on the #sqlhelp hash tag:
Might SQL Server retrieve (out-of-row) LOB data from a table, even if the column isn’t referenced in the query?
Leaving aside trivial cases like selecting a computed column that does reference the LOB data, one might be tempted to say that no, SQL Server does not read data you haven’t asked for.
In general, that is correct; however, there are cases where SQL Server might sneakily read a LOB column.
Brad Schulz recently wrote about optimizing a query run against tables with no indexes at all. The problem was, predictably, that performance was not very good. The catch was that we were not allowed to create any indexes (or even new statistics) as part of our optimization efforts.
In this post, I’m going to look at the problem from a different angle, and present an alternative solution to the one Brad found.
Myth: SQL Server Caches a Serial Plan with every Parallel Plan
Many people believe that whenever SQL Server creates an execution plan that uses parallelism, an alternative serial plan is also cached.
The idea seems to be that the execution engine then decides between the parallel and serial alternatives at runtime. I’ve seen this on forums, in blogs, and even in books.
In fairness, a lot of the official documentation is not as clear as it might be on the subject. In this post I will show that only a single (parallel) plan is cached. I will also show that SQL Server can execute a parallel plan on a single thread.
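If you want to check the plan cache for yourself, here is a rough sketch of one way to do it (replace the LIKE pattern with some distinctive text from the query under test):

-- Count and inspect the cached plans matching a given query text
SELECT
    cp.usecounts,
    cp.cacheobjtype,
    cp.objtype,
    qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
WHERE st.[text] LIKE N'%distinctive query text here%';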
This post covers a little-known locking optimization that provides a surprising answer to the question:
If I hold an exclusive lock on a row, can another transaction running at the default read committed isolation level read it?
Most people would answer ‘no’, on the basis that the read would block when it tried to acquire a shared lock. Others might respond that it depends on whether the READ_COMMITTED_SNAPSHOT database option was in effect, but let’s assume that is not the case, and we are dealing simply with the default (locking) read committed isolation level.
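As a sketch of how you might set up the experiment yourself (table and column names are assumed), hold an exclusive row lock in one session and attempt the read from another:

-- Session 1: take and hold an exclusive row lock
BEGIN TRANSACTION;

    SELECT D.SomeColumn
    FROM dbo.Demo AS D WITH (XLOCK, ROWLOCK)
    WHERE D.pk = 1;

-- (leave the transaction open)

-- Session 2: default READ COMMITTED isolation level
SELECT D.SomeColumn
FROM dbo.Demo AS D
WHERE D.pk = 1;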
It is frequently useful to generate sequences of values within SQL Server, perhaps for use as surrogate keys. Using the IDENTITY property on a column is the easiest way to automatically generate such sequences:
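A minimal example (the table and column names are illustrative):

CREATE TABLE dbo.Orders
(
    order_id integer IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    order_date datetime NOT NULL
);

INSERT dbo.Orders (order_date) VALUES (GETDATE());   -- order_id = 1
INSERT dbo.Orders (order_date) VALUES (GETDATE());   -- order_id = 2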
Sometimes though, the database designer needs a more flexible scheme than is provided by the IDENTITY property. One alternative is to use a Sequence Table.
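One common shape for such a table (a sketch only; the names are illustrative):

CREATE TABLE dbo.SequenceTable
(
    current_value bigint NOT NULL
);

INSERT dbo.SequenceTable (current_value) VALUES (0);
GO
-- Claim the next value atomically in a single UPDATE statement
DECLARE @next_value bigint;

UPDATE dbo.SequenceTable
SET @next_value = current_value = current_value + 1;

SELECT @next_value AS next_value;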
If you look up Table Hints in the official documentation, you’ll find the following statements:
If a clustered index exists, INDEX(0) forces a clustered index scan and INDEX(1) forces a clustered index scan or seek.
If no clustered index exists, INDEX(0) forces a table scan and INDEX(1) is interpreted as an error.
The interesting thing there is that both hints can result in a scan. If that is the case, you might wonder if there is any effective difference between the two.
This blog entry explores that question, and highlights an optimizer quirk that can result in a much less efficient query plan when using INDEX(0). I’ll also cover some stuff about ordering guarantees.
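For reference, the two hints are written like this (the table name is assumed):

SELECT E.*
FROM dbo.Example AS E WITH (INDEX(0));

SELECT E.*
FROM dbo.Example AS E WITH (INDEX(1));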
A detailed look at costing, and more undocumented optimizer fun.
The SQL Server query optimizer generates a number of physical plan alternatives from a logical requirement expressed in T-SQL. If full cost-based optimization is required, a cost is assigned to each iterator in each alternative plan, and the plan with the lowest overall cost is ultimately selected for execution.
When you write a query to return the first few rows from a potential result set, you’ll often use the TOP clause.
To give the TOP operation a precise meaning, it is normally accompanied by an ORDER BY clause. Together, the TOP…ORDER BY construction identifies precisely which top ‘n’ rows should be returned.
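For example (assuming the AdventureWorks sample database):

-- The ORDER BY gives the TOP (10) a precise meaning
SELECT TOP (10)
    P.ProductID,
    P.Name
FROM Production.Product AS P
ORDER BY P.Name;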
You might recall from Inside the Optimizer: Row Goals In Depth that query plans containing a row goal tend to favour nested loops or sort-free merge join over hashing.
This is because a hash join has to fully process its build input (to populate its hash table) before it can start probing for matches on its other input. Hash join therefore has a high start-up cost, balanced by a lower per-row cost once probing begins.
In this post, we will take a look at how row goals affect grouping operations.
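As a hedged illustration of the kind of query shape involved (AdventureWorks names assumed), the TOP below places a row goal above the grouping operation:

SELECT TOP (5)
    TH.ProductID,
    COUNT_BIG(*) AS transaction_count
FROM Production.TransactionHistory AS TH
GROUP BY TH.ProductID;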
One of the core assumptions made by the SQL Server query optimizer cost model is that clients will eventually consume all the rows produced by a query.
This assumption leads to plans that are optimized for the lowest overall execution cost, even though they may take longer to begin producing rows.
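One hedged example of overriding that assumption (AdventureWorks names assumed): the FAST query hint asks the optimizer to prefer a plan that returns the first rows quickly, even if it costs more overall.

SELECT
    P.ProductID,
    P.Name
FROM Production.Product AS P
ORDER BY P.Name
OPTION (FAST 10);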
From time to time, I encounter a system design that always issues an UPDATE against the database after a user has finished working with a record — without checking to see if any of the data was in fact altered.
The prevailing wisdom seems to be “the database will sort it out”. This raises an interesting question: How smart is SQL Server in these circumstances?
In this post, I’ll look at a generalisation of this problem: What is the impact of updating a column to the value it already contains?
The specific questions I want to answer are:
Does this kind of UPDATE generate any log activity?
Do data pages get marked as dirty (and so eventually get written out to disk)?
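Here is a rough sketch of how one might investigate both questions (table and column names are assumed, and sys.fn_dblog is undocumented and unsupported):

CHECKPOINT;   -- flush dirty pages and (in simple recovery) truncate the log

UPDATE dbo.Demo
SET SomeColumn = SomeColumn   -- the value it already contains
WHERE pk = 1;

-- Question 1: what, if anything, was written to the transaction log?
SELECT [Current LSN], Operation, Context, AllocUnitName
FROM sys.fn_dblog(NULL, NULL);

-- Question 2: are any of the table's pages now dirty in memory?
SELECT COUNT_BIG(*) AS dirty_pages
FROM sys.dm_os_buffer_descriptors AS bd
JOIN sys.allocation_units AS au
    ON au.allocation_unit_id = bd.allocation_unit_id
JOIN sys.partitions AS p
    ON p.hobt_id = au.container_id
WHERE bd.database_id = DB_ID()
    AND bd.is_modified = 1
    AND p.[object_id] = OBJECT_ID(N'dbo.Demo');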
Iterators, Query Plans, and Why They Run Backwards
Iterators
SQL Server uses an extensible architecture for query optimization and execution, using iterators as the basic building blocks.
Iterators are probably most familiar in their graphical showplan representation, where each icon represents a single iterator. They also show up in XML query plan output as RelOp nodes:
Each iterator performs a single simple function, such as applying a filtering condition, or performing an aggregation. It can represent a logical operation, a physical operation, or (most often) both.
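As a small sketch (AdventureWorks names assumed for illustration), you can see the RelOp nodes for yourself by requesting the XML showplan instead of executing the query:

SET SHOWPLAN_XML ON;
GO
-- A query whose plan includes both a filtering condition and an aggregation
SELECT P.Color, COUNT_BIG(*) AS product_count
FROM Production.Product AS P
WHERE P.ListPrice > 0
GROUP BY P.Color;
GO
SET SHOWPLAN_XML OFF;
GO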
The optimizer has pushed the predicate ProductNumber LIKE 'T%' down from a Filter to the Index Scan on the Product table, but it remains as a residual predicate.
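A query of roughly that shape (AdventureWorks assumed; the exact plan depends on the indexes available and on costing) might look like this:

SELECT
    P.ProductID,
    P.Name,
    P.ProductNumber
FROM Production.Product AS P
WHERE P.ProductNumber LIKE 'T%';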