Iterators
SQL Server uses an extensible architecture for query optimization and execution, using iterators as the basic building blocks.
Iterators are probably most familiar in their graphical showplan representation, where each icon represents a single iterator. They also show up in XML query plan output as RelOp
nodes:
Each iterator performs a single simple function, such as applying a filtering condition, or performing an aggregation. It can represent a logical operation, a physical operation, or (most often) both.
For example, Aggregate is a logical operation. Stream Aggregate and Hash Aggregate are physical operations. Similarly, Inner Join is a logical operation, and Nested Loops is a physical one.
Both graphical plans and XML showplan output show these properties:
Logical iterators can be combined in complex ways to satisfy any requirement that can be expressed in SQL. The physical iterators combine to produce an executable tree used by the execution engine to produce results matching the logical query specification.
The logical iterators represent the original query translated into relational algebra. The physical iterators represent a physical algebra; the concrete algorithms used by the SQL Server storage engine and query processor to perform a logical task.
Those of you that followed my series of posts on optimizer internals may remember that implementation rules in the query optimizer convert a logical requirement to one of the possible physical implementations.
Each round of exploration rules expand the search space by applying logically-equivalent transformations, before implementation rules run to introduce physical iterators. Once that’s done, cost-based decisions can be made to select a final plan for execution.
A Common Interface and Design
Iterators are code objects, which share a common interface of methods and properties. The most frequently used of these are the methods used to process rows, set and retrieve property information, and produce a cost estimate for the optimizer.
These common features (and the abstractness of the design) enable easier extensibility as new database and language features are introduced. It also allows iterators to be connected in a tree structure without any of them needing to know much about the other iterator(s) it might be connected to.
The executable code that forms the core of the iterator implementation is also written to avoid hard-wired dependencies on the parent and child iterators in the plan tree where possible. This avoids the need to recompile the iterator base code each time it is incorporated in a new plan. Those of you that have some programming experience will recognise this arrangement as one that uses virtual function pointers.
All iterators require a small fixed amount of memory to store state information (think local variables) while they are executing. This memory forms part of the compiled plan, so a formal memory grant is not required each time a plan is executed. The cached size of a plan includes this memory allocation.
The common methods most relevant to the present discussion are those used to process a set of rows. These are the Open
or Init
, GetNext
, and Close
methods. These names might give you a clue as to why query plan nodes are known as iterators.
Plan Execution
Let’s look at a simple plan (the query itself isn’t important):
Have you ever wondered how a plan like that is executed? Where does it start, and how does it know when it has finished?
Judging from the lines and arrows that connect the iterators in a graphical plan, we might assume that execution starts with the data-producing iterators (the scans) and continues until they run out of rows.
The arrow heads on the connecting lines do indicate the direction of data flow, but the iterator execution order is quite different. Query execution always begins with the root iterator — the hash join in the diagram above.
The execution engine only directly interacts with this root iterator. It calls its Open
method to prepare it to return data, and then repeatedly calls its GetNext
method to retrieve one row of query output at a time.
As soon as the GetNext
method call fails to return a new row, the engine calls the Close
method, and query execution ends. Simple as that.
The Ultimate Cursors
Ok, so there’s a bit more to it than that. Let’s see what happens when the Open
method is called on the hash join iterator:
- The hash join calls the
Open
method on its build input. - The
Open
method of the Scan iterator prepares the storage engine to start producing rows from theProduct
table. Control then returns to the hash join (which is still running itsOpen
method, remember). - The hash join repeatedly calls the
GetRow
method on the Scan iterator. - Each
GetRow
call returns a single row from the Scan to the hash join. These rows are used to build the join hash table. - As soon as a call to
GetRow
on theProduct
table Scan reports that there are no more rows, the hash join calls theClose
method on the Scan. - The Scan iterator performs its clean-up tasks, and control again returns to the hash join.
- The hash join now calls
Open
on its inner input (the probe input). - The
Inventory
table Scan iterator prepares its connection to the storage engine, and returns control to the hash join. - The hash join returns from its
Open
method call (finally!) and control returns to the execution engine.
The execution engine now repeatedly calls GetRow
on the hash join, to return the query results:
- The hash join calls
GetRow
on its probe input. - The Scan’s
GetRow
method returns a single row from theInventory
table. - The hash join performs the join using its hash table.
- If the row does not join, go back to step 1.
- If the row does join, return it to the engine.
This process continues until the probe input Scan reports that no more rows are available. The hash join then also returns from its GetRow
method without producing a row. When that happens, the execution engine knows that the query is complete, and calls the Close
method on the hash join, to allow it to clean up:
- The hash join calls
Close
on its probe input. - The Scan performs its clean-up tasks and returns control to the hash join.
- The hash join deallocates its hash table, and returns from the
Close
method.
This row-by-row processing — calling the GetNext
method repeatedly — is why query plan nodes are referred to as iterators: They iterate over a set of input rows, returning one row at a time to its parent.
Execution of a query plan starts at the root, and flows down the plan via a series of nested method calls. Data is pulled up the plan in the reverse direction: from the leaves toward the root, a row at a time. Query plans do run backward 🙂
Inside an Iterator
Exactly what is done in the Open
, GetNext
, and Close
methods depends on the type of iterator, and the mode in which it is running.
The following table gives a broad outline of the work performed by different iterators when these methods are called:
There is a lot more to say about query plan execution and iterators, so I will return to this topic in a future post.
© Paul White
email: SQLkiwi@gmail.com
twitter: @SQL_Kiwi
Nice post thank you Ashlee
ReplyDeleteNice post thank you Seth
ReplyDelete