About This Blog

Including my content originally published on 𝕏, SQLperformance.com, and SQLblog.com

Wednesday, 28 July 2010

The “Segment Top” Query Optimization

The Segment Top Query Optimization

A question that often comes up on the forums is how to get the first or last row from each group of records in a table. This post describes a clever query plan optimisation that SQL Server can use for these types of query.

As a simple example, based on the AdventureWorks sample database, say we need to find the minimum product quantity stored in each bin in the warehouse. Using the Production.ProductInventory table, we can write a simple aggregate query:

SELECT
    INV.Shelf,
    INV.Bin,
    min_qty = MIN(INV.Quantity)
FROM Production.ProductInventory AS INV
GROUP BY
    INV.Shelf,
    INV.Bin
ORDER BY
    INV.Shelf,
    INV.Bin;

Let’s create a covering index too:

CREATE NONCLUSTERED INDEX nc1
ON Production.ProductInventory (Shelf, Bin, Quantity);

As you might expect, that produces a nice simple query plan:

Simple Query Plan

It also produces the results we were after (just a small sample is shown):

Results

Great! But what if we would like some additional information in the output? You might reasonably want to see some details about the product which has the minimum quantity found - maybe the ProductID and LocationID. We can’t just add the extra columns into the query; if we did, SQL Server would complain with an error like:

Column ‘Production.ProductInventory.ProductID’ is invalid in the
select list because it is not contained in either an aggregate
function or the GROUP BY clause.

If we try to work around that by adding the extra columns to the GROUP BY clause, the query runs, but produces the wrong results.

Perhaps we can take the query results and join them back to the source table? We do not need to use a temporary table here, we can use a Common Table Expression (CTE):

WITH OriginalQuery AS
(
    SELECT  
        INV.Shelf,
        INV.Bin,
        min_qty = MIN(INV.Quantity)
    FROM Production.ProductInventory AS INV
    GROUP BY
        INV.Shelf,
        INV.Bin
)
SELECT
    INV.ProductID,
    INV.LocationID,
    INV.Shelf,
    INV.Bin,
    INV.Quantity
FROM OriginalQuery AS OQ
JOIN Production.ProductInventory AS INV
    ON INV.Shelf = OQ.Shelf
    AND INV.Bin = OQ.Bin
    AND INV.Quantity = OQ.min_qty
ORDER BY
    INV.Shelf,
    INV.Bin;

That’s quite a lot more T-SQL code, so we might be expecting a complex query plan, and probably not a very efficient one either. After all, we introduced a join on a subquery containing an aggregate.

This is the entire query plan:
Segment Top plan

Somehow, the query optimizer converted all that code to just three query plan iterators: an Index Scan, a Segment, and a Top. How does that work?

T-SQL is, for the most part, a declarative language. It provides a way for the query writer to logically describe the results required; it is up to the optimizer to produce an efficient physical plan that the Query Processor can execute.

The optimizer is smart enough, on this occasion, to work out what we are logically trying to achieve with all that code: We want to split the table into groups, and return some information from the first row in every group.

The Index Scan produces rows in sorted order (by Shelf and then by Bin). The Segment iterator (covered in depth in my next post) detects when the beginning of a new group arrives from the index scan, and passes that information on to the Top iterator. The Top iterator then just returns the required columns from the first row in each group - easy!

I refer to this optimizer simplification as “Segment Top”, because of the way those two iterators co-operate to get the work done. The proper name for the transformation is “Generate Group By Apply – Simple” (GenGbApplySimple), but I know which one I prefer.

Impressively, the estimated execution cost for the original query was about 0.007. For the “Segment Top” plan, the estimated cost is about 0.006 - slightly less! We added heaps of extra T-SQL, and ended up with a ‘better’ plan. I’m comparing estimated costs here because those are calculated based on the information available to the optimizer. All things being equal, a plan with a lower estimated cost will be preferred.

There is more than one way to write this query to produce the exact same “Segment Top” plan. This may not surprise you, since it is frequently possible to express the same logical requirement with quite different T-SQL syntax. The following code illustrates this point:

SELECT
    INV1.ProductID,
    INV1.LocationID,
    INV1.Shelf,
    INV1.Bin,
    INV1.Quantity
FROM Production.ProductInventory AS INV1
WHERE
    INV1.Quantity = 
    (
        -- Correlated to the outer
        -- query on Shelf and Bin
        SELECT
            min_qty = MIN(INV2.Quantity)
        FROM Production.ProductInventory AS INV2
        WHERE
            INV2.Shelf = INV1.Shelf
            AND INV2.Bin = INV1.Bin
    )
ORDER BY
    INV1.Shelf,
    INV1.Bin,
    INV1.Quantity;

Astute readers might have noticed a potential problem there. What if there is more than one product in a particular bin which has the minimum quantity? Say if a particular bin contains two spanners, two screwdrivers, and four hammers. Surely the Top iterator in the Segment Top plan will only produce one row per bin, where it should return both spanners and screwdrivers?

The answer is that the Top iterator runs in WITH TIES mode. If you hover the mouse over the Top iterator in the graphical query plan, you will see that it has a ‘Tie Columns’ argument, over the Shelf and Bin columns. In this mode, if the Segment iterator indicates that more than one row ties for first place, the Top will return all of them - so both spanners and screwdrivers would be returned.

Some query designers might prefer to write a query using a ranking function like ROW_NUMBER; however, because we should return all rows that tie for first place, we have to be careful to use DENSE_RANK instead:

WITH Ranked AS
(
    -- Add the ranking column
    SELECT
        *,
        rn = DENSE_RANK() OVER (
            PARTITION BY Shelf, Bin 
            ORDER BY Quantity)
    FROM Production.ProductInventory AS INV
)
SELECT
    RK.ProductID,
    RK.LocationID,
    RK.Shelf,
    RK.Bin,
    RK.Quantity
FROM Ranked AS RK
    --      We want the first row(s)
    --      in every group
    WHERE
        RK.rn = 1
    ORDER BY
        RK.Shelf,
        RK.Bin,
        RK.Quantity;

That produces the following query plan, with an estimated cost of 0.0065 - slightly more than the “Segment Top”, but still less than the original query that did not produce the extra columns we need.

Rank execution plan

There are two Segment iterators in that plan, and I will explain why in my next post.

One final alternative solution uses the APPLY operator (see my article on SQL Server Central). The idea here is to explicitly find the first row(s) for each unique combination of Shelf and Bin:

WITH Groups AS
(
    -- Find the distinct combinations of
    -- Shelf and Bin in the table
    SELECT DISTINCT
        INV1.Shelf,
        INV1.Bin
    FROM Production.ProductInventory AS INV1
)
SELECT
    iTVF.ProductID,
    iTVF.LocationID,
    iTVF.Shelf,
    iTVF.Bin,
    iTVF.Quantity
FROM Groups
CROSS APPLY
(
    -- Find the first row(s)
    -- for each group
    SELECT TOP (1) WITH TIES
        INV2.*
    FROM Production.ProductInventory AS INV2
    WHERE
        INV2.Shelf = Groups.Shelf
        AND INV2.Bin = Groups.Bin
    ORDER BY
        INV2.Quantity ASC
) AS iTVF
ORDER BY
    iTVF.Shelf,
    iTVF.Bin,
    iTVF.Quantity;

This is the query plan:

APPLY execution plan

The estimated cost for this plan is 0.084 - much worse than the other methods, primarily because the join cannot be eliminated in this case. Nevertheless, the APPLY plan might be the optimal choice if a list of distinct bins and shelves is available from another table (eliminating the DISTINCT) and if there are relatively few shelf and bin combinations (so limiting the number of index seeks).

The transformation to “Segment Top” is not always the optimal strategy, and there’s currently no way to directly request it in a T-SQL query. Nevertheless, it is a useful, interesting, and efficient optimizer simplification.

© Paul White
email: SQLkiwi@gmail.com
twitter: @SQL_Kiwi

No comments:

Post a Comment

All comments are reviewed before publication.