About This Blog

Including my content originally published on 𝕏, SQLperformance.com, and SQLblog.com

Wednesday, 20 March 2013

The Problem with Window Functions and Views

The Problem with Window Functions and Views

Introduction

Since their introduction in SQL Server 2005, window functions like ROW_NUMBER and RANK have proven to be extremely useful in solving a wide variety of common T-SQL problems. In an attempt to generalize such solutions, database designers often look to incorporate them into views to promote code encapsulation and reuse.

Unfortunately, a limitation in the SQL Server query optimizer often means that views1 containing window functions do not perform as well as expected. This post works through an illustrative example of the problem, details the reasons, and provides a number of workarounds.

Note: The limitation described here was first fixed in SQL Server 2017 CU 30. Optimizer fixes must be enabled using trace flag 4199 or the database scoped configuration option. The fix is standard behaviour without optimizer hotfixes under compatibility level 160 (SQL Server 2022).

Window functions

Window functions are distinguished by the presence of an OVER() clause and come in three varieties:

  1. Ranking window functions
    • ROW_NUMBER
    • RANK
    • DENSE_RANK
    • NTILE
  2. Aggregate window functions
    • MIN, MAX, AVG, SUM
    • COUNT, COUNT_BIG
    • CHECKSUM_AGG
    • STDEV, STDEVP, VAR, VARP
  3. Analytic window functions
    • LAG, LEAD
    • FIRST_VALUE, LAST_VALUE
    • PERCENT_RANK, PERCENTILE_CONT, PERCENTILE_DISC
    • CUME_DIST

The ranking and aggregate window functions were introduced in SQL Server 2005, and extended in SQL Server 2012. The analytic window functions were new in SQL Server 2012.

All of the window functions listed above are susceptible to the optimizer limitation detailed in this article.

Example

Using the AdventureWorks sample database, the task at hand is to write a query that returns all product number 878 transactions that occurred on the most recent date available.

There are all sorts of ways to express this requirement in T-SQL, but we will choose to write a query using a windowing function. The first step is to find transaction records for product 878 and rank them in date order descending:

SELECT
    TH.TransactionID,
    TH.ReferenceOrderID,
    TH.TransactionDate,
    TH.Quantity,
    rnk = RANK() OVER (
        ORDER BY TH.TransactionDate DESC)
FROM Production.TransactionHistory AS TH
WHERE
    TH.ProductID = 878
ORDER BY
    rnk;

Sample output for product 878

Original execution plan

The results of the query are as expected, showing the six transactions occurring on the most recent date available. The execution plan contains a warning triangle, alerting us to a missing index:

Missing index

As usual for missing index suggestions, we need to remember that the recommendation is not the result of a thorough analysis of the query. It is an indication that we need to think a bit about how this query accesses the data it needs:

The suggested index would certainly be more efficient than scanning the table completely. It would allow an index seek to the particular product we are interested in, and cover all the columns needed. It would not however avoid the sort (by TransactionDate descending).

An ideal index for this query would allow a seek on ProductID, return the selected records in reverse TransactionDate order, and cover the other returned columns:

CREATE NONCLUSTERED INDEX ix
ON Production.TransactionHistory
    (ProductID ASC, TransactionDate DESC)
INCLUDE 
    (ReferenceOrderID, Quantity);

With that index in place, the execution plan is much more efficient. The clustered index scan has been replaced by a seek (with range scan). An explicit sort is no longer necessary:

Optimal execution plan

The final step for this query is to limit the results to just those rows that rank highest. We cannot filter directly in the WHERE clause of our query because window functions may only appear in the SELECT and ORDER BY clauses.

We can workaround this restriction using a derived table, common table expression, function, or view. On this occasion, we will use a common table expression (aka an inline view):

WITH
    RankedTransactions AS
    (
        SELECT
            TH.TransactionID,
            TH.ReferenceOrderID,
            TH.TransactionDate,
            TH.Quantity,
            rnk = RANK() OVER (
                ORDER BY TH.TransactionDate DESC)
        FROM Production.TransactionHistory AS TH
        WHERE
            TH.ProductID = 878
    )
SELECT
    RT.TransactionID,
    RT.ReferenceOrderID,
    RT.TransactionDate,
    RT.Quantity
FROM RankedTransactions AS RT
WHERE
    RT.rnk = 1;

The execution plan is the same as before, with an extra Filter to return only rows ranked #1:

Rank 1 query plan

Filter properties

The query returns the six top-ranked rows we expect:

Six row query result

Generalizing the query

It turns out that our query is very useful, so the decision is taken to generalize it and store the definition in a view. For this to work for any product, we need to do two things: return the ProductID from the view, and partition the ranking function by product:

CREATE VIEW dbo.MostRecentTransactionsPerProduct
WITH SCHEMABINDING
AS
SELECT
    SQ1.ProductID,
    SQ1.TransactionID,
    SQ1.ReferenceOrderID,
    SQ1.TransactionDate,
    SQ1.Quantity
FROM 
(
    SELECT
        TH.ProductID,
        TH.TransactionID,
        TH.ReferenceOrderID,
        TH.TransactionDate,
        TH.Quantity,
        rnk = RANK() OVER (
            PARTITION BY TH.ProductID
            ORDER BY TH.TransactionDate DESC)
    FROM Production.TransactionHistory AS TH
) AS SQ1
WHERE
    SQ1.rnk = 1;

Selecting all the rows from the view results in the following execution plan and correct results (for all products):

Full view query plan

View sample results

We can now find the most recent transactions for product 878 with a much simpler query than before:

SELECT
    MRT.ProductID,
    MRT.TransactionID,
    MRT.ReferenceOrderID,
    MRT.TransactionDate,
    MRT.Quantity
FROM dbo.MostRecentTransactionsPerProduct AS MRT
WHERE
    MRT.ProductID = 878;

Our expectation is that the execution plan for this new query will be exactly the same as before we created the view. The query optimizer should be able to push the product filter specified in the WHERE clause down into the view, resulting in an index seek.

We need to stop and think a bit at this point, however. The query optimizer can only produce execution plans that are guaranteed to produce the same results as the logical query specification. Is it safe to push our WHERE clause into the view?

The answer is yes, so long as the column we are filtering on appears in the PARTITION BY clause of the window function in the view. The reasoning is that eliminating complete groups (partitions) from the window function will not affect the ranking of rows returned by the query.

The question is, does the SQL Server query optimizer know this? The answer depends on which version of SQL Server we are running.

SQL Server 2005 execution plan

SQL Server 2005 execution plan

A look at the Filter properties in this plan shows it applying two predicates:

Filter properties

The ProductID = 878 predicate has not been pushed down into the view, resulting in a plan that scans our index. It ranks every row in the table before filtering for product 878 and rows ranked 1.

The SQL Server 2005 query optimizer cannot push suitable predicates past a window function in a lower query scope (view, common table expression, in-line function or derived table). This limitation applies to all SQL Server 2005 builds.

SQL Server 2008+ execution plan

This is the execution plan for the same query on SQL Server 2008 or later:

SQL Server 2008 execution plan

The ProductID predicate has been successfully pushed past the ranking operators, replacing the index scan with an index seek.

The SQL Server 2008 query optimizer includes a new simplification rule SelOnSeqPrj (select on sequence project) that is able to push safe outer-scope predicates past window functions.

To produce the less efficient plan for this query in SQL Server 2008 or later, we have to temporarily disable this query optimizer rule:

SELECT
    MRT.ProductID,
    MRT.TransactionID,
    MRT.ReferenceOrderID,
    MRT.TransactionDate,
    MRT.Quantity
FROM dbo.MostRecentTransactionsPerProduct AS MRT
WHERE
    MRT.ProductID = 878
OPTION (QUERYRULEOFF SelOnSeqPrj); -- NEW!

Plan with SelOnSeqPrj rule disabled

Limitations

Unfortunately, the SelOnSeqPrj simplification rule only works when the predicate performs a comparison with a constant literal.

For that reason, the following query produces the sub-optimal plan on SQL Server 2008 and later (including SQL Server 2019) because it uses a variable:

DECLARE 
    @ProductID integer = 878;

SELECT
    MRT.ProductID,
    MRT.TransactionID,
    MRT.ReferenceOrderID,
    MRT.TransactionDate,
    MRT.Quantity
FROM dbo.MostRecentTransactionsPerProduct AS MRT
WHERE
    MRT.ProductID = @ProductID;

Execution plan with variable

Parameterization

The problem can still occur even where the predicate uses a constant value. SQL Server may decide to auto-parameterize trivial queries (one for which an obvious best plan exists). If auto-parameterization is successful, the optimizer sees a parameter instead of a constant, and the SelOnSeqPrj rule is not applied.

For queries where auto-parameterization is not attempted (or where it is determined to be unsafe), the optimization may still fail, if the database option for FORCED PARAMETERIZATION is on.

Our test query (with the constant value 878) is not judged safe for auto-parameterization, but the forced parameterization setting overrides this, resulting in the inefficient plan:

ALTER DATABASE CURRENT
SET PARAMETERIZATION FORCED;
GO
SELECT
    MRT.ProductID,
    MRT.TransactionID,
    MRT.ReferenceOrderID,
    MRT.TransactionDate,
    MRT.Quantity
FROM dbo.MostRecentTransactionsPerProduct AS MRT
WHERE
    MRT.ProductID = 878;
GO
ALTER DATABASE CURRENT
SET PARAMETERIZATION SIMPLE;

Forced parameterization plan

SQL Server 2008+ workaround

To allow the optimizer to ‘see’ a constant value for query that references a local variable or parameter we can add an OPTION (RECOMPILE) query hint:

DECLARE 
    @ProductID integer = 878;

SELECT
    MRT.ProductID,
    MRT.TransactionID,
    MRT.ReferenceOrderID,
    MRT.TransactionDate,
    MRT.Quantity
FROM dbo.MostRecentTransactionsPerProduct AS MRT
WHERE
    MRT.ProductID = @ProductID
OPTION (RECOMPILE);

Note: The pre-execution (‘estimated’) execution plan still shows an index scan because the value of the variable is not actually set yet. When the query is executed, the (post-execution) plan shows the desired index seek:

Post-execution plan

The SelOnSeqPrj rule does not exist in SQL Server 2005, so OPTION (RECOMPILE) cannot help there. In case you are wondering, the OPTION (RECOMPILE) workaround results in a seek even if the database option for forced parameterization is on.

All versions workaround #1

In some situations, it is possible to replace the problematic view, common table expression, or derived table with a parameterized inline table-valued function:

CREATE FUNCTION dbo.MostRecentTransactionsForProduct
    (@ProductID integer)  
RETURNS TABLE
WITH SCHEMABINDING AS
RETURN
    SELECT
        SQ1.ProductID,
        SQ1.TransactionID,
        SQ1.ReferenceOrderID,
        SQ1.TransactionDate,
        SQ1.Quantity
    FROM 
    (
        SELECT
            TH.ProductID,
            TH.TransactionID,
            TH.ReferenceOrderID,
            TH.TransactionDate,
            TH.Quantity,
            rnk = RANK() OVER (
                PARTITION BY TH.ProductID
                ORDER BY TH.TransactionDate DESC)
        FROM Production.TransactionHistory AS TH
        WHERE
            TH.ProductID = @ProductID -- Pushed predicate
    ) AS SQ1
    WHERE
        SQ1.rnk = 1;

This function explicitly places the ProductID predicate in the same scope as the window function, avoiding the optimizer limitation. Written to use the in-line function, our example query becomes:

SELECT
    MRT.ProductID,
    MRT.TransactionID,
    MRT.ReferenceOrderID,
    MRT.TransactionDate,
    MRT.Quantity
FROM dbo.MostRecentTransactionsForProduct(878) AS MRT;

This produces the desired index seek plan on all versions of SQL Server that support window functions. This workaround produces a seek even where the predicate references a parameter or local variable — an OPTION (RECOMPILE) is not required.

The function body could be simplified to remove the now-redundant PARTITION BY clause, and to no longer return the ProductID column. I left the definition the same as the view it replaced to more clearly illustrate the cause of the execution plan differences.

All versions workaround #2

The second workaround only applies to ranking window functions that are filtered to return rows numbered or ranked #1 (using ROW_NUMBER, RANK, or DENSE_RANK). This is a very common usage, so it is worth mentioning.

An additional benefit is that this workaround can produce plans that are even more efficient than the index seek plans seen previously. As a reminder, the previous best plan looked like this:

Previous best execution plan

That execution plan ranks 1,918 rows even though it ultimately returns only 6. We can improve this by using the window function in an ORDER BY clause instead of ranking rows and then filtering for rank #1:

SELECT TOP (1) WITH TIES
    TH.TransactionID,
    TH.ReferenceOrderID,
    TH.TransactionDate,
    TH.Quantity
FROM Production.TransactionHistory AS TH
WHERE
    TH.ProductID = 878
ORDER BY
    RANK() OVER (
        ORDER BY TH.TransactionDate DESC);

Improved execution plan

That query nicely illustrates the use of a window function in the ORDER BY clause, but we can do even better, eliminating the window function completely:

SELECT TOP (1) WITH TIES
    TH.TransactionID,
    TH.ReferenceOrderID,
    TH.TransactionDate,
    TH.Quantity
FROM Production.TransactionHistory AS TH
WHERE
    TH.ProductID = 878
ORDER BY
    TH.TransactionDate DESC;

Top Seek Plan

This plan reads only 7 rows from the table to return the same 6-row result set. Why 7 rows? The Top operator is running in WITH TIES mode:

Top properties

The plan continues to request one row at a time from its subtree until the TransactionDate changes. The seventh row is required for the Top operator to be sure that no more tied-value rows will be seen.

We can extend the logic of the query above to replace the problematic view definition:

ALTER VIEW dbo.MostRecentTransactionsPerProduct
WITH SCHEMABINDING
AS
SELECT
    P.ProductID,
    Ranked1.TransactionID,
    Ranked1.ReferenceOrderID,
    Ranked1.TransactionDate,
    Ranked1.Quantity
FROM
    -- List of product IDs
    (SELECT ProductID FROM Production.Product) AS P
CROSS APPLY
(
    -- Returns rank #1 results for each product ID
    SELECT TOP (1) WITH TIES
        TH.TransactionID,
        TH.ReferenceOrderID,
        TH.TransactionDate,
        TH.Quantity
    FROM Production.TransactionHistory AS TH
    WHERE
        TH.ProductID = P.ProductID
    ORDER BY
        TH.TransactionDate DESC
) AS Ranked1;

The view now uses CROSS APPLY to combine the results of our optimized ORDER BY query for each product. The test query is unchanged:

DECLARE 
    @ProductID integer = 878;

SELECT
    MRT.ProductID,
    MRT.TransactionID,
    MRT.ReferenceOrderID,
    MRT.TransactionDate,
    MRT.Quantity
FROM dbo.MostRecentTransactionsPerProduct AS MRT
WHERE
    MRT.ProductID = @ProductID;

Both pre- and post-execution plans show an index seek without needing an OPTION (RECOMPILE) query hint. The following is a post-execution (‘actual’) plan:

APPLY execution plan

If the view had used ROW_NUMBER instead of RANK, the replacement view would simply have omitted the WITH TIES clause on the TOP (1). The new view could also be written as a parameterized inline table-valued function.

One could argue that the original index seek plan with the rnk = 1 predicate could also be optimized to only test 7 rows. After all, the optimizer should know that rankings are produced by the Sequence Project operator in strict ascending order, so execution could end as soon as a row with a rank greater than one is seen. The optimizer does not contain this logic today.

Final Thoughts

People are sometimes disappointed by the performance of views that incorporate window functions. The reason can often be traced back to the optimizer limitation described in this post. Other times, it may be because the view designer did not appreciate that predicates applied to the view must appear in the PARTITION BY clause to be safely pushed down.

I want to emphasise that this limitation does not just apply to views, and neither is it limited to ROW_NUMBER, RANK, and DENSE_RANK. You should be aware of this limitation when using an OVER clause in a view, common table expression, derived table, or in-line table-valued function.

SQL Server 2005 users that encounter this issue are faced with the choice of rewriting the view as a parameterized inline table-valued function, or using the APPLY technique (where applicable).

Users of SQL Server 2008 or later have the extra option of using an OPTION (RECOMPILE) query hint, if the issue can be solved by allowing the optimizer to see a constant instead of a variable or parameter reference. Remember to check post-execution plans when using this hint — the pre-execution plan cannot generally show the optimal plan.


  1. This problem can also occur with derived tables, common table expressions and in-line functions. I see it most often with views because they are intentionally written to be more generic. ↩︎

1 comment:

  1. This Problem was fixed these days with SQL Server 2017 CU 30!
    Regards, Eyck

    ReplyDelete

All comments are reviewed before publication.