How Query Revision Can Impact Query Performance

When we see a query performing badly, our first reaction is to check the hardware and network resources for issues. If those are fine, we move on to index tuning to improve query performance. But one thing that every SQL DBA and developer must consider is the query itself: is it written properly, and can it be rewritten some other way so that it uses an optimal index and performs a seek instead of a full table scan? Only after making sure the query is written optimally should we turn to index tuning.

Today I will talk about how query revision impacts query performance. For demonstration purposes I need some sample data, so please run the code below to create a sample database with the tables and indexes used throughout this post.

NOTE: This code should be executed only on test servers.

IF DB_ID('sqlzealot') IS NOT NULL
    DROP DATABASE sqlzealot;
GO

CREATE DATABASE sqlzealot;
GO

USE sqlzealot;
GO

SET NOCOUNT ON;

IF OBJECT_ID('dbo.Nums', 'U') IS NOT NULL
    DROP TABLE dbo.Nums;

CREATE TABLE dbo.Nums(n INT NOT NULL PRIMARY KEY);

DECLARE @max AS INT, @rc AS INT;
SET @max = 1000000;
SET @rc = 1;

INSERT INTO dbo.Nums(n) VALUES(1);

WHILE @rc * 2 <= @max
BEGIN
    INSERT INTO dbo.Nums(n) SELECT n + @rc FROM dbo.Nums;
    SET @rc = @rc * 2;
END

INSERT INTO dbo.Nums(n)
    SELECT n + @rc FROM dbo.Nums WHERE n + @rc <= @max;
GO

IF OBJECT_ID('dbo.Orders', 'U') IS NOT NULL
    DROP TABLE dbo.Orders;
GO

IF OBJECT_ID('dbo.Customers', 'U') IS NOT NULL
    DROP TABLE dbo.Customers;
GO

CREATE TABLE dbo.Customers
(
    custid   CHAR(1)     NOT NULL,
    custname VARCHAR(20) NOT NULL
);
GO

CREATE TABLE dbo.Orders
(
    orderid   INT      NOT NULL,
    custid    CHAR(1)  NOT NULL,
    orderdate DATETIME NOT NULL
);
GO

DECLARE @customerCount INT = 5;
DECLARE @OrderCount INT = 1000000;
DECLARE @numyears AS INT = 4;
DECLARE @startdate AS DATETIME = '20050101';

INSERT INTO dbo.Customers(custid, custname)
    SELECT custid, N'Customer_' + custid AS custname
    FROM (SELECT CHAR(ASCII('A') - 2 + 2 * n) AS custid
          FROM dbo.Nums
          WHERE n <= @customerCount) AS D;

INSERT INTO dbo.Orders(orderid, custid, orderdate)
    SELECT n AS orderid,
           CHAR(ASCII('A') - 2
                + 2 * (1 + ABS(CHECKSUM(NEWID())) % @customerCount)) AS custid,
           DATEADD(day, n / (@OrderCount / (@numyears * 365.25)), @startdate)
           -- late arrival with earlier date
           - CASE WHEN n % 10 = 0
                  THEN 1 + ABS(CHECKSUM(NEWID())) % 30
                  ELSE 0
             END AS orderdate
    FROM dbo.Nums
    WHERE n <= @OrderCount
    ORDER BY CHECKSUM(NEWID());
GO

CREATE CLUSTERED INDEX idx_cl_od ON dbo.Orders(orderdate);

ALTER TABLE dbo.Orders ADD
    CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED(orderid);

 

 

In this code I created the sqlzealot database. The dbo.Nums table holds 1,000,000 contiguous numbers from 1 to 1,000,000. I then created the dbo.Customers and dbo.Orders tables, populated five customers using ASCII values of letters, and populated 1,000,000 order rows in dbo.Orders from the numbers in Nums. I used ABS(CHECKSUM(NEWID())) to assign random custid values across the five distinct customers, and spread the orderdate values from the start date across four years. Finally, I created the clustered index idx_cl_od on the orderdate column and a nonclustered primary key on the orderid column of dbo.Orders.
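Before moving on, you may want to sanity-check the sample data. The expected counts below follow from the setup code above:

```sql
-- Quick sanity checks on the sample data (counts per the setup above)
SELECT COUNT(*) AS num_rows FROM dbo.Nums;           -- expect 1000000
SELECT COUNT(*) AS customer_rows FROM dbo.Customers; -- expect 5 at this point
SELECT COUNT(*) AS order_rows FROM dbo.Orders;       -- expect 1000000

SELECT MIN(orderdate) AS first_order,
       MAX(orderdate) AS last_order
FROM dbo.Orders; -- roughly a four-year range starting near 20050101
```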

What is your query?

The request is to return customers that placed orders before 2004 but have no orders from 2004 onward. That is, a qualifying customer is one for whom you cannot find an order on or after January 1, 2004. You don't care about customers who have placed no orders at all.

At present our dbo.Orders table has no custid with an order placed before 2004. To create some, run the following code to add a few customers to the Customers table and a few orders to the Orders table:

INSERT INTO dbo.Customers(custid, custname) VALUES
    ('B', 'Customer_B'),
    ('D', 'Customer_D'),
    ('F', 'Customer_F'),
    ('H', 'Customer_H'),
    ('X', 'Customer_X'),
    ('Y', 'Customer_Y'),
    ('Z', 'Customer_Z');

INSERT INTO dbo.Orders(orderid, custid, orderdate) VALUES
    (1000001, 'B', '20030101'),
    (1000002, 'D', '20030101'),
    (1000003, 'F', '20030101'),
    (1000004, 'H', '20030101');

 

You're supposed to get customer IDs B, D, F, and H in the result. These are the only customers that were active at some point but not as of 2004.

Here we want to search for the maximum orderdate value for each custid, so naturally the optimal index is a nonclustered covering index with custid and orderdate as the key columns, in that order:

CREATE NONCLUSTERED INDEX idx_nc_cid_od

ON dbo.Orders(custid, orderdate);

 

The first solution that comes to mind is the natural GROUP BY query that many programmers would come up with:

Solution 1:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT custid
FROM dbo.Orders
GROUP BY custid
HAVING MAX(orderdate) < '20040101';

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

 

Here I used SET STATISTICS IO and SET STATISTICS TIME to get the I/O scan details and the CPU and elapsed times of the query. Press Ctrl+M to include the actual execution plan before running it.

This query ran for about half a second on my computer. The optimizer produced the execution plan shown in Figure 1 for this query.


Figure 1

The plan shows that our covering index was scanned in full, in order. The Stream Aggregate operator isolated the maximum orderdate for each custid, and the Filter operator then kept only the customers whose maximum orderdate was before '20040101'.

Here are the vital performance measures I got for this query:

Logical reads: 2483

CPU time: 484 ms

Elapsed time: 475 ms

Of course, this solution is a big improvement over a cursor-based one in terms of both performance and code readability and maintenance. However, a run time of close to half a second for such a query might not be satisfactory. Keep in mind that an Orders table in some production environments can contain far more than one million rows.

Now we need to figure out whether this query has potential for optimization. Remember that in the execution plan for the last query, the leaf level of the index was fully scanned to obtain the latest orderdate for each customer; that scan required 2483 page reads. Our Customers table contains 12 customers. In our index, the rows are sorted by custid and orderdate, which means the last row in each custid group contains the latest orderdate that you want to inspect.

Of course, if you request the latest orderdate for a particular customer, the optimizer can seek directly to that customer's last row in the index. Such a seek costs three reads in our case. The plan can then apply a Top operator that goes one step backward, returning the desired value (the latest orderdate for the given customer) to a Stream Aggregate operator. The following query demonstrates acquiring the latest orderdate for a particular customer, producing the execution plan shown in Figure 2:

SELECT MAX(orderdate) FROM dbo.Orders WHERE custid = 'A';


Figure 2

This plan incurs only three logical reads. Now, if you do the math for 12 customers, you will realize that you can potentially obtain the desired result with substantially less I/O than 2483 reads.
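The arithmetic behind that claim is simple: one seek per customer at roughly three logical reads each. All of the numbers here come from the measurements above:

```sql
-- Back-of-the-envelope estimate: 12 customers x ~3 logical reads per seek
SELECT 12 * 3 AS estimated_reads; -- about 36, versus 2483 for the full scan
```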

Realizing that what you're after is a seek operation per customer, you might come up with the following attempt as a step toward the solution (prior to filtering):

SELECT custid,
       (SELECT MAX(orderdate)
        FROM dbo.Orders AS O
        WHERE O.custid = C.custid) AS maxod
FROM dbo.Customers AS C;

 

You query the Customers table, and for each customer, a subquery acquires the latest orderdate value (aliased as maxod).

But strangely enough, you get the plan shown in Figure 3, which looks surprisingly similar to the previous one: a full ordered scan of the index on the Orders table is used to calculate the MAX aggregate.


Figure 3

You may have expected the optimizer to first scan the 12 customers from the Customers table and then use a loop that, for each customer, applies a seek in the index to pull that customer's maximum orderdate. Instead, this query fell victim to an attempt by the optimizer to improve performance that in practice ended up hurting it: the optimizer unnested the correlated subquery, converting it internally to a join. The optimizer applies such rearrangements because the join form tends to be optimized better. This query incurred 2483 logical reads against the Orders table and ran for close to one second on my computer. It seems the optimizer got too sophisticated this time.

The optimizer pulled a trick on you; now pull your best trick. One attempt, before considering a complete rewrite of the solution, is to use a logically equivalent query with the TOP option instead of MAX. The reasoning behind this trick is that, from observation of many plans, the optimizer does not appear to unnest subqueries when you use TOP.

You issue the following query:

SELECT custid,
       (SELECT TOP (1) orderdate
        FROM dbo.Orders AS O
        WHERE O.custid = C.custid
        ORDER BY orderdate DESC) AS maxod
FROM dbo.Customers AS C;

 

Now you see the plan you wished for, shown in Figure 4.


Figure 4

The Customers table is scanned, and for each of the 12 customers a Nested Loops operator invokes an activity similar to the one you got when querying a particular customer. This plan incurs only 2 logical reads against Customers and 36 logical reads against Orders. The net CPU time is not even measurable with STATISTICS TIME (it shows up as 0), and I got about 100 milliseconds of elapsed time. You can now slightly revise the code to move the subquery into the WHERE clause and keep only the customers whose maximum order date is before 2004, like so (call it set-based solution 2):

Solution 2:

SELECT custid
FROM dbo.Customers AS C
WHERE
    (SELECT TOP (1) orderdate
     FROM dbo.Orders AS O
     WHERE O.custid = C.custid
     ORDER BY orderdate DESC) < '20040101';

 

Here are the vital performance measures I got for this query:

Logical reads: 36

CPU time: 0 ms

Elapsed time: 1 ms

The plan is very similar to the one you got prior to filtering, but with an additional Filter operator:


Figure 5

There are other solutions you can get by making slight revisions (logically meaningless ones, mind you) to the MAX version of the solution:

Solution 3:

SELECT custid
FROM dbo.Customers AS S
WHERE
    (SELECT DISTINCT MAX(orderdate)
     FROM dbo.Orders AS O
     WHERE O.custid = S.custid) < '20040101';

Solution 4:

SELECT custid
FROM dbo.Customers AS S
WHERE
    (SELECT TOP (1) MAX(orderdate)
     FROM dbo.Orders AS O
     WHERE O.custid = S.custid) < '20040101';

 

 

In both cases you get the more efficient plan that first scans the 12 customers and, in a loop, pulls the maximum order date with a seek against the index on the Orders table.

In short, I'd be reluctant to rely on any of the preceding variations simply because such slight revisions have such a big impact on how the query is optimized. These are unstable solutions: they happen to work well with the current version of SQL Server, but we can't say for sure that they will be optimized as well in future versions. I'd keep looking for alternatives that are more stable.

If you look hard enough, you will find this one (call it set-based solution 5):

Solution 5:

SELECT custid
FROM dbo.Customers AS S
WHERE NOT EXISTS
    (SELECT * FROM dbo.Orders AS O
     WHERE O.custid = S.custid
     AND O.orderdate >= '20040101')
  AND EXISTS
    (SELECT * FROM dbo.Orders AS O
     WHERE O.custid = S.custid);

 

 

This solution is natural and in fact is quite a literal translation of the English phrasing of the request.

You query the Customers table and filter customers for whom you cannot find an order on or after '20040101' but for whom you can find at least one order. You get the plan shown in Figure 6.


Figure 6

The Customers table is scanned, yielding 12 rows. For each customer, a Nested Loops operator invokes a seek against our covering index to check whether an orderdate of '20040101' or later exists for the customer. If the answer is no, another seek checks whether the customer has any order at all. The I/O cost against the Orders table is 59 reads, slightly higher than the previous solution. However, in terms of simplicity and naturalness, this solution (solution 5) wins big time, so that is the one I would go with.
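Once you are done experimenting, you can drop the sample database (as noted at the start, run this on test servers only):

```sql
-- Remove the demo database when finished (test servers only)
USE master;
GO
IF DB_ID('sqlzealot') IS NOT NULL
    DROP DATABASE sqlzealot;
GO
```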

By now you have probably realized that index tuning alone is not enough; you can do much with the way you write your queries. I hope you liked this post. Please share your thoughts and views in the comment section.

 


About Ashish Jain

I am Ashish Jain, a software engineer by profession. My goal of creating this blog is to share my knowledge of SQL server with all other SQL enthusiasts and also to learn from them.
This entry was posted in Advanced SQL.
