Multi-column LAG code is verbose

Kristen · March 8, 2017, 10:26am

I am having a problem because my LAG code, which needs to check multiple columns for a match (against previous row) is so verbose to write (and that worries me that during code DEV or Maintenance there will be a small typo and therefore Bug / Cockup).

However, compared to using a #TEMP table with an IDENTITY and LEFT OUTER JOIN #TEMP to find the ID-1 row, or using a CTE with ROW_NUMBER() and doing a similar LEFT OUTER JOIN, the Query Plan looks to be much MUCH more efficient.

So is there a more concise way to code this?

CREATE TABLE dbo.TEMP_LAG_Test
(
	InvoiceNo 	INT NOT NULL
	, SubInvoiceNo	INT NOT NULL
	, MatterID	VARCHAR(20) NOT NULL
	, ChargeAmount	MONEY NOT NULL
	, PRIMARY KEY
	(
		InvoiceNo
		, SubInvoiceNo
		, MatterID
		, ChargeAmount
	)
)

INSERT INTO dbo.TEMP_LAG_Test
VALUES	(1234, 1, 'ABC123', 1.23)
	, (1234, 1, 'ABC123', 2.34)
	, (1234, 1, 'ABC123', 2.43)
	, (1234, 1, 'XYZ456', 3.45)
	, (1234, 1, 'XYZ456', 4.56)
	, (1234, 2, 'ABC123', 5.67)
	, (1234, 2, 'ABC123', 6.78)
	, (1234, 2, 'XYZ456', 7.89)
	, (1234, 2, 'XYZ456', 8.90)
	, (1234, 2, 'XYZ456', 8.09)
	, (5678, 1, 'ABC123', 9.01)	-- A different invoice

Lets say that I just want a True/False value when any of InvoiceNo, SubInvoiceNo, MatterID change. My first attempt is this:

SELECT	[NewSubHeading] = 
		CASE WHEN T.InvoiceNo = 
				LAG(T.InvoiceNo, 1) OVER(
						PARTITION BY T.InvoiceNo, T.SubInvoiceNo, T.MatterID 
						ORDER BY T.ChargeAmount
						)
			AND T.SubInvoiceNo =
				LAG(T.SubInvoiceNo, 1) OVER(
						PARTITION BY T.InvoiceNo, T.SubInvoiceNo, T.MatterID 
						ORDER BY T.ChargeAmount
						)
			AND T.MatterID = 
				LAG(T.MatterID, 1) OVER(
						PARTITION BY T.InvoiceNo, T.SubInvoiceNo, T.MatterID 
						ORDER BY T.ChargeAmount
						)
			THEN 0 ELSE 1 END
	, T.InvoiceNo
	, T.SubInvoiceNo
	, T.MatterID
	, T.ChargeAmount
FROM	dbo.TEMP_LAG_Test AS T
ORDER BY T.InvoiceNo, T.SubInvoiceNo, T.MatterID
	, T.ChargeAmount	-- Tie break
GO

DROP TABLE dbo.TEMP_LAG_Test
GO

stepson · March 8, 2017, 10:35am

My first idea is to use ROW_NUMBER :

SELECT
    [NewSubHeading] = CASE WHEN RN = 1 THEN 1 ELSE 0 END
    , T.InvoiceNo
	, T.SubInvoiceNo
	, T.MatterID
	, T.ChargeAmount
FROM
(    
    SELECT
        ROW_NUMBER()OVER(PARTITION BY T.InvoiceNo ,T.SubInvoiceNo ,T.MatterID  
                         ORDER BY  T.InvoiceNo, T.SubInvoiceNo, T.MatterID 	, T.ChargeAmount) AS RN                         
        , T.InvoiceNo
        , T.SubInvoiceNo
        , T.MatterID
        , T.ChargeAmount
    FROM	dbo.TEMP_LAG_Test AS T
)T
ORDER BY T.InvoiceNo, T.SubInvoiceNo, T.MatterID
	, T.ChargeAmount	-- Tie break

I get the same results

  	NewSubHeading	InvoiceNo	SubInvoiceNo	MatterID	ChargeAmount
	1	1234	1	ABC123	1,2300
	0	1234	1	ABC123	2,3400
	0	1234	1	ABC123	2,4300
	1	1234	1	XYZ456	3,4500
	0	1234	1	XYZ456	4,5600
	1	1234	2	ABC123	5,6700
	0	1234	2	ABC123	6,7800
	1	1234	2	XYZ456	7,8900
	0	1234	2	XYZ456	8,0900
	0	1234	2	XYZ456	8,9000
	1	5678	1	ABC123	9,0100

Kristen · March 8, 2017, 10:54am

Thanks, hadn't thought of that one.

That has the same Logical Rows and Scans as my test, so performance looks good (bit of a small sample size! I'll try something bigger in a moment).

There's a lot less going on (in terms of Stream Aggregate to get LAST_VALUE, and a Window Spool(ROWS BETWEEN), and some Compute Scalar(DEFINE: for TopRowNumberNNNN and BottomRowNumberNNNN, and whilst they may be pretty trivial they aren't "nothing"

Kristen · March 9, 2017, 12:52pm

I wonder if there is a "risk" that the main ORDER BY gets out of sync with the ROW_NUMBER() one, and thus perhaps worth having an additional ROW_NUMBER() in the sub-select, to sequence the whole data list, and then using that as the main ORDER BY - just so that the two ORDER BY sets of code are near each other and thus more likely to be kept-in-step

...
FROM
(
    SELECT
        [RN] = ROW_NUMBER()OVER(PARTITION BY T.InvoiceNo ,T.SubInvoiceNo ,T.MatterID  
                         ORDER BY T.InvoiceNo, T.SubInvoiceNo, T.MatterID, T.ChargeAmount)
        , [RN2] = ROW_NUMBER()OVER(
                         ORDER BY T.InvoiceNo, T.SubInvoiceNo, T.MatterID, T.ChargeAmount)
        ...
)T
ORDER BY RN2

It adds a

Sequence Project(DEFINE:([Expr1002]=row_number))

but the rest of the query plan is the same, as is the number of logical reads.

stepson · March 10, 2017, 8:13am

Adding an additional Row_Number() will not hurt (or at least a minimal overhead) , then is better
to be safe