Is using a CTE to simplify a GROUP BY bad for performance / optimisation?

I appreciate this is a generic question - I'm only after a generic answer :slight_smile:

Lets say I have

SELECT
	ActivityCode
	, EmployeeID
	, [StartTime] = DATEADD(Day, DATEDIFF(Day, 0, StartTime), 0)
	, [IsFinshed] = CASE WHEN FinishTime IS NULL THEN 0 ELSE 1 END
	, [TotalTime] = SUM(TimeWorked)
FROM	MyTable
GROUP BY 
	ActivityCode
	, EmployeeID
	, DATEADD(Day, DATEDIFF(Day, 0, StartTime), 0)
	, CASE WHEN FinishTime IS NULL THEN 0 ELSE 1 END

I have to keep the code in the GROUP BY and the SELECT "in step" Let's assume its a lot more complex than this example.

I can simplify the code (and its future maintenance :slight_smile: ) by moving the calculated stuff to a CTE and then GROUPing it in the main query:

; WITH MyCTE AS
(
SELECT
	ActivityCode
	, EmployeeID
	, [StartTime] = DATEADD(Day, DATEDIFF(Day, 0, StartTime), 0)
	, [IsFinshed] = CASE WHEN FinishTime IS NULL THEN 0 ELSE 1 END
	, TimeWorked
FROM	MyTable
)
SELECT
	ActivityCode
	, EmployeeID
	, [StartTime]
	, [IsFinshed]
	, [TotalTime] = SUM(TimeWorked)
FROM	MyCTE
GROUP BY 
	ActivityCode
	, EmployeeID
	, [StartTime]
	, [IsFinshed]

but is that a bad idea in terms of performance or what the optimiser can achieve?

I personally have been bit with CTE. depends on the growth rate of MyTable

That useful to know, thanks @yosiasz

In practice MyTable is big, but the WHERE clause (not included in above example) is for a relatively narrow band of StartTime, and that WHERE clause filter would be in the CTE part. That might be enough mitigating-circumstances to mean that performance is OK.

I was half expecting that the only valid answer would be "run some tests" :slight_smile:

Its probably going to hampered because MyTable is a VIEW, on a VIEW, on a TABLE, so opportunity for the Optimiser to take a nosedive on that alone.

Most often CTEs are resolved in-line, so it's actually efficient to use a CTE for that type of thing.

But the really big thing for performance on this is the GROUPing order. In particular, where possible, it should match the clustering order of the underlying table(s) if possible.

For a table of employee activities, it almost certainly should be clustered on either ( StartTime ) or ( EmployeeID, StartTime ) or even ( StartTime, EmployeeID ). If so, then if possible, GROUP first on the key column(s).

If instead, the table is lazily "by default" clustered on identity, your performance will not be great unless you create a covering index(es) for the underlying query(ies), and that can be a never-ending process as queries change/evolve.

have you run some sort of scaling tests for this :smirk:

I guess I am confused by this:

A CTE in this example is nothing more than an inline view - and the same process will be used by the optimizer. The view will be expanded into the query and then optimized...the view (CTE) will not be materialized prior to being incorporated into the full statement.

Now - if you are using the CTE multiple times in your query - again - it is the same as using the same view multiple times.

Not sure how the size of a table has any impact - unless you are trying to compare a CTE to a temp table which is a completely different usage.

it could have been the way I implemented it also. lets do some sample tests and post back here for comparisons using sample MyTable schema provided making it grow incrementally. it is snowing big time here in Seattle so good day for some coffee and tests!

Thanks Scott. Always useful to have your insight into making sure that the Clustered Index is fit-for-purpose. The data is in 3rd party DB, so it either will be, or not, but we are planning to "warehouse" a read-only copy locally, so that will give me the opportunity to cluster this appropriately.

Yeah, I asked you guys!

Indeed, except that if this is a 100ms operation I'm not going to spend any time improving it, but e.g. with a bigger table / environment that needs a scaleable solution I'm assuming I do need to take some trouble bench marking it. No difference there whether I'm using a CTE or some other method of course. I was just looking for a head's-up whether reusing CTE "columns" in multiple places in the resulting query (e.g. the SELECT and the GROUP BY) was likely to result in SQL generating an inefficient query, compared to one that I hand-coded. I'm sure there comes a point, either in complexity or just "bad luck", where the Optimiser does a pants-job, but sounds like, in the main, CTE will save me some typing and by having query elements NOT coded in duplicate it will reduce the chances of bugs during code maintenance in the future.