SQLTeam.com | Weblogs | Forums

Rolling average in SQL Server 2008


#1

Hello forum,

I have been trying to practice creating a rolling average and was following the online advice here (https://www.brentozar.com/archive/2013/02/rolling-averages-in-sql-server/) as I am using SQL Server 2008 and so I don't have some new functionality that would make the exercise easier.

For some reason, it isn't working though (I have created a calendar table in my database and I am using some simple sales data). Does anyone know why?

With cte
as (select DENSE_RANK() over (order by YEAR([Order Date]), datepart(mm, [Order Date])) as r,
YEAR([Order Date]) as [year],
datepart(mm, [Order Date]) as [month],
sum(Sales) as Sales
from dbo.Calendar c
left join dbo.LBPractice p on p.[Order Date] = c.[Date]
where c.[Date] between '2014-01-01'
and '2015-01-01'
group by YEAR([Order Date]),
datepart(mm, [Order Date]))

Select cte1.r,
cte1.year,
cte1.month,
avg(cte2.Sales) as AverageSales
from cte as cte1
join cte as cte2 on cte1.r > (cte2.r -12) and cte1.r <> cte2.r
group by cte1.year, cte1.month, cte1.r
order by cte1.year, cte1.month, cte1.r;

Thanks in advance!


#2

Post DDL for tables and inserts of sample data with desired result please.


#3

And please tell us what you mean by "isn't working" because we don't have any of your data.


#4

What I would like to see is a rolling average, in this example I think it should be a year but it could equally have been a 3-month rolling average so the fields would be for example year, month, average (in this case sum of sales).

How it isn't working is that I get a value but I can't work out where the value came from as in it doesn't look to be the correct average to me. I thought this might be to do with something in the script, I have made a mistake on. Not sure if this is obvious to an expert?


#5

My practice table is rows of sales transactions:

insert into dbo.LBPractice ([Row ID], [Order ID], [Order Date], [Ship Date],
[Ship Mode], [Customer ID], [Customer Name], Segment, City, State, Country,
Region, [Product ID], Category, [Sub-Category], [Product Name], Sales, Quantity,
Discount, Profit)

e.g.

values
(9001, 'ES-2013-1216394', '2014-8-31', '2014-8-31', 'Same Day', 'JK-16120', 'Julie Kriz', 'Home Office', 'Aylesbury', 'England', 'United Kingdom', 'North', 'OFF-LA-10004753', 'Office Supplies', 'Labels', 'Novimex Round Labels, Laser Printer Compatible', 25.08, 4, 0, 11.76),


#6

My calendar table to create all days in history and future:

--creating the calendar table
create table dbo.Calendar
([Date] datetime not null,
FirstDayOfMonth datetime not null,
LastDayOfMonth datetime not null)

--inserting days into the calendar table
declare @the_date as datetime;

set @the_date = N'19990101'

while @the_date < (DATEADD(yyyy, 10, getdate()))

begin
insert into dbo.Calendar
([Date],
FirstDayOfMonth,
LastDayOfMonth)

values
(@the_date,
DATEADD(mm, datediff(mm, 0, @the_date), 0),
DATEADD(MM, DATEDIFF(mm, 0, @the_date), -1))
;

select @the_date = DATEADD(dd, 1, @the_date);
end


#7

Although I wouldn't have written the script the way you have, it looks possible that it could be correct and could be an underlying data problem. That's why some of us have asked for table definitions and a bit of data so that we can see if that might be true.

Shifting gears to that subject, could you tell us what the datatype for the [Order Date] column in the LBPractice table is? I also need to know what the PK column(s) is/are and what their datatypes are for that same table.

Once I know that, I can gin up an experimental bit of demo code that I think you'll like with the caveat that I can't start it until tonight.


#8

Very useful feedback Jeff, I thank you. I am only looking for help with a rolling average in SQL server 2008 so please do let me know if you think of a better way! :wink: I really appreciate your help!

The data type for [Order Date] is just date (not datetime). There is a [Row ID] field which is int and this is the primary key.

Thanks so much! :grinning:


#9

Apologies for the delay. I got nailed by production issues last night.

Thank you for the information. I have enough information to build an example for you now. I'll bang it out for you tonight.


#10

First and again, apologies for the delay. Work has been a bit of a bear lately. Hopefully, though, you'll find it worth the wait.

I never like to post code for a solution without testing it for performance. With that, let's make a larger table that simulates your table.

Before we get into that, we need a helper tool known as "fnTally" which very quickly generates rows with a sequential integer value in it. As with a Tally or Numbers table, it replaces a much slower loop or recursive CTE and can be used as a simple "row source" or "presence of rows" to create large quantities of randomized data. The function is fully documented in the comments in the code.

After doing a visual safety check, please run the following code in your test database to create the function. Once you get used to using the hidden power of this function, you should probably put a copy of the function in all your databases.

 CREATE FUNCTION [dbo].[fnTally]
/**********************************************************************************************************************
 Purpose:
 Return a column of BIGINTs from @ZeroOrOne up to and including @MaxN with a max value of 1 Trillion.

 As a performance note, it takes about 00:02:10 (hh:mm:ss) to generate 1 Billion numbers to a throw-away variable.

 Usage:
--===== Syntax example (Returns BIGINT)
 SELECT t.N
   FROM dbo.fnTally(@ZeroOrOne,@MaxN) t
;

 Notes:
 1. Based on Itzik Ben-Gan's cascading CTE (cCTE) method for creating a "readless" Tally Table source of BIGINTs.
    Refer to the following URLs for how it works and introduction for how it replaces certain loops. 
    http://www.sqlservercentral.com/articles/T-SQL/62867/
    http://sqlmag.com/sql-server/virtual-auxiliary-table-numbers
 2. To start a sequence at 0, @ZeroOrOne must be 0 or NULL. Any other value that's convertable to the BIT data-type
    will cause the sequence to start at 1.
 3. If @ZeroOrOne = 1 and @MaxN = 0, no rows will be returned.
 5. If @MaxN is negative or NULL, a "TOP" error will be returned.
 6. @MaxN must be a positive number from >= the value of @ZeroOrOne up to and including 1 Billion. If a larger
    number is used, the function will silently truncate after 1 Billion. If you actually need a sequence with
    that many values, you should consider using a different tool. ;-)
 7. There will be a substantial reduction in performance if "N" is sorted in descending order.  If a descending 
    sort is required, use code similar to the following. Performance will decrease by about 27% but it's still
    very fast especially compared with just doing a simple descending sort on "N", which is about 20 times slower.
    If @ZeroOrOne is a 0, in this case, remove the "+1" from the code.

    DECLARE @MaxN BIGINT; 
     SELECT @MaxN = 1000;
     SELECT DescendingN = @MaxN-N+1 
       FROM dbo.fnTally(1,@MaxN);

 8. There is no performance penalty for sorting "N" in ascending order because the output is explicity sorted by
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

 Revision History:
 Rev 00 - Unknown     - Jeff Moden 
        - Initial creation with error handling for @MaxN.
 Rev 01 - 09 Feb 2013 - Jeff Moden 
        - Modified to start at 0 or 1.
 Rev 02 - 16 May 2013 - Jeff Moden 
        - Removed error handling for @MaxN because of exceptional cases.
 Rev 03 - 22 Apr 2015 - Jeff Moden
        - Modify to handle 1 Trillion rows for experimental purposes.
**********************************************************************************************************************/
        (@ZeroOrOne BIT, @MaxN BIGINT)
RETURNS TABLE WITH SCHEMABINDING AS 
 RETURN WITH
  E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1)                                  --10^1 or 10 rows
, E4(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c, E1 d)      --10^4 or 10 Thousand rows
,E12(N) AS (SELECT 1 FROM E4 a, E4 b, E4 c)            --10^12 or 1 Trillion rows                 
            SELECT N = 0 WHERE ISNULL(@ZeroOrOne,0)= 0 --Conditionally start at 0.
             UNION ALL 
            SELECT TOP(@MaxN) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E12 -- Values from 1 to @MaxN
;

The test code generator uses the fnTally function as a "row source" to create a permanent table (we need it to be permanent, in this case) with a substantial 5 Million rows of data and the required indexes in something less than 30 seconds if your test database isn't in the FULL Recovery Model. The data is highly randomized but predictably constrained. Again, a bit of detail is contained in comments in the code.

/**********************************************************************************************************************
 Create a new test table called dbo.LBPractice1 so that we don't overwrite the OP's original table and populate it
 with randomized but constrained test data.  See the comments below for details
**********************************************************************************************************************/
--===== If the test table already exists, drop it to make reruns in SSMS easier.
     IF OBJECT_ID('dbo.LBPractice1','U') IS NOT NULL
        DROP TABLE dbo.LBPractice1
;
GO
--===== Create the test table with the CLUSTERED PK already in place.
 CREATE TABLE dbo.LBPractice1
        (
         Row_ID         INT             IDENTITY(1,1) NOT NULL
        ,[Order Date]   DATE            NOT NULL
        ,Sales          DECIMAL(9,2)    NOT NULL
        ,OtherCols      CHAR(150)       NOT NULL --Just to add some bulk to the table
        CONSTRAINT PK_LBPractice1 PRIMARY KEY CLUSTERED (Row_ID)
        )
;
--===== Populate the table with 5 million rows of randomized data 
     -- with dates starting on 2010-01-01 thru today
     -- and sales values of 1.00 to 200.99.
     -- Not to worry... this doesn't take a long time.
     -- If you're NOT using the FULL Recovery Model for your test database, 
     -- this only takes about 20 seconds to generate data INLUDING the clustered index
     -- and about 4 seconds to add the Non Clustered index.
     -- So, normally something less than 30 seconds for this.
 INSERT INTO dbo.LBPractice1 WITH (TABLOCK) --Minimally logged and very fast if not in FULL Recovery Model
        ([Order Date],Sales,OtherCols)
 SELECT  [Order Date]   = CONVERT(DATE,DATEADD(dd,ABS(CHECKSUM(NEWID())%DATEDIFF(dd,'2010',GETDATE())),'2010'))
        ,Sales          = CONVERT(DECIMAL(9,2),RAND(CHECKSUM(NEWID()))*200+1)
        ,OtherCols      = CONVERT(CHAR(150),'X') -- Simulate the bulk of the other columns
   FROM dbo.fnTally(1,5000000)
 OPTION (RECOMPILE) --Helps minimal logging work (undocumented but true ;-) )
;
--===== Create an important non-clustered index.
     -- If your real table doesn't have this index,
     -- consider building it because it makes what's coming fly.
     -- Minimally logged if not in FULL Recovery Model.
 CREATE INDEX By_OrderDate 
     ON dbo.LBPractice1 ([Order Date])
INCLUDE (Sales)
;
GO

In the article you cited, the table (Sales.SalesOrderHeader) is from the "AdventureWorks" database and only has a total of 31,465 rows in the 2008 version. That's not enough for a real performance test, in my humble opinion. If we query a year's worth of data from the test table we just create, we end up with more than 700 THOUSAND rows, which is starting to look more like a decent performance test. Here's some code that demonstrates that.

--===== Get a count of the actual rows we're going to deal with for the demo.
     -- This includes the required extra two months at the begining of the year 
     -- to correctly form 3 month running averages.
 SELECT COUNT(*) --764,590 Rows in my test. Because of it being random data, YMMV
   FROM dbo.LBPractice1
  WHERE [ORDER DATE]   >= DATEADD(mm,-2,'2014-01-01')
    AND [ORDER DATE]    < '2015-01-01' 
;
GO

Just as with any code, we need to keep our data layer separate from the presentation layer. We also need to follow the principle of working only with the data we need in the form of "pre-aggregated" (a term coined by Peter "PESO" Larsson, an early YAK member of these very forums). Last but not least, we also need to avoid the "dark secret" that the cited article mentioned. Since we have to hit our aggregates many times, it's MUCH more efficient to save the result of the CTE rather than recalculating more than once.

It's also known as "Divide'n'Conquer", which can sometimes run lightning quick compared to "all in one super queries" like the one from the cited article. Despite what Joe Celko frequently and incorrectly rants about, remember that "Set-Based" does not now nor has it ever meant that you have to do everything in one query. The "1950's" technology of materializing an interim result set worked very well back then and it still works incredibly well even in this day and age.

Shifting gears back to a solution, you'll see that I didn't provide just a script for the solution because the next thing out of people's mouths is "How do I turn this into a stored procedure that takes parameters"? The reason why I've not turned it into a stored procedure is because the next thing out of people's mouths is "How do I use the output of the stored procedure in another part of my query"?

So here's a Multistatement Table Valued Function ("mTVF" for short) that works just like a stored procedure but you can include it in the FROM clause of another query. Rather that run you through everything that this function does, please feel free to read the embedded comments and explore the code on your own. In particular, notice how I've replaced the functionality of a Calendar table in the code.

Here's the function. After a visual safety check, please run the code in your test database to deploy the function for your testing. I did make the assumption that people who want a running average of months also want to see the original monthly total and a cumulative running total, as well. If you don't need those in your output, simple don't select those columns from the function.

 CREATE FUNCTION dbo.MovingAvgReport
/**********************************************************************************************************************
 Purpose:
 Given the number of months to consider for a "leading" rolling average, the starting month, and the start of the 
 first month NOT to be included, return a report of total MonthSales, a running total of sales (CumeSales) and the
 rolling leading average.  Only the months identified by the start and end dates (not including the month of the end
 date) will be returned so, by themselves, the first couple of months may not look correct but actually are.  The data
 that was used to calculate the rolling average just isn't visible in the output.

 Example Usage:
 SELECT [YYYY-MMM],MonthSales,CumeSales,AvgSales
   FROM dbo.MovingAvgReport(3,'2014-01-01','2015-01-01')
  ORDER BY MonthNumber
;
 Programmer's Notes:
 1. Note that the CROSS APPLYs in the code below use 
Revision History:
Rev 00 - 19 Aug 2017 - Jeff Moden
       - Demo for the following forum post;
       - http://forums.sqlteam.com/t/rolling-average-in-sql-server-2008/11158
       - Also reference the following forum post, which is the basis of the OP's question.
       - https://www.brentozar.com/archive/2013/02/rolling-averages-in-sql-server/
**********************************************************************************************************************/
--===== These are the obviously named parameters for the function.
        (
         @pAvgMonths    INT  --Number of months for the moving average.
        ,@pStartMonth   DATE --Inclusive
        ,@pEndMonth     DATE --Exclusive or first month you don't want to include
        )
RETURNS @Return TABLE 
        (
         MonthNumber    INT
        ,[YYYY-MMM]     CHAR(8)   
        ,MonthSales     DECIMAL(9,2)
        ,CumeSales      DECIMAL(38,2)    
        ,AvgSales       DECIMAL(9,2)
        )
     AS
  BEGIN
-----------------------------------------------------------------------------------------------------------------------
--===== Because we're working in a function, we can't use Temp Tables so we need
     -- to use a Table Variable. It's ok for this because the row counts will be low.
DECLARE @PreAgg TABLE
        (
         MonthNumber    INT     NOT NULL
        ,MonthDate      DATE    NOT NULL
        ,MonthSales     DECIMAL(38,2)
        )
;
--===== Preaggregate our data and remember it so that we don't have to recalculate it
     -- many times like you would with multiple calls to a CTE.
   WITH ctePreAgg AS
(--===== Pre-aggregate the data we want to work with
 SELECT  OrderMonth     = DATEADD(mm,DATEDIFF(mm,0,[ORDER DATE]),0)
        ,MonthSales     = SUM(Sales)
   FROM dbo.LBPractice1
  WHERE [ORDER DATE]   >= DATEADD(mm,(-@pAvgMonths+1),@pStartMonth)
    AND [ORDER DATE]    < @pEndMonth
  GROUP BY DATEDIFF(mm,0,[ORDER DATE])
)
,
        cteCalendar AS
(--==== Create the necessary dates that will help us return a zero value for any missing months
 SELECT  MonthNumber    = t.N-(@pAvgMonths-2)
        ,MonthDate      = CONVERT(DATE,DATEADD(mm,t.N,DATEADD(mm,(-@pAvgMonths+1),@pStartMonth)))
   FROM dbo.fnTally(0,DATEDIFF(mm,DATEADD(mm,(-@pAvgMonths+1),@pStartMonth),@pEndMonth)-1) t 
)
 --==== Populate the pre-aggregation table with the minimum amount of data we need, which are whole months and sums.
 INSERT INTO @PreAgg
        (MonthNumber,MonthDate, MonthSales)
 SELECT  MonthNumber    = ISNULL(ca.MonthNumber,0) --Makes the created column NOT NULL
        ,ca.MonthDate
        ,MonthSales     = ISNULL(CONVERT(DECIMAL(38,2),pa.MonthSales),0)
   FROM ctePreAgg pa
  RIGHT JOIN cteCalendar ca ON pa.OrderMonth = ca.MonthDate
;
--===== Now we can calculate the "display" dates, the running total, and the running average on just
     -- a few rows from the pre-aggregated data.  Note that this DOES use a method that is really bad
     -- for performance known as a "Triangular Join" but, for the super small rowcounts that this function
     -- will ever see in the pre-aggregated table, it's faster than hitting a CTE 3 times.
     -- Please see the following article for what a "Triangluar Join" is and why they're so bad for performance.
     -- http://www.sqlservercentral.com/articles/T-SQL/61539/
     -- The thing that makes it bad is the "super/hidden RBAR" in the correlated subqueries that have
     -- aggregates based on in-equalities.  Each row of return is formed by 1/2 of a cartesian product.
     -- Again, avoid these if you can but there was no other way in 2008 to get this into a function and
     -- the number of rows we're dealing with are super tiny.
 INSERT INTO @Return
        (MonthNumber,[YYYY-MMM],MonthSales,CumeSales,AvgSales)
 SELECT  pa.MonthNumber
        ,[YYYY-MMM]     = DATENAME(yy,pa.MonthDate)+'-'+LEFT(DATENAME(mm,pa.MonthDate),3)
        ,pa.MonthSales
        ,CumeSales      = c.CumeSales
        ,AvgSales       = CONVERT(DECIMAL(9,2),a.AvgSales)
   FROM @PreAgg pa
  CROSS APPLY (SELECT SUM(ca.MonthSales) FROM @PreAgg ca WHERE ca.MonthNumber BETWEEN 1 AND pa.MonthNumber) c (CumeSales)
  CROSS APPLY (SELECT AVG(ca.MonthSales) FROM @PreAgg ca WHERE ca.MonthNumber BETWEEN pa.MonthNumber-2 AND pa.MonthNumber) a (AvgSales)
  WHERE pa.MonthNumber > 0
;
 RETURN;
-----------------------------------------------------------------------------------------------------------------------
    END
;
GO

So, how to use it? That information is included in the "Example Usage" section of the header (I build my production functions with the same goodies in them to make my life easier a year from now when someone asks me to modify the function). But, here's that example...

 SELECT [YYYY-MMM],MonthSales,CumeSales,AvgSales
   FROM dbo.MovingAvgReport(3,'2014-01-01','2015-01-01')
  ORDER BY MonthNumber
;

If you've been following along and executing all of the code I posted above and run the code immediately above, you'll find that we get the monthly sum, running total, and moving average for a 12 month period containing more than 700,000 rows out of a 5 million row table in only 370 MILLI-seconds and I've used three of the "worst" practices (Table Valued Variables, Triangular Joins, and mTVFs) there are to do it. :wink:

Like I said, I hope that was worth the wait. Let me know if you have any questions after you try the code.


#11

As always Jeff, a great answer and even better explanation. Very well done sir!

Just as a side note, assuming there isn't some nuance that I'm missing (entirely possible), it looks like both of the CROSS APPLYs could be combined, reducing the expense of 2 correlated sub-queries down to 1...

CROSS APPLY (
			SELECT 
				SUM(ca.MonthSales),
				AVG(CASE WHEN ca.MonthNumber >= pa.MonthNumber - 2 THEN ca.MonthSales ELSE NULL END)
			FROM 
				@PreAgg ca
			WHERE 
				ca.MonthNumber BETWEEN 1 AND pa.MonthNumber
			) cax (CumeSales, AvgSales)

#12

Thanks, Jason,

I believe the nuance that will trip that up is, for example, when doing a 3 month rolling average on a 12 month return. In other words, what is it in your code that will relegate what the 3 month average is based on, which should be months 10, 11, and 12 for month 12. I may be missing a nuance myself but I don't see how that is controlled in your code.

Either that or I have to stop reading forum posts after 1:30 in the morning. :wink:


#13

Scratch my last... I see it. Good ol' NULL to the rescue. Thank you. I learned something new today.


#14

Thank you so much @JeffModen, sorry for the delay in replying - it took me a while to digest!

This has really helped me and taught me a lot.

Thank you for taking the time to explain each element so carefully! :grinning:


#15

My pleasure @Uganda. Thank your for the feedback and thank you for taking the time to digest the post instead of just consuming the code.