Query Challenge: Time Spent Calculation

I have a table called Activity. It has many fields in it but for this query I only care about four fields...

DateStarted
DateEnded
UserID
LessonID

The first two fields are DateTime and the last two are GUIDS.

A userid can have multiple entries per lessonid. I need to calculate total time spent on a lesson for each user. But here is where it gets complicated... some of the entries have overlapping datetimes because users are opening multiple browsers in order to reach the minimum timespent faster (yes, we should re-code the app to prevent that but we need a short term solution). So the query needs to subtract out the overlapping portion of entries. For example...

Record 1:
June 6, 9:00am - 11:00am

Record 2:
June 6, 10:00am - 11:30am

Record 3:
June 7 10:00am - 10:30am

Total time spent = 3 hours

declare @Source TABLE
( UserID tinyInt
  ,LessonID tinyInt
  ,DateStarted DATETIME2(0)
  ,DateEnded DATETIME2(0)
)

Insert into @Source(UserID,LessonID,DateStarted,DateEnded)
values(1,4,'20180606 09:00:00','20180606 11:00:00')
,(1,4,'20180606 10:00:00','20180606 11:30:00')
--,(1,4,'20180606 11:30:00','20180606 12:00:00')
--,(1,4,'20180606 14:30:00','20180606 15:00:00')
,(1,4,'20180607 10:00:00','20180607 10:30:00')

;WITH ctePrevious AS
(
SELECT * 
,LAG(DateEnded,1,NULL) OVER(PARTITION BY UserID,LessonID ORDER BY DateStarted) as prevDateEnd
FROM @Source
)
,cteGrp AS
(
SELECT *
  , CASE WHEN prevDateEnd >= DateStarted AND prevDateEnd < DateEnded THEN 0
         ELSE 1 END as grp
FROM ctePrevious
)
,cteSum AS
(
SELECT *
  , SUM(grp) OVER(PARTITION BY UserID,LessonID ORDER BY DateStarted
                  ROWS UNBOUNDED PRECEDING) as sumGrp
FROM cteGrp
)
,cteMinMax AS
(
SELECT UserID,LessonID
    ,min(DateStarted) as startDate
    ,max(DateEnded) as endDate
FROM cteSum
GROUP BY UserID,LessonID,sumGrp
)
SELECT UserID,LessonID
   ,SUM(DATEDIFF(minute,startDate,endDate)) As TotalMinutes
FROM cteMinMax
GROUP BY UserID,LessonID

output

UserID LessonID TotalMinutes
1 4 180

dbfiddle here

I belive it's something similar to an article from Itzik Ben-Gan , from where I got the inspiration.

ps: a used a few more CTEs , just for better reading/understanding the flow. You can concatenate it in a few less.

Later update:
column grp should be changed to :

   , CASE WHEN prevDateEnd >= DateStarted AND prevDateEnd < DateEnded THEN 0
         WHEN prevDateEnd >= DateStarted AND prevDateEnd >= DateEnded THEN 0
         ELSE 1 END as grp

Alternative:

select userid
      ,lessonid
      ,sum(datediff(minute,datestarted,dateended)) as lessonmin
      ,max(datediff(minute,0,dateended))
      -min(datediff(minute,0,datestarted)) as actualmin
      ,sum(datediff(minute,datestarted,dateended))
      -max(datediff(minute,0,dateended))
      +min(datediff(minute,0,datestarted)) as diffmin
  from activity
 group by userid
         ,lessonid
 having sum(datediff(minute,datestarted,dateended))
       -max(datediff(minute,0,dateended))
       +min(datediff(minute,0,datestarted))
       !=0
;

Here is another option:

   With nextStartDates
     As (
 Select s.UserID
      , s.LessonID
      , s.DateStarted
      , s.DateEnded
      , NextDateStarted = lead(s.DateStarted, 1, s.DateEnded) over(Partition By s.UserID, s.LessonID Order By s.DateStarted)
   From @Source s
        )
 Select ns.UserID
      , ns.LessonID
      , TotalMinutes = sum(datediff(minute, ns.DateStarted, iif(ns.DateEnded < ns.NextDateStarted, ns.DateEnded, ns.NextDateStarted)))
   From nextStartDates ns
  Group By
        ns.UserID
      , ns.LessonID;

We don't need to worry about creating date groups here - just need to identify where we overlap and not count the minutes/time during that overlap period.

Actually - there are a couple scenarios that need to be accounted for...that I missed in the previous solution. If the next sessions time is contained within the previous sessions time - it needs to be excluded. Also - if the next sessions start time is exactly the same as the previous - we need to account for that...

Here is an updated solution:

Declare @Source Table (
        UserID tinyint
      , LessonID tinyint
      , DateStarted datetime
      , DateEnded datetime
        );

 Insert Into @Source (
        UserID
      , LessonID
      , DateStarted
      , DateEnded
        )
 Values (1, 4, '20180606 09:00:00', '20180606 10:00:00')
      , (1, 4, '20180606 09:00:00', '20180606 10:30:00')
      , (1, 4, '20180606 09:15:00', '20180606 11:00:00')
      , (1, 4, '20180606 10:00:00', '20180606 10:30:00')
      , (1, 4, '20180607 10:00:00', '20180607 10:30:00');

   With nextStartDates
     As (
 Select s.UserID
      , s.LessonID
      , s.DateStarted
      , s.DateEnded
      , NextDateStarted = lead(s.DateStarted, 1, s.DateEnded) over(Partition By s.UserID, s.LessonID Order By s.DateStarted, s.DateEnded)
      , PrevDateEnded = lag(s.DateEnded, 1, s.DateStarted) over(Partition By s.UserID, s.LessonID Order By s.DateStarted, s.DateEnded)
   From @Source s
        )
 Select ns.UserID
      , ns.LessonID
      , TotalMinutes = sum(datediff(minute, ns.DateStarted, iif(ns.DateEnded < ns.NextDateStarted, ns.DateEnded, ns.NextDateStarted)))
 --Select *
   From nextStartDates ns
  Where ns.PrevDateEnded < ns.DateEnded
  Group By
        ns.UserID
      , ns.LessonID;

For this one - we order the lead/lag to include the date ended which will insure that we calculate 0 minutes for all sessions that have the same start date until the latest session, and then we eliminate the rows where the previous date ended value is greater than the current date ended value - or rather, we only include rows to be counted where the previous date ended is less than current date ended.

Another


if OBJECT_ID('tempDb..#temp') is not null drop table #temp;

create table #temp
(
UserID varchar(10),
lessonId varchar(10),
DateStarted datetime2,
DateEnded datetime2,

);

create index #tempIdx on #temp(UserId, lessonId);

insert #temp
values
('1','1111','6/6/2018  9:00:00 AM','6/6/2018  11:00:00 AM'),
('1','1111','6/6/2018  10:00:00 AM','6/6/2018  11:30:00 AM'),
('1','1111','6/7/2018  10:00:00 AM','6/7/2018  10:30:00 AM'),
('2','1111','9/15/2018  10:00:00 AM','9/15/2018  11:30:00 AM'),
('2','1111','9/15/2018  9:30:00 AM','9/15/2018  10:15:00 AM'),
('2','1111','9/15/2018  2:00:00 PM','9/15/2018  3:00:00 PM'),
('1','2222','9/15/2018  9:00:00 AM','9/15/2018  11:00:00 AM'),
('1','2222','9/16/2018  10:00:00 AM','9/16/2018  11:30:00 AM'),
('1','2222','9/16/2018  11:15:00 AM','9/16/2018  2:30:00 PM')

;

with cte
as
(
select a.UserID , a.lessonId

,DATEDIFF(MINUTE, b.sd, b.ed)/cast(60 as decimal(4,2) ) [hrs]

from #temp a
outer apply
(
select MIN(b.DateStarted) [sd]
,MAX(b.DateEnded) [ed]
from #temp b
where b.UserID = a.UserID 
and b.lessonId = a.lessonId 
and not b.DateEnded < a.DateStarted
and not b.DateStarted > a.DateEnded
) b

group by a.UserID , a.lessonId , b.sd , b.ed 

)

select a.UserID , a.lessonId , sum(a.hrs ) [Total time spent]
from cte  a
group by a.UserID , a.lessonId;

Of course - thinking about this a bit more I found another issue...if multiple rows can be covered by previous rows the calculation will be incorrect.

Declare @Source Table (
        UserID tinyint
      , LessonID tinyint
      , DateStarted datetime
      , DateEnded datetime
        );

 Insert Into @Source (
        UserID
      , LessonID
      , DateStarted
      , DateEnded
        )
 Values (1, 4, '20180606 09:00:00', '20180606 11:00:00')        -- 120 minutes
      , (1, 4, '20180606 09:01:00', '20180606 09:30:00')        -- overlapped
      , (1, 4, '20180606 09:02:00', '20180606 09:35:00')        -- overlapped
      , (1, 4, '20180606 09:03:00', '20180606 09:40:00')        -- overlapped
      , (1, 4, '20180606 09:04:00', '20180606 11:15:00')        -- 15 minutes
      , (1, 4, '20180606 09:15:00', '20180606 11:30:00')        -- 15 minutes
      , (1, 4, '20180606 10:00:00', '20180606 10:30:00')        -- overlapped
      , (1, 4, '20180607 10:00:00', '20180607 10:30:00');       -- 30 minutes
                                                                -- Total 180 minutes

   With getMaxDates
     As (
 Select s.UserID
      , s.LessonID
      , s.DateStarted
      , s.DateEnded
      , MaxDateEnded = max(s.DateEnded) over(Partition By s.UserID, s.LessonID 
                                                 Order By s.DateStarted, s.DateEnded 
                                             Rows Between Unbounded Preceding And Current Row)
   From @Source                  s
        )
      , getNextStarted
     As (
 Select md.UserID
      , md.LessonID
      , md.DateStarted
      , md.DateEnded
      , md.MaxDateEnded
      , NextDateStarted = lead(md.DateStarted, 1, md.DateEnded) over(Partition By md.UserID, md.LessonID
                                                                         Order By md.DateStarted, md.DateEnded)
   From getMaxDates             md
  Where md.MaxDateEnded <= md.DateEnded
        )
 Select ns.UserID
      , ns.LessonID
      , TotalTime = sum(datediff(minute, ns.DateStarted, iif(ns.NextDateStarted < ns.DateEnded, ns.NextDateStarted, ns.DateEnded)))
   From getNextStarted          ns
  Group By
        ns.UserID
      , ns.LessonID;

This change addresses that issue and resolves for all scenarios. In this one we get the max date ended from any previous rows - then we eliminate any rows where the date ended is less than the max date ended - now we can summarize the difference for each row.

You guys are awesome! Thanks for all of the queries. I will give them a try. The one thing I didn't mention in my original post is that this will be executed against about 20 million records so performance is a consideration.

Thanks!