Help Grouping/Collapsing DataSet

Hello, I need assistance with how to summarize this data set to get the desired output.

Sample Data:
drop table if exists #temp
create table #temp
([PatientID] varchar(8),[DueDate] date,[TreatmentEnd] int)
insert into #temp
values
('PatientA','3/11/2022',0)
,('PatientA','3/12/2022',0)
,('PatientA','3/13/2022',0)
,('PatientA','3/14/2022',0)
,('PatientA','3/15/2022',0)
,('PatientA','3/16/2022',0)
,('PatientA','3/17/2022',0)
,('PatientB','12/17/2020',0)
,('PatientB','12/18/2020',0)
,('PatientB','12/19/2020',0)
,('PatientB','12/20/2020',0)
,('PatientB','12/21/2020',0)
,('PatientB','12/22/2020',1)
,('PatientB','2/23/2021',0)
,('PatientB','2/24/2021',0)
,('PatientB','2/25/2021',0)
,('PatientB','2/26/2021',1)
,('PatientC','1/8/2022',0)
,('PatientC','1/9/2022',0)
,('PatientC','1/10/2022',0)
,('PatientC','1/11/2022',0)
,('PatientC','1/12/2022',1)
,('PatientC','2/1/2022',0)
,('PatientC','2/2/2022',0)
,('PatientC','2/3/2022',0)
,('PatientC','2/4/2022',0)
,('PatientC','2/5/2022',0)
,('PatientC','2/6/2022',0)
,('PatientC','2/7/2022',1)
,('PatientC','2/21/2022',0)
,('PatientC','2/22/2022',0)
,('PatientC','2/23/2022',0)
,('PatientC','2/24/2022',0)
,('PatientC','2/25/2022',0)
,('PatientC','2/26/2022',0)
,('PatientC','2/27/2022',1)

I need the final output to look like this:
PatientID | StartDate | EndDate
PatientA | 3/11/2022 | null
PatientB | 12/7/2020 | 12/22/2020
PatientB | 2/23/2021 | 2/26/2021
PatientC | 1/8/2022 | 1/12/2022
PatientC | 2/1/2022 | 2/7/2022
PatientC | 2/21/2022 | 2/27/2022

For PatientA, they started a treatment on 3/11/22, it's still current so there's no EndDate for that treatment, that would be null.
For the others, they started and stopped a few times. I need to capture the end points for each treatment.

Appreciate if you have any ideas, just can't seem to crack this logic puzzle. Please let me know if you have any questions.

Thank you!

1 Like

Welcome! GREAT job on providing sample data!

The best query I can come up with now is below. If you have a performance issue with it, let me know and I can try to refine it. Would likely need to use a temp table (rather than a cte) to get better performance. At least off the top of my head that's the best I can think of.


;WITH cte_get_date_ranges AS (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY PatientID 
        ORDER BY StartDate) AS row_num
    FROM (
        SELECT PatientID,
            TreatmentEnd,
            MIN(DueDate) AS StartDate,
            MAX(DueDate) AS PrelimEndDate
        FROM (
            SELECT *,
                DENSE_RANK() OVER (PARTITION BY PatientID ORDER BY DueDate) - 
                DENSE_RANK() OVER (PARTITION BY PatientID, TreatmentEnd ORDER BY DueDate) AS Grp
            FROM #temp
        ) AS derived1
        GROUP BY PatientID,
            TreatmentEnd,
            Grp
    ) AS derived2
)
SELECT 
    cg1.PatientID, cg1.StartDate, cg2.PrelimEndDate AS EndDate        
FROM cte_get_date_ranges cg1
LEFT OUTER JOIN cte_get_date_ranges cg2 ON cg2.PatientID = cg1.PatientID AND 
    cg2.row_num = cg1.row_num + 1 AND cg2.TreatmentEnd = 1
WHERE cg1.TreatmentEnd = 0
ORDER BY cg1.PatientID, cg1.StartDate
1 Like

Wow thank you so much Scott, that worked perfectly! I really appreciate you helping me out with this!

You're welcome!

For the first inner query I "borrowed" a basic "gaps and islands"-style solution. The rest I just improvised.

As to performance, the good thing is that SQL is very good at optimizing the performance of DENSE_RANK (and other windowing functions), so I was hoping it would be ok.

Hi again, I didn't quite have this right. I was calculating the Treatment Ends incorrectly. Here is sample raw data for a patient.

drop table if exists #temp
create table #temp
([PatientID] varchar(8),[DueDate] date,[Completed] int,[Missed] int)
insert into #temp
values
('PatientA','9/23/2021',0,1)
,('PatientA','9/24/2021',1,0)
,('PatientA','9/25/2021',1,0)
,('PatientA','9/26/2021',1,0)
,('PatientA','9/27/2021',1,0)
,('PatientA','9/28/2021',1,0)
,('PatientA','9/29/2021',1,0)
,('PatientA','9/30/2021',1,0)
,('PatientA','10/1/2021',1,0)
,('PatientA','10/2/2021',1,0)
,('PatientA','10/3/2021',1,0)
,('PatientA','10/4/2021',1,0)
,('PatientA','10/5/2021',1,0)
,('PatientA','10/6/2021',1,0)
,('PatientA','10/7/2021',1,0)
,('PatientA','10/8/2021',1,0)
,('PatientA','10/9/2021',0,1)
,('PatientA','10/10/2021',1,0)
,('PatientA','10/11/2021',0,1)
,('PatientA','10/12/2021',1,0)
,('PatientA','10/13/2021',0,1)
,('PatientA','10/14/2021',1,0)
,('PatientA','10/15/2021',1,0)
,('PatientA','10/16/2021',1,0)
,('PatientA','10/17/2021',1,0)
,('PatientA','10/18/2021',1,0)
,('PatientA','10/19/2021',0,1)
,('PatientA','10/20/2021',1,0)
,('PatientA','10/21/2021',1,0)
,('PatientA','10/22/2021',1,0)
,('PatientA','10/23/2021',1,0)
,('PatientA','10/24/2021',1,0)
,('PatientA','10/25/2021',1,0)
,('PatientA','10/26/2021',1,0)
,('PatientA','10/27/2021',0,1)
,('PatientA','10/28/2021',1,0)
,('PatientA','10/29/2021',1,0)
,('PatientA','10/30/2021',1,0)
,('PatientA','10/31/2021',1,0)
,('PatientA','11/1/2021',1,0)
,('PatientA','11/2/2021',1,0)
,('PatientA','11/3/2021',1,0)
,('PatientA','11/4/2021',1,0)
,('PatientA','11/5/2021',1,0)
,('PatientA','11/6/2021',0,1)
,('PatientA','11/7/2021',0,1)
,('PatientA','11/8/2021',0,1)
,('PatientA','11/9/2021',0,1)
,('PatientA','11/10/2021',0,1)
,('PatientA','11/11/2021',0,1)
,('PatientA','11/12/2021',0,1)
,('PatientA','11/13/2021',0,1)
,('PatientA','11/14/2021',0,1)
,('PatientA','11/15/2021',0,1)
,('PatientA','11/16/2021',0,1)
,('PatientA','11/17/2021',0,1)
,('PatientA','11/18/2021',0,1)
,('PatientA','11/19/2021',0,1)
,('PatientA','11/20/2021',0,1)
,('PatientA','11/21/2021',0,1)
,('PatientA','11/22/2021',0,1)
,('PatientA','11/23/2021',0,1)
,('PatientA','11/24/2021',0,1)
,('PatientA','11/25/2021',0,1)
,('PatientA','11/26/2021',0,1)
,('PatientA','11/27/2021',0,1)
,('PatientA','11/28/2021',0,1)
,('PatientA','11/29/2021',0,1)
,('PatientA','11/30/2021',0,1)
,('PatientA','1/27/2022',0,1)
,('PatientA','1/28/2022',0,1)
,('PatientA','1/29/2022',0,1)
,('PatientA','1/30/2022',1,0)
,('PatientA','1/31/2022',0,1)
,('PatientA','2/1/2022',0,1)
,('PatientA','2/2/2022',1,0)
,('PatientA','2/3/2022',0,1)
,('PatientA','2/4/2022',0,1)
,('PatientA','2/5/2022',0,1)
,('PatientA','2/6/2022',0,1)
,('PatientA','2/7/2022',0,1)
,('PatientA','2/8/2022',0,1)
,('PatientA','2/9/2022',0,1)
,('PatientA','2/10/2022',0,1)
,('PatientA','2/11/2022',0,1)
,('PatientA','2/12/2022',0,1)
,('PatientA','2/13/2022',1,0)
,('PatientA','2/14/2022',0,1)
,('PatientA','2/15/2022',0,1)
,('PatientA','2/16/2022',0,1)
,('PatientA','2/17/2022',0,1)
,('PatientA','2/19/2022',0,1)
,('PatientA','2/20/2022',0,1)

The logic is a treatment ends on the 5th consecutive missed day. We want to know the start and end dates for each treatment "island". I didn't include a sample patient that has an active streak going, but for them I would want null for their end date. If you need data for an example I can create one.

For this patient the output would be:
PatientID | Start | End |
PatientA | 9/24/2021 | 11/10/2021|
PatientA | 1/30/2022 | 2/7/2022|
PatientA | 2/13/2022 | 2/19/2022|

Can someone please try to come up with logic to help with this? I was trying by have a running total for the missed treatments, once that hit 5 it was an end. But I couldn't figure out how to determine the starts.

Thank you again! Seems like a fun logic puzzle but I'm just drawing blanks now.

WITH Completes
AS
(
	SELECT PatientID, DueDate
		,CASE
			WHEN
				SUM(Missed) OVER 
					(
						PARTITION BY PatientID 
						ORDER BY DueDate ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
					) < 6
				AND 1 = MAX(Completed) OVER (PARTITION BY PatientID ORDER BY DueDate)
			THEN 1
			ELSE Completed
		END AS Completed
	FROM #temp
)
,Boundaries
AS
(
	SELECT PatientID, DueDate, Completed
		,CASE
			WHEN Completed =
				LAG(Completed) OVER (PARTITION BY PatientID ORDER BY DueDate)
			THEN 0
			ELSE 1
		END AS Boundary
	FROM Completes
)
,Grps
AS
(
	SELECT PatientID, DueDate, Completed
		,SUM(Boundary) OVER (PARTITION BY PatientID ORDER BY DueDate) AS Grp
	FROM Boundaries
)
,Results
AS
(
	SELECT PatientID, Completed
		,MIN(DueDate) AS StartDate
		,MAX(DueDate) AS EndDate
	FROM Grps
	GROUP BY PatientID, Completed, Grp
)
,Ends
AS
(
	SELECT PatientID, Completed, StartDate
		,CASE
			WHEN LEAD(StartDate) OVER (PARTITION BY PatientID ORDER BY StartDate) IS NOT NULL
			THEN EndDate
		END AS EndDate
	FROM Results
)
SELECT PatientID, StartDate, EndDate
FROM Ends
WHERE Completed = 1;
1 Like

Thank you so much Ifor! It's working really well but I found some scenarios it's incorrect on. One issue I wasn't clear about. If the patient's last record is prior to yesterday, that is automatically to be considered a treatment end, if the date is current through yesterday it would be null and considered ongoing (assuming the patient didn't happen to miss their 5th in a row). The other issue seems logic related. Not sure why it failed to capture a treatment end when the patient missed exactly 5 days in a row.

Your result:
PatientID | Start Date | End Date
PatientB | 12/17/20 | null

Expected result:
PatientID | Start Date | End Date
PatientB | 12/17/20 | 12/22/20
PatientB | 12/23/20 | 2/26/21

Data:
drop table if exists #temp
create table #temp
([PatientID] varchar(8),[DueDate] date,[Completed] int,[Missed] int)
insert into #temp
values
('PatientB','12/17/2020',1,0)
,('PatientB','12/18/2020',0,1)
,('PatientB','12/19/2020',0,1)
,('PatientB','12/20/2020',0,1)
,('PatientB','12/21/2020',0,1)
,('PatientB','12/22/2020',0,1)
,('PatientB','12/23/2020',1,0)
,('PatientB','12/24/2020',1,0)
,('PatientB','12/25/2020',1,0)
,('PatientB','12/26/2020',1,0)
,('PatientB','12/27/2020',1,0)
,('PatientB','12/28/2020',1,0)
,('PatientB','12/29/2020',1,0)
,('PatientB','12/30/2020',1,0)
,('PatientB','12/31/2020',1,0)
,('PatientB','1/1/2021',1,0)
,('PatientB','1/2/2021',1,0)
,('PatientB','1/3/2021',1,0)
,('PatientB','1/4/2021',1,0)
,('PatientB','1/5/2021',1,0)
,('PatientB','1/6/2021',1,0)
,('PatientB','1/7/2021',1,0)
,('PatientB','1/8/2021',1,0)
,('PatientB','1/9/2021',1,0)
,('PatientB','1/10/2021',1,0)
,('PatientB','1/11/2021',1,0)
,('PatientB','1/12/2021',1,0)
,('PatientB','1/13/2021',1,0)
,('PatientB','1/14/2021',1,0)
,('PatientB','1/15/2021',1,0)
,('PatientB','1/16/2021',1,0)
,('PatientB','1/17/2021',1,0)
,('PatientB','1/18/2021',1,0)
,('PatientB','1/19/2021',1,0)
,('PatientB','1/20/2021',1,0)
,('PatientB','1/21/2021',1,0)
,('PatientB','1/22/2021',1,0)
,('PatientB','1/23/2021',1,0)
,('PatientB','1/24/2021',1,0)
,('PatientB','1/25/2021',1,0)
,('PatientB','1/26/2021',1,0)
,('PatientB','1/27/2021',1,0)
,('PatientB','1/28/2021',1,0)
,('PatientB','1/29/2021',1,0)
,('PatientB','1/30/2021',1,0)
,('PatientB','1/31/2021',1,0)
,('PatientB','2/1/2021',1,0)
,('PatientB','2/2/2021',1,0)
,('PatientB','2/3/2021',0,1)
,('PatientB','2/4/2021',1,0)
,('PatientB','2/5/2021',1,0)
,('PatientB','2/6/2021',1,0)
,('PatientB','2/7/2021',1,0)
,('PatientB','2/8/2021',1,0)
,('PatientB','2/9/2021',1,0)
,('PatientB','2/10/2021',1,0)
,('PatientB','2/11/2021',1,0)
,('PatientB','2/12/2021',1,0)
,('PatientB','2/13/2021',1,0)
,('PatientB','2/14/2021',1,0)
,('PatientB','2/15/2021',1,0)
,('PatientB','2/16/2021',1,0)
,('PatientB','2/17/2021',1,0)
,('PatientB','2/18/2021',1,0)
,('PatientB','2/19/2021',1,0)
,('PatientB','2/20/2021',1,0)
,('PatientB','2/21/2021',1,0)
,('PatientB','2/22/2021',1,0)
,('PatientB','2/23/2021',1,0)
,('PatientB','2/24/2021',0,1)
,('PatientB','2/25/2021',0,1)
,('PatientB','2/26/2021',0,1)

Thank you again for the help, such a clever solution!

While using multiple CTEs to consecutively apply windowed functions while only reading the table once is potentially very efficient it can be difficult to anticipate edge conditions. ie You need to understand the code and be prepared to alter it.

With regards to your questions:

  1. Your new explaination of the EndDate simplifies the code.
  2. This is an edge condition of 5 misses immediately followed by a complete. Here is a quick attempt which seems to satisfy your data. Any more alterations are up to you.
WITH NumMissed
AS
(
	SELECT PatientID, DueDate, Completed, Missed
		,SUM(Missed) OVER 
			(
				PARTITION BY PatientID 
				ORDER BY DueDate ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
			) AS NumMissed
	FROM #temp
)
,PrevNumMissed
AS
(
	SELECT PatientID, DueDate, Completed, Missed, NumMissed
		,LAG(NumMissed) OVER (PARTITION BY PatientID ORDER BY DueDate) AS PrevNumMissed
	FROM NumMissed
)
,Treatments
AS
(
	SELECT PatientID, DueDate, NumMissed, PrevNumMissed
		,CASE
			WHEN
			(
				(NumMissed < 5 AND NOT PrevNumMissed IS NULL)
				OR (NumMissed = 5
					AND COALESCE(PrevNumMissed, 0) < 5)
			)
			THEN 1
			ELSE Completed
		END AS Treatment
	FROM PrevNumMissed
)
,Boundaries
AS
(
	SELECT PatientID, DueDate, Treatment
		,CASE
			WHEN Treatment =
				LAG(Treatment) OVER (PARTITION BY PatientID ORDER BY DueDate)
				-- This accounts for the new boundary condition
				AND NOT (NumMissed = 4 AND PrevNumMissed = 5)
			THEN 0
			ELSE 1
		END AS Boundary
	FROM Treatments
)
,Grps
AS
(
	SELECT PatientID, DueDate, Treatment
		,SUM(Boundary) OVER (PARTITION BY PatientID ORDER BY DueDate) AS Grp
	FROM Boundaries
)
,Results
AS
(
	SELECT PatientID
		,MIN(DueDate) AS StartDate
		,MAX(DueDate) AS EndDate
	FROM Grps
	WHERE Treatment = 1
	GROUP BY PatientID, Grp
)
SELECT PatientID, StartDate
	,CASE
		WHEN EndDate < DATEADD(day, DATEDIFF(day, 1, CURRENT_TIMESTAMP), 0)
		THEN EndDate
	END AS EndDate
FROM Results
ORDER BY PatientID, StartDate;
1 Like

Thank you! I tested 6 patients and it seems to work perfectly. That's some brilliant SQL sorcery there. I struggle with CTEs so I broke each step down into temp tables so I can try to understand what's happening. Really appreciate the help.