Need to remove duplicate begin time

Pasi · August 25, 2015, 5:26pm

HI,

I have below query but when I run it I get duplicate apt time (begin time) and I need to remove dups if begin time is the same for the same patient, show only one.
Thanks,
Pasi.

select distinct location_name,pm.description,
CASE
WHEN a.appt_kept_ind = 'Y' THEN 'Kept'
WHEN a.cancel_ind = 'Y' THEN 'Cancelled'
WHEN a.appt_status = 'P' THEN 'Pending'
ELSE 'Expected'
END As 'Status',
p.cell_phone,
a.appt_date,
begintime,
pm.description,
a.description As 'Pat Name',
event,a.home_phone, a.user_defined1 As 'Reason for vist',a.create_timestamp
from appointments a
inner join location_mstr lm on lm.location_id = a.location_id
inner join events b on b.event_id = a.event_id
inner join person p on p.person_id=a.person_id
inner join provider_mstr pm on pm.provider_id = a.rendering_provider_id
where not b.event like '%flu%'and a.cancel_ind ='N'
and a.appt_date between '20150828' and '20150828'

JamesK · August 25, 2015, 5:43pm

Two rows are considered DISTINCT if even one column is different. So if you are getting two rows with the same patient_id and begintime, that means there are one more other columns in the select list that are different. Which of those two different values do you want to keep?

For example, assume that in the two duplicate rows all columns are the same, except for a.create_timestamp. Then, you can eliminate the duplicate row like in this example

SELECT * FROM
(
	SELECT DISTINCT
			location_name ,
			pm.description ,
			CASE WHEN a.appt_kept_ind = 'Y' THEN 'Kept'
				 WHEN a.cancel_ind = 'Y' THEN 'Cancelled'
				 WHEN a.appt_status = 'P' THEN 'Pending'
				 ELSE 'Expected'
			END AS 'Status' ,
			p.cell_phone ,
			a.appt_date ,
			begintime ,
			pm.description ,
			a.description AS 'Pat Name' ,
			event ,
			a.home_phone ,
			a.user_defined1 AS 'Reason for vist' ,
			a.create_timestamp,
			ROW_NUMBER() OVER (PARTITION BY p.person_id, begintime ORDER BY a.create_timestamp DESC) AS RN
	FROM    appointments a
			INNER JOIN location_mstr lm ON lm.location_id = a.location_id
			INNER JOIN events b ON b.event_id = a.event_id
			INNER JOIN person p ON p.person_id = a.person_id
			INNER JOIN provider_mstr pm ON pm.provider_id = a.rendering_provider_id
	WHERE   NOT b.event LIKE '%flu%'
			AND a.cancel_ind = 'N'
			AND a.appt_date BETWEEN '20150828' AND '20150828'
) s WHERE RN = 1;

Pasi · August 25, 2015, 5:50pm

Thanks James! good points. when I run the query I get this error:

Msg 8156, Level 16, State 1, Line 31
The column 'description' was specified multiple times for 's'.

Pasi · August 25, 2015, 5:55pm

Never mind I fixed it!

Kristen · August 25, 2015, 6:24pm

You need to decide which one, of the duplicates, to show.

As JamesK's code stands it will pick one at random (where person_id and begintime match it will take the one with the earliest create_timestamp, but if there are multiple records with all three columns t he same then the one chosen will be random - and may well be different next time you run the report / query.

I recommend that you include the Primary Key (or some other Unique) column(s) as a tie break on the ORDER BY (in the OVER statement). If any of the JOIN'd tables might yield more than one row then include the Primary Keys columns of ALL such JOIN'd tables too

Personally I would also want to "cure" the cause of the duplicates too!

Pasi · August 25, 2015, 6:45pm

Thanks Kristen,, when I ran the query it gave me more counts and I had to get ride of begin time, I am getting better counts now. Thank you both!

Pasi · August 25, 2015, 8:14pm

James,

what if I group it by and then say something like "having count(begintime) =1"? I just don't know how /what to group it by?
Pasi.

JamesK · August 25, 2015, 8:36pm

You can use GROUP BY, but that may not exactly be what you want. If you want to use GROUP BY, the rule you have to remember that EVERY column that is in the select list that is outside of an aggregate function such as MAX, SUM etc. should be listed in the GROUP BY list.

So, for example, if I wanted to pick only one value for the last two columns, I would do something like shown below. Every other column is listed in the group by section. The problem with this is that the MAX(a.user_defined1) may come from one row, and MAX(a.user_defined1) may come from another row. Most of the time, this is not what you want.

So my preference would be to use the ROW_NUMBER function as I had described earlier. You can list as many columns in the ORDER BY clause to make your selections unique, and you have control over which row gets picked based on that.

SELECT 
		location_name ,
		pm.description ,
		CASE WHEN a.appt_kept_ind = 'Y' THEN 'Kept'
			 WHEN a.cancel_ind = 'Y' THEN 'Cancelled'
			 WHEN a.appt_status = 'P' THEN 'Pending'
			 ELSE 'Expected'
		END AS 'Status' ,
		p.cell_phone ,
		a.appt_date ,
		begintime ,
		pm.description ,
		a.description AS 'Pat Name' ,
		event ,
		a.home_phone ,
		MAX(a.user_defined1) AS 'Reason for vist' ,
		MAX(a.create_timestamp) AS create_timestamp
FROM    appointments a
		INNER JOIN location_mstr lm ON lm.location_id = a.location_id
		INNER JOIN events b ON b.event_id = a.event_id
		INNER JOIN person p ON p.person_id = a.person_id
		INNER JOIN provider_mstr pm ON pm.provider_id = a.rendering_provider_id
WHERE   NOT b.event LIKE '%flu%'
		AND a.cancel_ind = 'N'
		AND a.appt_date BETWEEN '20150828' AND '20150828'
GROUP BY
		location_name ,
		pm.description ,
		CASE WHEN a.appt_kept_ind = 'Y' THEN 'Kept'
			 WHEN a.cancel_ind = 'Y' THEN 'Cancelled'
			 WHEN a.appt_status = 'P' THEN 'Pending'
			 ELSE 'Expected'
		END,
		p.cell_phone ,
		a.appt_date ,
		begintime ,
		pm.description ,
		a.description,
		event ,
		a.home_phone

Pasi · August 25, 2015, 9:11pm

Thank! will try this

JamesK · August 25, 2015, 9:14pm

What is the motivation for going with the GROUP BY approach in spite of the pitfalls that I had pointed out earlier? Are you having difficulties with getting the row_number method to work correctly?

Pasi · August 25, 2015, 9:51pm

Nothing just wanted to see what I get with that method.

Kristen · August 26, 2015, 11:09am

I have always imagined that SQL has to "work much harder" with a GROUP BY that included a lot of columns (such that we tended to use a GROUP BY on just the critical key fields, together with any Aggregate functions we needed, AND a MIN(xxx) on any tie-break column and then JOIN that back to the record to get any other columns the report needed.

That said, I've never done any benchmarks to see if it was "a lot of work" for SQL to have a GROUP BY on loads of columns, particularly wide VARCHAR columns.

I now find that the (more modern style) row_number method is a lot easier to work with in that regard

Pasi · August 26, 2015, 3:56pm

I believe so cause you have to declare all columns and then like you said , it works harder to get all the columns group them by and so on...

jcelko · August 29, 2015, 2:49am

I wish you had followed basic forum Netiquette and posted DDL. The problem is that your design is wrong. You still write code with 1950's bit flags! ARRGH

Your appoints are a classic state transition problem; I have to make an appointment before I can cancel or keep it; I have to keep that appointment at a single point in time.

Pasi · September 1, 2015, 8:19pm

Sorry but not everyone is like you Mr Bill gates! ARRGH