Hi Guys,
I want to remove duplicates from the table. The duplicate is a typo in the first and last name.
Here is the sample script
CREATE TABLE test_dup(
f_name nvarchar(20)
,l_name nvarchar(20)
,state nvarchar(20)
)
INSERT INTO dbo.test_dup
(
f_name,
l_name,
state
)
SELECT 'James','Smith','CA'
UNION all
select 'Smith','James','CA'
UNION all
select 'a','m','CA'
UNION all
select 'j','k','NY'
select * from dbo.test_dup
Here is the sample data
fname,lname,state
James,Smith,CA
Smith,James,CA
a,m,CA
j,k,NY
Here is what I want. (Only three unique records)
f_name,l_name,state
James,Smith,CA
a,m,CA
j,k,NY
Please let me know how can I accomplish this.
Thanks.
for the duplicates what makes the one in your desired list the one that gets picked? Maybe there is a created date or some other unique missing field?
with src
as
(
SELECT 'James' as fn,'Smith' as ln,'CA' as c
UNION all
select 'Smith','James','CA'
UNION all
select 'a','m','CA'
UNION all
select 'j','k','NY'
)
select top 1 t.*
From src src
join src t on src.fn = t.ln
and src.ln = t.fn
and src.c = t.c
union
select src.*
From src src
left join src t on src.fn = t.ln
and src.ln = t.fn
and src.c = t.c
where t.fn is null
and t.ln is null and t.c is null
Like was said over at SQLServerCentral.com, this is a really bad idea. As you said, the data is wrong. You need to convince management that the data needs to be fixed.
1 Like
WARNING! BEFORE YOU USE THE CODE LISTED ABOVE, PLEASE READ THE FOLLOWING!
If someone is looking for a code solution to ignore duplicates then, sorry, the code above shouldn't be it. You will end up leaving out people that have legitimate names.
For example, I have a very good friend on a different site by the name of Michael John. I also have a work acquaintance whose name is John Michael.
If you have known "typo's", then fix the data instead of making this terrible mistake.
To check for possible such typo's as candidate rows that need manual verification, then do something like the following (I played this against the AdventureWorks database) and then manually verify whether or not the entries are correct or not You might even want to add a HasBeenVerified column to your table so they don't show up in such validation queries in the future.
Here's the code...
SELECT lftBusinessEntityID = lft.BusinessEntityID
,rgtBusinessEntityID = rgt.BusinessEntityID
,lftFirstName = lft.FirstName
,lftLastName = lft.LastName
,rgtFirstName = rgt.FirstName
,rgtLastName = rgt.LastName
FROM person.Person lft
JOIN person.Person rgt
ON lft.FirstName = rgt.LastName
AND lft.LastName = rgt.FirstName
AND lft.BusinessEntityID < rgt.BusinessEntityID
ORDER BY lftFirstName,rgtLastName
,lftLastName ,rgtFirstName
;
... and here's the output from that code...
ALL of those First/Last name combinations are valid and common name combinations. Ignoring any of them with your code can be a serious mistake on your part. You MUST manually verify which ones are valid and FIX the ones than aren't.
2 Likes
A bit over dramatic but agree with you @JeffModen 100%
Maybe OP can use the query to find these duplicates instead of using it to remove duplicates
1 Like
If you've been through what I've been through with this type of mistake by others in companies that I've worked for, you wouldn't think it was over dramatic. In fact, you'd be congratulating me on my restraint.
1 Like