SQLTeam.com | Weblogs | Forums

Help with deleting one of "nearly" duplicate records


#1

Good day all - will start out by letting you know that my experience with SQL is limited to the most basic types of queries, generally run from VBA behind a user form. So I do apologize if this is simply ridiculous or if I'm not explaining it properly ...

I have a table, named News_Indexing_QA, which currently contains less than 20,000 records, but will be growing quickly. As part of the initial clean up, and as ongoing maintenance, I need to be able to identify and remove all but one record where others that are nearly identical exist. The columns that I can use to determine uniqueness are:
NSN_ID
Code
... In other words, the combined values of these two fields should only exist one time (ever).

Would any of you be kind enough to assist me with this? If it matters, I'm using Microsoft SQL Server Management Studio 2008 R2.

Thank you in advance for any guidance.
Regards!


#2

You can use the following query to see which sets have dups, and the number of duplicate entries

SELECT 
	NSN_ID,
	Code,
	COUNT(*) AS DupCount
FROM
	News_Indexing_QA
GROUP BY
	NSN_ID,
	Code
HAVING
	COUNT(*) > 1;

To delete the dups, the following query:

;WITH cte AS
(
	SELECT ROW_NUMBER() OVER( PARTITION BY NSN_ID, Code ORDER BY NSN_ID, Code) AS RN
	FROM News_Indexing_QA
)
DELETE cte WHERE RN > 1;

When there is more than one row for a given combination of NSN_ID and Code, and if you have a preference as to which of those rows to keep, change the ORDER BY clause in the ROW_NUMBER function so that the one that you want to keep comes first in the ordering scheme.


#3

@JamesK, thank you so much for such a quick reply.
I ran the first query and the results are exactly what I'd expect. When I run the second one, as you have it without modifying anything, will it delete all of the "near dupes" or will it retain one?

FYI, it makes no difference to me which are retained, as long as at least one remains in the db.

Thank you again!


#4

It will retain one of the dups. So after the delete you will still end up with the same number of distinct combinations of NSN_ID and Code.

Run the query shown below:

SELECT ROW_NUMBER() OVER( PARTITION BY NSN_ID, Code ORDER BY NSN_ID, Code) AS RN,
*
FROM News_Indexing_QA
ORDER BY NSN_ID, Code

That will show you all the rows and columns in the table, along with a new column RN. When you run the original query I posted, you are going to be deleting all the rows that have RN > 1


#5

You rock... thank you so much!!!!