SQLTeam.com | Weblogs | Forums

Most efficient way of finding and storing where there are duplicate entries

tsql

#1

Hi

I'm just learning about CTEs and the WITH statement. I find them trickier to write at the moment, but which would you say is the best solution for finding duplicate values in my table rows? :-

```sql
-- Output plan 1:

SELECT ID4, DATERECORD INTO TABLEMIX2
FROM TABLEMIX
GROUP BY ID4, DATERECORD
HAVING COUNT(ID3)>1 AND COUNT(ID4)>1 AND COUNT(DATERECORD)>1 -- ...10 more AND COUNT predicates to follow
ORDER BY ID4

-- Output plan 2:

WITH TESTTABLE_CTE (ID3, ID4, DATERECORD)
AS
(
    SELECT ID3, ID4, DATERECORD
    FROM dbo.TABLEMIX
    GROUP BY ID3, ID4, DATERECORD
    HAVING COUNT(ID3)>1 AND COUNT(ID4)>1 AND COUNT(DATERECORD)>1 -- ...10 more AND COUNT predicates to follow
)
INSERT INTO TESTTABLE3 (ID3, ID4, DATERECORD)
SELECT ID3, ID4, DATERECORD FROM TESTTABLE_CTE

-- Output plan 3:

WITH TABLEMIX_CTE (ROWNUM, ID3, ID4, DATERECORD)
AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY ID3, ID4, DATERECORD ORDER BY ID4) AS ROWNUM,
           ID3, ID4, DATERECORD -- ...10 more columns to follow
    FROM TABLEMIX
)
INSERT INTO TESTTABLE3 (ID3, ID4, DATERECORD)
SELECT ID3, ID4, DATERECORD FROM TABLEMIX_CTE
WHERE ROWNUM = 2 -- picks out any duplicated entry: if there are 2 or more copies, one row is output for the new table
```

Plan 1 also creates the table first if it doesn't already exist.
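As a side note on plan 3: filtering on `ROWNUM = 2` returns one marker row per duplicated group, but if the goal is ever to list every extra copy (triplicates and beyond), the usual variant filters on `ROWNUM > 1`. A sketch only, reusing the table and column names assumed from the post above:

```sql
-- Sketch only: lists every copy beyond the first in each duplicate group.
-- TABLEMIX, ID3, ID4 and DATERECORD are the names from the post above.
WITH TABLEMIX_CTE AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY ID3, ID4, DATERECORD ORDER BY ID4) AS ROWNUM,
           ID3, ID4, DATERECORD
    FROM TABLEMIX
)
SELECT ID3, ID4, DATERECORD
FROM TABLEMIX_CTE
WHERE ROWNUM > 1; -- every duplicate copy, not just one row per group
```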

Cheers
James


#2

As I stated on the other forum, "It Depends". What do you want to do with the duplicates once you find them?


#3

Hi Jeff
I've opted for my Plan 3 now. I've started to like using the WITH statement, which is another reason to use it rather than the GROUP BY solution.

Cheers
James


#4

I don't usually ask questions like I did unless it's REALLY important to your post. The reason why I asked what you're going to do with the dupes once you find them is because ALL of the code you listed is EXTREMELY dangerous if you make the mistake of converting them to DELETEs and you only want to delete dupes and not the originals.
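For context, the usual "safe" version of that kind of delete numbers the rows within each duplicate group and deletes through the CTE itself, so rows numbered 1 (the originals) survive. A sketch only, with table and column names assumed from the earlier posts:

```sql
-- Sketch: deletes duplicate copies but keeps one original per group.
-- TABLEMIX and its columns are assumed from the earlier posts.
WITH DUPES AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY ID3, ID4, DATERECORD ORDER BY ID4) AS RN
    FROM TABLEMIX
)
DELETE FROM DUPES
WHERE RN > 1; -- rows with RN = 1 are left untouched
```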


#5

Hi Jeff
I appreciate your concern. When I first started posting topics I was intending to remove the duplicate rows. Since then I've found a stored procedure to correctly unfile the records, which removes them safely and works fine. The user is now refiling them, and I've found that doesn't put the duplicate entries back. I've also realised that it's unsafe to delete duplicate rows manually, especially as it could break the integrity of the database, so I won't be going down that road. Having said that, it was a worthwhile exercise to look into, as I'm now starting to pick up three new tools: WITH, bcp.exe, and cursors (FETCH) with an exec command to make the process of unfiling duplicates a bit more automated in future.
I've now also found the root cause of the duplicates - the user was using some 3rd party software that would rapidly auto-file their records, which gets into trouble when filing more than about 50 records at once.
Quick question for you - I found that if I put say a SELECT statement above the WITH statement I'd get a syntax error - any idea why that happens?
Cheers
James


#6

In this case, let's start calling "WITH" by its proper name... CTE, which stands for "Common Table Expression".

As for using a Cursor for your deletes, why do you think you need a cursor? There is one good reason but deleting 1 row at a time usually isn't it.

You also say that if you put a SELECT statement above the WITH statement, you get a syntax error. Did you terminate the SELECT with a semi-colon? CTEs require that the previous statement be properly terminated with a semi-colon.
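To illustrate the semi-colon point, a minimal sketch:

```sql
-- Fails in T-SQL: the statement before WITH is not terminated.
-- SELECT 1 AS X
-- WITH C AS (SELECT 1 AS N) SELECT N FROM C

-- Works: the preceding statement ends with a semi-colon.
SELECT 1 AS X;
WITH C AS (SELECT 1 AS N)
SELECT N FROM C;
```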


#7

Hi Jeff
I've created a table in a separate database just containing the unique IDs where there is a duplicate. I have found a proven stored procedure to unfile the unique IDs. I had an idea to schedule a weekly job to regenerate the table of DISTINCT IDs and then run the exec command against all the IDs found. There are only about 20 duplicates left now, and I don't think I'm going to break the integrity of the database by using the cursor. It's also a maintenance-type task, so hopefully that's enough of a reason to use a cursor, unless you know of an easier way :blush:

Cheers
James


#8

Why do you think a "maintenance type task" actually justifies the use of a cursor? And, it's not a matter of breaking database integrity for 20 rows either. Actually, for someone who claims to be new at this, you come across a bit chippy.

If you are truly new at databases, then do yourself a favor... pretend you're practicing the piano to be a concert pianist and always try to hit the right notes instead of making excuses for hitting the wrong ones. There will be times in the future where the use of a cursor is the right thing (and this isn't one of them), but just pretend that cursors don't exist for the next 2-3 years and you'll get much better at writing code a lot faster, because you won't always be falling back on only what you know.

I also have to ask...why would anyone use a cursor to delete 20 rows?


#9

Hi Jeff
I'm not just deleting 20 rows of a table. Suppose I have a reference table I've created especially to hold 20 ID numbers called DUPETB. I also have a reliable stored procedure called SPUNFILERESULTS which will unfile the duplicates throughout the database referencing at least 5 other tables. I could manually look down the DUPETB table and unfile the duplicate records one by one like this:-
```sql
exec SPUNFILERESULTS 13579
exec SPUNFILERESULTS 32145
exec SPUNFILERESULTS 29675
```
And repeat this exercise another 17 times, which isn't too big a job, I do admit.
However, the customer asked me to make this an automated process, which sounded like a good challenge to me as I'm trying to improve my SQL in any case. I then searched Google and saw that cursors were mentioned. Other websites somehow used views to do the same task, but for whatever reason I don't know much about views.
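For what it's worth, the cursor loop being described here would look roughly like this (DUPETB, REPORTID and SPUNFILERESULTS are the names from this post; a sketch only, not tested against the real schema):

```sql
-- Sketch of the cursor loop described above; names taken from the post.
DECLARE @ReportId INT;

DECLARE DupeCursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT REPORTID FROM DUPETB;

OPEN DupeCursor;
FETCH NEXT FROM DupeCursor INTO @ReportId;

WHILE @@FETCH_STATUS = 0
BEGIN
    EXEC SPUNFILERESULTS @ReportId; -- the stored procedure that unfiles one report
    FETCH NEXT FROM DupeCursor INTO @ReportId;
END;

CLOSE DupeCursor;
DEALLOCATE DupeCursor;
```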
I've been working in IT for nearly 20 years and decided it's time to take my skills up several levels - due to my lack of SQL skills I tend to talk to a customer more than a developer would so I'm used to finding other ways of doing things rather than just relying on understanding all the code.
So do you think I should still use cursors?
No pressure but I need to have all this wrapped up by Tuesday at the latest :smile:
Cheers
James


#10

Are the other 5 tables that you reference child tables of where you're finding the duplicates?

And, no... I wouldn't use cursors for this but since you have "reliable" code that's working one row at a time and you need this to be wrapped up by Tuesday, there's not a hell of a lot of time for you to do anything else.

As a bit of a sidebar, there's something fundamentally wrong with the whole system if it's accumulating duplicates that need to be deleted from time to time.


#11

Hi Jeff
Yes, that's correct, the other 5 tables are child tables containing duplicates, hundreds of times over. I think it's a fairly isolated case, as I've checked about 6 other customers at random and they don't have this kind of issue. I think it's some auto-macro freeware that's causing the issue, which the customer shouldn't be using, to save time filing their reports manually one by one.
Although I need to have this wrapped up by tomorrow, there's nothing stopping me from working on it more to come up with a better solution. By working on it more I will be both doing my work and learning at the same time. Please let me know what other solution I could put in place.
Cheers
James


#12

So, the "Main" table contains duplicates and the 5 child tables contain duplicates. Are the duplicates in the child table just based on the fact that the "Main" table has duplicates (by key column) or do they have a whole 'nuther problem?

In other words, if we identify duplicate rows in the "Main" table and we were to delete all rows in the child tables associated with the PK in the Main table, does that correctly delete all of the duplicates in the child tables?


#13

Hi Jeff
The "Main" table is really a custom table I created which just has one column called REPORTID, containing a list of all the unique duplicate IDs. The child tables are within the customer's database. There might be more than 5 child tables - the unfiling of the reports is all handled by the stored procedure successfully.

So yes, if I were to run the stored procedure against any of the REPORTIDs in my custom table, it would successfully "unfile" all of the report along with any of the duplicates. I've checked the code in the stored procedure and found that it appears to update the child tables rather than delete from them. The main thing is that the outcome is good, as the customer will find that the record no longer hangs their system when they access it using our application - even if they decide to refile a report again. To make this process more automated, what would you do?

Cheers
Matt


#14

Can't tell. You've not posted the DDL for the tables and the related indexes. Can't even make a guess without that.

So... based on your above and then the previous posts, are you "James" or "Matt"?


#15

Hi Jeff
I can't send you the DDL - I've signed a contract so I'm not able to give that much information. Also, I prefer to be called James, but some people call me Matt as well - it's a long story, so I'll save that for another day :)
I'd just like to say thanks for all your help so far with this topic and any others. I think this topic has gone as far as it can for me now.
Cheers
James