How to delete duplicates values based on a column value

Hi, I need some help if possible on how I can delete values if they are duplicated in my table. I can identify rows in my data that a duplicated using the code below, but how do I remove the duplicated call id?

select CallID
from gbr_calldetail
group by callid
Having count(callid) > 1

Thanks in advance

David

;WITH cte_deletes AS (
    select CallID, row_number() over(partition by CallID order by newid()) as row_num
    from gbr_calldetail
)
DELETE FROM cte_deletes
WHERE row_num > 1
2 Likes

Thank you Scott, much appreciated.

I have to ask, Scott... Why are you using an ORDER BY on a Random GUID value? That's just going to slow things down.

Scott is absolutely on the right track on this timeless problem (it was the first "problem" I had to solve way back in SQL Server 7 and it was for removing duplicate CDRs, as well).

The only issue is that the ORDER BY of the ROW_NUMBER() function can't be absent and it cannot be a constant/literal and so Scott added a function, which is one of the ways to pull it off without having to use a column from the table for the sort order.

Unfortunately, sorts are very expensive and sorting by a Random 16 byte GUID is relatively quite expensive and that expense isn't necessary. You can do the same thing and have it run faster (sometimes, a lot faster) by simulating a constant. In the following, we use a SELECT NULL to simulate the constant, which is also a grand indication that we're just satisfying a syntactical requirement and doing so at the lowest sort cost possible.

Here's the code that I'd use for this...

   WITH cte_deletes AS 
(
 SELECT  CallID
        ,RowNum =  ROW_NUMBER() OVER(PARTITION BY CallID ORDER BY (SELECT NULL))
   FROM dbo.gbr_calldetail
)
 DELETE FROM cte_deletes
  WHERE RowNum > 1
;

You can also do a little trick with @@SPID, which is shorter and is probably a bit faster still, but it might lead to a bit of confusion if a neophyte has to figure out what the code is doing.

   WITH cte_deletes AS 
(
 SELECT  CallID
        ,RowNum =  ROW_NUMBER() OVER(PARTITION BY CallID ORDER BY @@SPID)
   FROM dbo.gbr_calldetail
)
 DELETE FROM cte_deletes
  WHERE RowNum > 1
;

Having cut my teeth in SQL on CDR processing, I'll also state that there's usually some for of "created on" column (in other words, a "record" date) and most PUCs demand that you keep the earliest "record" of any call (if it's for a LEC or CLEC or other billing entity). If that's true for you, then you should use whatever "created on" date column you may have as the object of the ORDER BY in the ROW_NUMBER function.

1 Like

But NULL is indeterminate so end you end up with dup keys. Itzik Ben-Gan has written about dangers of indeterminate sorts, although I'm not sure if it applies in this specific case.

If there's a unique value in the table that is the lead column in the clustering key, specify that instead, since SQL will "know" it doesn't have to actually do a sort to return the rows in order.

Correct on the indeterminate sort thing but even Itzik uses (SELECT NULL) and other things for this type of thing. It's a DUPE check and on rows with no temporal indications (I DID post about that above), it doesn't matter which one is kept.

The CallID in this case is supposed to be the PK and that's why the OP deleting dupes based on that column. Since they're trying to delete dupes, it's probably an import of CDRs and, since I've been on the receiving end of that a whole lot, I can tell you that AT&T duplicates CDRs all the bloody time. They're not the only vendor that duplicates CDRs, either.

If you insist on making the ORDER BY determinant even though no column from the table is required, then use NEWSEQUENTIALID() instead of NEWID() but, seriously, it's just not required for this type of dupe check.

1 Like