Select Only the Oldest Duplicate Record

I have a query to find duplicate document records in my db and it works fine, finding 25,000 records in 3 seconds. What I now need to do is get the earliest duplicate record only for each set of duplicates based on the ORDNO field. What is the best way to efficiently return these?

This is my query currently:

SELECT dh.[DATE]
	,dh.ORDNO
	,dh.DOCTYPE
	,dh.DOCNAME
	,da.RECID
	,(DATALENGTH(da.DOCUMENT) / 1024) / 1024 AS SIZE
FROM ihdocarchive.dbo.DocArchive da
LEFT JOIN (
	SELECT DOCHIST.*
		,COUNT(*) OVER (
			PARTITION BY ORDNO
			,DOCNAME
			) AS cnt
	FROM DOCHIST
	LEFT JOIN CONTRACTS ON CONTRACTS.CONTNO = DOCHIST.ORDNO
	WHERE DOCHIST.DOCTYPE = 35
		AND DOCHIST.DOCNAME = 'Delivery Confirmation'
		AND CONTRACTS.[STATUS] = 4
	) dh ON dh.RECID = da.PARID
WHERE cnt > 1
ORDER BY ORDNO
	,[DATE] ASC

Here are some example results:

+-------------------------+------------+---------+-----------------------+----------------------------------+
|          DATE           |   ORDNO    | DOCTYPE |        DOCNAME        |              RECID               |
+-------------------------+------------+---------+-----------------------+----------------------------------+
| 2016-03-04 10:03:07.000 | 0000300007 |      35 | Delivery Confirmation | 20DC84F6FD5F4A3FB9E8786F62EDA35E |
| 2016-03-07 17:07:01.000 | 0000300007 |      35 | Delivery Confirmation | 4804E3FE819A490C90741A7BFD7E9754 |
| 2016-03-14 11:13:11.000 | 0000300007 |      35 | Delivery Confirmation | 7AFABA2B9B42419B981F23F5FCE62705 |
| 2016-03-04 09:55:55.000 | 0000300024 |      35 | Delivery Confirmation | CE71A9D496CF410AB74363FBCA20B9C8 |
| 2016-03-04 09:57:34.000 | 0000300024 |      35 | Delivery Confirmation | 2A130C9E0E404D999CE2952511DF410D |
| 2016-03-01 15:54:35.000 | 0000300031 |      35 | Delivery Confirmation | F1B4836FDCA14299A6D80E72C81F73C4 |
| 2016-04-25 09:23:09.000 | 0000300031 |      35 | Delivery Confirmation | 85E285304D114286B3E352505B0B0F74 |
+-------------------------+------------+---------+-----------------------+----------------------------------+

From these results, I would only want to see the row dated 04/03/2016 10:03 for ORDNO 0000300007, and only the row dated 04/03/2016 09:55 for ORDNO 0000300024 etc. I need to see all the columns in the result.

How can I do this without slowing the query too much?

Any help would be appreciated.

Many thanks
Martyn

You can add Row_number() over (partition by Dup Columns order by Date). Then select all where this = 1

You have a LEFT JOIN which isn't a left join as you have a WHERE cnt > 1, so it is the same as an inner join (also you haven't prefixed cnt with the table alias, is it 'dh'?).
Does this work?

SELECT dh.[DATE],
       dh.ORDNO,
       dh.DOCTYPE,
       dh.DOCNAME,
       da.RECID,
       (DATALENGTH(da.DOCUMENT) / 1024) / 1024 AS SIZE
  FROM ihdocarchive.dbo.DocArchive da
 CROSS APPLY (SELECT TOP(1) 
                     DOCHIST.*
	            FROM DOCHIST
	            LEFT JOIN CONTRACTS 
                       ON CONTRACTS.CONTNO = DOCHIST.ORDNO
	           WHERE DOCHIST.DOCTYPE = 35
		         AND DOCHIST.DOCNAME = 'Delivery Confirmation'
                 AND CONTRACTS.[STATUS] = 4
                 AND dh.RECID = da.PARID
              HAVING COUNT(*) OVER (PARTITION BY ORDNO, DOCNAME) > 1
	        ) dh          
 ORDER BY ORDNO, [DATE] ASC

Thanks but I get 3 error messages with this:

Msg 8121, Level 16, State 1, Line 17
Column 'DOCHIST.ORDNO' is invalid in the HAVING clause because it is not contained in either an aggregate function or the GROUP BY clause.
Msg 8121, Level 16, State 1, Line 17
Column 'DOCHIST.DOCNAME' is invalid in the HAVING clause because it is not contained in either an aggregate function or the GROUP BY clause.
Msg 4108, Level 15, State 1, Line 17
Windowed functions can only appear in the SELECT or ORDER BY clauses.

Thanks, I tried like this:

SELECT dh.[DATE],
	   dh.ORDNO,
       dh.DOCTYPE,  
	   dh.DOCNAME,
	   da.RECID,
	   (DATALENGTH(da.DOCUMENT)/1024)/1024 AS SIZE
FROM   ihdocarchive.dbo.DocArchive da
LEFT JOIN
(SELECT DOCHIST.*,
			   ROW_NUMBER() OVER (PARTITION BY ORDNO, DOCNAME ORDER BY [DATE]) rn
        FROM  DOCHIST LEFT JOIN CONTRACTS ON CONTRACTS.CONTNO = DOCHIST.ORDNO WHERE DOCHIST.DOCTYPE = 35 AND DOCHIST.DOCNAME = 'Delivery Confirmation' AND CONTRACTS.[STATUS] = 4) dh on dh.RECID = da.PARID

WHERE rn = 1

ORDER BY ORDNO, [DATE] ASC

But this returns a record for every ORDNO that has a Delivery Confirmation, not just those ORDNO that have more than one Delivery Confirmation. My original query returned 24,000 records, so I was expecting less than half this number. What am I missing?

You could be having more columns than required
Thats when this happens

Example
select col1 , max() from
if its
select col1 ,col2 , col3 , max() from ..this will return a lot more rows

1 Like

If you provide DDL and sample data, we can probably help you out. Your Row_number function is partitioning by OrdNo and DocName, so you won't get 1 record per OrdNo, but you'll get 1 record per OrdNo and DocName

2 Likes

WITH ContractsWithTheMinOrdNO (OrdNo, [Date])
AS
(
SELECT dh.ORDNO, MIN(dh.[Date])
DOCHIST
LEFT JOIN CONTRACTS ON CONTRACTS.CONTNO = DOCHIST.ORDNO
WHERE DOCHIST.DOCTYPE = 35
AND DOCHIST.DOCNAME = 'Delivery Confirmation'
AND CONTRACTS.[STATUS] = 4
)
SELECT dh.[DATE]
,dh.ORDNO
,dh.DOCTYPE
,dh.DOCNAME
,da.RECID
,(DATALENGTH(da.DOCUMENT) / 1024) / 1024 AS SIZE
FROM ihdocarchive.dbo.DocArchive da
LEFT JOIN (
SELECT DOCHIST.*
,COUNT(*) OVER (
PARTITION BY ORDNO
,DOCNAME
) AS cnt
FROM DOCHIST
LEFT JOIN CONTRACTS ON CONTRACTS.CONTNO = DOCHIST.ORDNO
WHERE DOCHIST.DOCTYPE = 35
AND DOCHIST.DOCNAME = 'Delivery Confirmation'
AND CONTRACTS.[STATUS] = 4
) dh ON dh.RECID = da.PARID
INNER JOIN ContractsWithTheMinOrdNO ON cnt.OrdNo=ContractsWithTheMinOrdNO.OrdNo AND
cnt.[Date] =ContractsWithTheMinOrrdNO.[Date]

something like this should work. Maybe a few syntac errors but google for common table expressions should do the trick.

WITH common_table_expression (Transact-SQL) - SQL Server | Microsoft Docs

Many thanks, I was able to get it working as I wanted when I used only the column ORDNO.