I have a query to find duplicate document records in my db and it works fine, finding 25,000 records in 3 seconds. What I now need to do is get the earliest duplicate record only for each set of duplicates based on the ORDNO field. What is the best way to efficiently return these?
This is my query currently:
SELECT dh.[DATE]
,dh.ORDNO
,dh.DOCTYPE
,dh.DOCNAME
,da.RECID
,(DATALENGTH(da.DOCUMENT) / 1024) / 1024 AS SIZE
FROM ihdocarchive.dbo.DocArchive da
LEFT JOIN (
SELECT DOCHIST.*
,COUNT(*) OVER (
PARTITION BY ORDNO
,DOCNAME
) AS cnt
FROM DOCHIST
LEFT JOIN CONTRACTS ON CONTRACTS.CONTNO = DOCHIST.ORDNO
WHERE DOCHIST.DOCTYPE = 35
AND DOCHIST.DOCNAME = 'Delivery Confirmation'
AND CONTRACTS.[STATUS] = 4
) dh ON dh.RECID = da.PARID
WHERE cnt > 1
ORDER BY ORDNO
,[DATE] ASC
Here are some example results:
+-------------------------+------------+---------+-----------------------+----------------------------------+
| DATE | ORDNO | DOCTYPE | DOCNAME | RECID |
+-------------------------+------------+---------+-----------------------+----------------------------------+
| 2016-03-04 10:03:07.000 | 0000300007 | 35 | Delivery Confirmation | 20DC84F6FD5F4A3FB9E8786F62EDA35E |
| 2016-03-07 17:07:01.000 | 0000300007 | 35 | Delivery Confirmation | 4804E3FE819A490C90741A7BFD7E9754 |
| 2016-03-14 11:13:11.000 | 0000300007 | 35 | Delivery Confirmation | 7AFABA2B9B42419B981F23F5FCE62705 |
| 2016-03-04 09:55:55.000 | 0000300024 | 35 | Delivery Confirmation | CE71A9D496CF410AB74363FBCA20B9C8 |
| 2016-03-04 09:57:34.000 | 0000300024 | 35 | Delivery Confirmation | 2A130C9E0E404D999CE2952511DF410D |
| 2016-03-01 15:54:35.000 | 0000300031 | 35 | Delivery Confirmation | F1B4836FDCA14299A6D80E72C81F73C4 |
| 2016-04-25 09:23:09.000 | 0000300031 | 35 | Delivery Confirmation | 85E285304D114286B3E352505B0B0F74 |
+-------------------------+------------+---------+-----------------------+----------------------------------+
From these results, I would only want to see the row dated 04/03/2016 10:03 for ORDNO 0000300007, and only the row dated 04/03/2016 09:55 for ORDNO 0000300024 etc. I need to see all the columns in the result.
How can I do this without slowing the query too much?
Any help would be appreciated.
Many thanks
Martyn