I have a query to find duplicate document records in my db and it works fine, finding 25,000 records in 3 seconds. What I now need to do is get the earliest duplicate record only for each set of duplicates based on the ORDNO field. What is the best way to efficiently return these?
This is my query currently:
SELECT dh.[DATE]
,dh.ORDNO
,dh.DOCTYPE
,dh.DOCNAME
,da.RECID
,(DATALENGTH(da.DOCUMENT) / 1024) / 1024 AS SIZE
FROM ihdocarchive.dbo.DocArchive da
LEFT JOIN (
SELECT DOCHIST.*
,COUNT(*) OVER (
PARTITION BY ORDNO
,DOCNAME
) AS cnt
FROM DOCHIST
LEFT JOIN CONTRACTS ON CONTRACTS.CONTNO = DOCHIST.ORDNO
WHERE DOCHIST.DOCTYPE = 35
AND DOCHIST.DOCNAME = 'Delivery Confirmation'
AND CONTRACTS.[STATUS] = 4
) dh ON dh.RECID = da.PARID
WHERE cnt > 1
ORDER BY ORDNO
,[DATE] ASC
From these results, I would only want to see the row dated 04/03/2016 10:03 for ORDNO 0000300007, and only the row dated 04/03/2016 09:55 for ORDNO 0000300024 etc. I need to see all the columns in the result.
How can I do this without slowing the query too much?
You have a LEFT JOIN which isn't a left join as you have a WHERE cnt > 1, so it is the same as an inner join (also you haven't prefixed cnt with the table alias, is it 'dh'?).
Does this work?
SELECT dh.[DATE],
dh.ORDNO,
dh.DOCTYPE,
dh.DOCNAME,
da.RECID,
(DATALENGTH(da.DOCUMENT) / 1024) / 1024 AS SIZE
FROM ihdocarchive.dbo.DocArchive da
CROSS APPLY (SELECT TOP(1)
DOCHIST.*
FROM DOCHIST
LEFT JOIN CONTRACTS
ON CONTRACTS.CONTNO = DOCHIST.ORDNO
WHERE DOCHIST.DOCTYPE = 35
AND DOCHIST.DOCNAME = 'Delivery Confirmation'
AND CONTRACTS.[STATUS] = 4
AND dh.RECID = da.PARID
HAVING COUNT(*) OVER (PARTITION BY ORDNO, DOCNAME) > 1
) dh
ORDER BY ORDNO, [DATE] ASC
Msg 8121, Level 16, State 1, Line 17
Column 'DOCHIST.ORDNO' is invalid in the HAVING clause because it is not contained in either an aggregate function or the GROUP BY clause.
Msg 8121, Level 16, State 1, Line 17
Column 'DOCHIST.DOCNAME' is invalid in the HAVING clause because it is not contained in either an aggregate function or the GROUP BY clause.
Msg 4108, Level 15, State 1, Line 17
Windowed functions can only appear in the SELECT or ORDER BY clauses.
SELECT dh.[DATE],
dh.ORDNO,
dh.DOCTYPE,
dh.DOCNAME,
da.RECID,
(DATALENGTH(da.DOCUMENT)/1024)/1024 AS SIZE
FROM ihdocarchive.dbo.DocArchive da
LEFT JOIN
(SELECT DOCHIST.*,
ROW_NUMBER() OVER (PARTITION BY ORDNO, DOCNAME ORDER BY [DATE]) rn
FROM DOCHIST LEFT JOIN CONTRACTS ON CONTRACTS.CONTNO = DOCHIST.ORDNO WHERE DOCHIST.DOCTYPE = 35 AND DOCHIST.DOCNAME = 'Delivery Confirmation' AND CONTRACTS.[STATUS] = 4) dh on dh.RECID = da.PARID
WHERE rn = 1
ORDER BY ORDNO, [DATE] ASC
But this returns a record for every ORDNO that has a Delivery Confirmation, not just those ORDNO that have more than one Delivery Confirmation. My original query returned 24,000 records, so I was expecting less than half this number. What am I missing?
If you provide DDL and sample data, we can probably help you out. Your Row_number function is partitioning by OrdNo and DocName, so you won't get 1 record per OrdNo, but you'll get 1 record per OrdNo and DocName
WITH ContractsWithTheMinOrdNO (OrdNo, [Date])
AS
(
SELECT dh.ORDNO, MIN(dh.[Date])
DOCHIST
LEFT JOIN CONTRACTS ON CONTRACTS.CONTNO = DOCHIST.ORDNO
WHERE DOCHIST.DOCTYPE = 35
AND DOCHIST.DOCNAME = 'Delivery Confirmation'
AND CONTRACTS.[STATUS] = 4
)
SELECT dh.[DATE]
,dh.ORDNO
,dh.DOCTYPE
,dh.DOCNAME
,da.RECID
,(DATALENGTH(da.DOCUMENT) / 1024) / 1024 AS SIZE
FROM ihdocarchive.dbo.DocArchive da
LEFT JOIN (
SELECT DOCHIST.*
,COUNT(*) OVER (
PARTITION BY ORDNO
,DOCNAME
) AS cnt
FROM DOCHIST
LEFT JOIN CONTRACTS ON CONTRACTS.CONTNO = DOCHIST.ORDNO
WHERE DOCHIST.DOCTYPE = 35
AND DOCHIST.DOCNAME = 'Delivery Confirmation'
AND CONTRACTS.[STATUS] = 4
) dh ON dh.RECID = da.PARID
INNER JOIN ContractsWithTheMinOrdNO ON cnt.OrdNo=ContractsWithTheMinOrdNO.OrdNo AND
cnt.[Date] =ContractsWithTheMinOrrdNO.[Date]
something like this should work. Maybe a few syntac errors but google for common table expressions should do the trick.