SQLTeam.com | Weblogs | Forums

Remove duplicate rows from a large table in SQL Server


#1

does anyone know the best way to remove duplicate rows from a large table? Here's what I have so far:
MyTable

RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null

Thanks!


#2

I have below observation here
If table has primary key column, key column won't allow duplicate column values.


#3

Delete duplicate rows using CTE:

-- BEGIN TRANSACTION
;WITH T_CTE
AS
(
	SELECT	[T_RowNumber] = ROW_NUMBER()
		OVER (
			PARTITION BY DupCol1, DupCol2, ...
			ORDER BY DupCol1, DupCol2, ..., TieBreakCol
		)
	FROM	#TEMP
	WHERE	... any restriction on which rows to compare ...
)
DELETE	D
FROM	T_CTE AS D
WHERE	T_RowNumber >= 2
--
-- COMMIT
-- ROLLBACK

In the "DupCol1, DupCol2" list include all columns which must match to be considered a duplicate

"TieBreakCol" should be a column used as a tie-break to decide which row to KEEP. For example a ModifyDate or an ID number. Note that the FIRST row, i.e. sorted LOW, will be retained, so use DESC if you need the HIGHEST / LATEST value retained. If the tie-break column is NOT unique then the row chosen will be at random, so you may also want to include a unique key, e.g. the primary key(s), so that the process is repeatable :smile:

If the number of rows deleted is very large (more than 10,000 say) then you may need to do this in batches, particularly if the delete operation will be performed whilst the database is in use.

If the rows have a unique identifier (Primary Key would be a good candidate, if you have one, most especially if it is also the Clustered Index) then use the above code to just collect the key(s) for the records to be deleted into a #TEMP table, and then use that to physically remove the rows, probably something like the code below.

This code sample has configurable options for limits on number of deletes per iteration, max iterations, max time of any single iteration, and max overall runtime (e.g. abort after maintenance window expires). It must NOT be run inside a transaction. It includes a SLEEP (default 3 seconds) each iteration, which will stop it hogging resources and allow other process to run.

(If the single-iteration time causes an abort you might adjust the code so that the Batch Size is adjusted, downwards, after a slow iteration and back upwards, again, after a fast iteration, it will then auto-adjust to varying load on the server. In practice I have found that some delete iterations are slower than others (secondary index cleanup etc., checking for foreign key rows and so on)

;WITH T_CTE
AS
(
	SELECT	[T_RowNumber] = ROW_NUMBER()
		OVER (
			PARTITION BY DupCol1, DupCol2, ...
			ORDER BY DupCol1, DupCol2, ..., TieBreakCol
		)
	FROM	#TEMP
	WHERE	... any restriction on which rows to compare ...
)
SELECT	[T_ID] = IDENTITY(int, 1, 1),
	KeyCol1, KeyCol2, ...
INTO	#TEMP
FROM	T_CTE AS D
WHERE	T_RowNumber >= 2
ORDER BY KeyCol1, KeyCol2, ...	-- Works best if sorted by Clustered Index keys
--
-- NOTE: This delete section must NOT run inside a transaction!!
--	This section can be aborted (any deletes completed will not be rolled back)
--	This section has configurable options for Max Batches and Max Iteration time limits
--	and Max elapsed runtime (e.g. so that it completes within a maintenance window)
--
--
-- NOTE: #TEMP table must be pre-populated with the Key Columns of rows to be deleted
--	Replace KeyCol1, KeyCol2 etc. below with the actual Key Column names
--	Replace MyTable below with the name of the target table

DECLARE	@intErrNo int,
	@intRowCount int,
	@intLoopCount int,
	@intLoopCountLimit int,
	@intTotalRowCount int,
	@intElapsed int,
	@intIterationElapsedLimit int,
	@intTotalElapsedLimit int,
	@intBatchSize int,
	@intOffset int,
	@intExitReason int,
--	@intExitReason:
--		0=Success,
--		1=Error (see @intErrNo),
--		2=Max time exceeded,
--		3=Max iteration time exceeded,
--		4=Loops limit exceeded
	@dtStart datetime,
	@dtIterationStart datetime

SELECT	@intBatchSize = 10000,			-- Rows to delete (per iteration)
	@intIterationElapsedLimit = 15000,	-- Abort if a single delete exceeds this (ms) - NULL=Infinity
	@intTotalElapsedLimit = 60,		-- Abort if still running after this amount of time (minutes) - NULL=Infinity
	@intLoopCountLimit = 200		-- Maximum number of loops - NULL=Infinity

SELECT	@intErrNo = 0,		-- Reset
	@intRowCount = 1,	-- Force first iteration
	@intLoopCount = 1,
	@intTotalRowCount = 0,
	@intElapsed = 0,
	@intOffset = 1,
	@dtStart = GetDate()	-- Start time for the whole batch

WHILE	    @intRowCount >= 1
BEGIN
	SELECT	@dtIterationStart = GetDate()	-- Start time for THIS iteration
	DELETE D
	FROM	#TEMP AS T
		JOIN dbo.MyTable AS D
			 ON D.KeyCol1 = T.KeyCol1
			AND D.KeyCol2 = T.KeyCol2
			...
	WHERE	    T.T_ID >= @intOffset
		AND T.T_ID <  @intOffset + @intBatchSize
	SELECT @intErrNo = @@ERROR, @intRowCount = @@ROWCOUNT

	SELECT	@intLoopCount = @intLoopCount + 1,
		@intTotalRowCount = @intTotalRowCount + @intRowCount,
		@intOffset = @intOffset + @intBatchSize,
		@intElapsed = DATEDIFF(Millisecond, @dtIterationStart, GetDate())

	IF @intErrNo <> 0
	BEGIN
		SELECT	@intRowCount=0,	-- Abort if ERROR
			@intExitReason = 1
	END

	IF @intTotalElapsedLimit IS NOT NULL
	AND @intTotalElapsedLimit < DateDiff(Minute, @dtStart, GetDate())
	BEGIN
		SELECT	@intRowCount=0,	-- Abort - total elapsed time limit exceeded
			@intExitReason = 2
	END

	IF @intIterationElapsedLimit IS NOT NULL
	AND @intElapsed > @intIterationElapsedLimit
	BEGIN
		SELECT	@intRowCount=0,	-- Abort - single-iteration time limit exceeded
			@intExitReason = 3
	END

	IF @intLoopCountLimit IS NOT NULL
	AND @intLoopCount > @intLoopCountLimit
	BEGIN
		SELECT	@intRowCount=0,	-- Abort - Loop-count limit exceeded
			@intExitReason = 4
	END

/** ENABLE FOR DEBUGGING *
	SELECT	[@intLoopCount]=@intLoopCount
		, GetDate()
		, [Elapsed] = @intElapsed
		, [@intErrNo]=@intErrNo
		, [@intRowCount]=@intRowCount
		, [@intTotalRowCount]=@intTotalRowCount
		, [@intOffset]=@intOffset
	RAISERROR (N'Waitfor', 10, 1) WITH NOWAIT	-- Force the debug message to display
* ENABLE FOR DEBUGGING **/

	WAITFOR DELAY '00:00:03'	-- SLEEP to allow other processes to run
END	-- @intRowCount >= 1

IF @intErrNo <> 0
BEGIN
	... handle the error ...
END

DROP TABLE #TEMP