Data Churn - many small deletes or one big one

kayakingJeff · August 24, 2017, 7:37pm

I have a table with twenty columns that contains about 4 million records (40GB). Each record only lives in the table for about 10 days before it is discarded. Each record has a datetime field to identify when it was inserted into the table. The table has 1 clustered and 4 non-clustered indexes (1GB).

Currently we have a task that runs every minute that prunes out the old data based on that datetime column. Every time it runs approximately a thousand records get deleted.

At the same time (different process), new data is being inserted into this table at about the same rate as the deletions - ~1000/minute.

This is working well to keep the table at a pretty stable size but we are starting to experience issues where occasionally either the CPU rises up to 80-90% or we run into locking issues that might momentarily prevent updates.

The issue has started to worsen so now we are looking for other ideas on how to better manage the data churn in this table. I don't know for sure but I'm imagining that all the inserts and deletes are causing the index updates to push our server to its limits.

I can't control how many records are flowing into the table but I can control when the data gets deleted. The inserts are pretty constant throughout the day, maybe slowing slightly in the middle of the night. Would batching up the deletes to occur every 6 or 12 hours be more efficient than what I'm currently doing? The table size would certainly grow but I'm thinking that with the fewer index updates I might be able to avoid other performance issues.

Thoughts?

Thanks,
Jeff
SQL 2014 Enterprise

gbritton · August 24, 2017, 8:10pm

I'd schedule the deletes for outside business hours. Batching them is a good idea too... keeps the transaction log smaller.

Another option would be to partition your table by the datetime column. Then you could purge the oldest partition (eleven days old) every night using the switch/truncate process. Good description here:

http://www.jamesserra.com/archive/2012/07/sql-server-table-partition-switching/

that way, deleting is painless and you will always have 10 days of active data.

ScottPletcher · August 24, 2017, 8:57pm

You need to cluster the table on that datetime column first; you can add an identity / other column*(s) to make the key unique if you prefer. I'm assuming you also frequently look up by that datetime.

That will both isolate and make contiguous the area where pages are DELETEd and where pages are INSERTed and thus prevent unnecessary locking.

kayakingJeff · August 24, 2017, 9:29pm

Actually the datetime column is only used to help identify when the record should purged from the system. The table is an outbound queue for records that have already been processed. There are a couple of other columns that are used to uniquely identify a specific record. The records will be mostly clustered together since they are getting deleted in the same order as they were inserted into the table due to an identity column being the primary key.

So getting back to my question, is there a performance advantage to performing a few large deletes over doing many small deletes? We don't have any downtime/slow time as our customers are pretty evenly spread around the globe.

I know the answer with SQL Server is often "It depends". If I could I'd set up a second environment that exactly duplicated production to experiment on but I don't have that luxury at this time. Unfortunately a delay as short as a few seconds is noticed by our customers and reported on so I'm pretty nervous to just try something without being reasonably sure the results will be positive.

So another wrinkle in this scenario that I failed to mention earlier is that depending on the value in one of the columns the record may only have a life of 3, 5 or 7 days. So partitioning just based on date isn't going to work in this case. But the sense I'm getting is that grouping a bunch of deletes together a couple times a day is probably better than doing a lot of small deletes throughout the day.

Thanks,
Jeff

gbritton · August 25, 2017, 12:42pm

But you could have a fancier partition function that takes the life expectancy into account

kayakingJeff · September 7, 2017, 9:07pm

I've setup a test environment with a daily partition that goes back 2 weeks. I've confirmed that the data is getting separated out into the proper partition when inserted. Each evening I'm just truncating the 14th day partition and adjusting the partitioning function to account for the next day.

Its working but I'm still wondering about the original question of 'Is it better to have many small operations throughout the day or a few large ones?'

At any rate my record deletions are occurring with minimal impact on the server (just additional disk space) and that's what I was aiming for.

Thanks,
Jeff

gbritton · September 8, 2017, 3:24pm

The answer to the original question kinda depends on your definition of large and small. One thing about smaller deletions is that less log space is consumed. if that's important to you, then there's an advantage. If large = 1,000,000 rows or more, I'd certainly opt for a set of smaller ops, just to keep the log usage reasonable.

kayakingJeff · November 30, 2017, 9:40pm

Thanks for the insight on the impacts of deletions. I ended up implementing partitions, each with a day's worth of data. A nightly job swaps the partitions out and effectively removes the data without any impact on the logs or CPU. Its been running for a couple months now and is pretty effective. We will be implementing this technique to several other tables as well now that it has proven itself to work really well.

Thanks!

Kristen · December 1, 2017, 7:55am

I didn't see your original thread - must have been asleep! Glad the Partitioning is working well for you.

I don't have a specific answer to your question, but some thoughts:

I see lots of DELETE processes that does a loop like this

DECLARE @ROWCOUNT int = 1 -- Force first iteration
WHILE @ROWCOUNT >= 1
BEGIN
   DELETE TOP 10000 D -- Or some other means of defining a batch size to delete
   FROM MyBigTable AS D
    JOIN MaybeSomeOtherTable AS O
       ON O.ID = D.ID
    WHERE D.CreateDate < @MyCutoff
    SELECT @ROWCOUNT = @@ROWCOUNT
END

The "problem" I have with this is that each time around the loop the DELETE is running a, potentially massive, query. That can compound the interference with INSERTS. In your case probably the query is complex because of the life expectancy component. (By the by, I would definitely make sure that there was an index on Date + Life Expectancy component; better still would be to have a "Kill On " date column, so that a straight Start/End point on that one-column covering index could be used.

The way we tackle it is to put all the (Unique) Clustered Index keys, for the records to be deleted, into a #TEMP table, which also has an IDENTITY (PKey for #TEMP)

Then loop joining #TEMP to the original table (on clustered index), based on a range of ID values - to control the size of the batch. Make sure that there is NO Transaction during the looping! This causes all deleted on the Main Table to be in Clustered Index order, and means that there are no "slow" queries on Main Table during the delete process itself.

INSERT INTO #TEMP(PKey1, Pkey2, ...)
SELECT PKey1, Pkey2, ...
FROM MyBigTable AS D
    JOIN MaybeSomeOtherTable AS O
       ON O.ID = D.ID
WHERE D.CreateDate < @MyCutoff

DECLARE @OFFSET int = 1 -- Force first iteration
      , @BATCH_SIZE int = 100000 -- Size fo each loop delete
WHILE @OFFSET >= 1
BEGIN
   DELETE D
   FROM #TEMP AS T
    JOIN MyBigTable AS D
        ON D.PKey1 = T.PKey1
       AND D.PKey2 = T.PKey2
    ...
    WHERE T.ID >= @OFFSET
      AND T.ID < @OFFSET + @BATCH_SIZE
    SELECT @OFFSET = CASE WHEN @@ROWCOUNT = 0 THEN 0 -- Finished
                     ELSE @OFFSET + @BATCH_SIZE
                     END
    -- ... other activity here ...
END

We add "other activity" which can include:

How long did that iteration take? Longer than @MAX_THRESHOLD? If so divide the @BATCH_SIZE by 2, shorter then @MIN_THRESHOLD? If so add 10% to @BATCH_SIZE. This adjusts to other activity on the server. Its not perfect - we still see some interations taking a massive amount of time, relative to other ones, and regardless of other activity on the server, but if some "busy process" sparks up then the delete will "back off"

Also: Add a PAUSE between each ITERATION of, say, 2 seconds - to allow other processes to run

We would schedule the bulk-delete at the time of the night of lowest activity

But the PARTITION approach is best for this type of "sequential add" and "sequential delete" type activity.

kayakingJeff · April 27, 2018, 6:00pm

Just in case someone else is in need of deleting large amounts of data in a very active table here are a couple of resources to help manage the deletions without creating other problems in the database.

We went with a daily partition swapping process but that doesn't work for all situations, like say on a table that participates in peer-to-peer replication. In that situation you need to carefully perform deletes. I hope these following links help someone.

Brent Ozar

Michael J Swart

-Jeff