Strategies for deleting large amounts of data from a shared database

We have some huge multi-tenant databases. When a client leaves us, we have to delete their data after 30 days. If it is a small client, it isn't a big deal because the delete script takes a few minutes to run. But if it is a large client, the script can take many hours to run and that locks the tables for the rest of the clients trying to use the system. We implemented a stored proc that deletes 500 records at a time per transaction which allows other queries to run in between but it still has a noticeable impact on performance. Are there any better strategies to accomplish this?
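
(As a point of reference, here is a minimal sketch of the kind of batched delete being described. The table, column, and batch-size values are hypothetical, not taken from the poster's actual script.)

```sql
-- Minimal sketch of a batched delete. Table, column, and values are hypothetical.
DECLARE @ClientID    int = 42,     -- client being purged
        @BatchSize   int = 500,
        @RowsDeleted int = 1;

WHILE @RowsDeleted > 0
BEGIN
    DELETE TOP (@BatchSize)
    FROM dbo.AuditLog
    WHERE ClientID = @ClientID;

    SET @RowsDeleted = @@ROWCOUNT;  -- stop when nothing is left to delete
END;
```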

You probably need to re-architect the system with this in mind.
The simplest way is to partition the data by client; then you can just remove the partitions.

If you can't do that:
You could also look at why your trickle delete is having an impact. Have you implemented referential integrity? If so, a delete of referenced data needs a search of the referencing data, and vice versa. Often there is an index on only one side, which causes a table scan of the other table. For deletes, you need to index both sides (see the sketch below).
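
(A hedged illustration of that indexing point, using hypothetical parent/child tables dbo.OrderHeader and dbo.OrderLine.)

```sql
-- Hypothetical parent/child tables. Deleting from dbo.OrderHeader forces SQL Server
-- to check dbo.OrderLine for referencing rows; without an index on the FK column,
-- that check is a scan of dbo.OrderLine for every deleted parent row.
CREATE NONCLUSTERED INDEX IX_OrderLine_OrderHeaderID
    ON dbo.OrderLine (OrderHeaderID);
```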

It could just be that you need to throttle your delete stored procedure - add pauses between each loop and make sure it doesn't run at peak times.
Also, find out which deletes are slow and fix those.
Make sure none of the deletes are table scanning - if necessary, create a table that holds the primary keys of the rows to delete and use that for the delete (see the sketch below).
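
(One way to combine the throttling and key-staging suggestions, sketched with hypothetical table, column, and delay values.)

```sql
-- Sketch only: table, column, and delay values are hypothetical.
DECLARE @ClientID int = 42;                       -- client being purged

-- 1) Stage the primary keys of the rows to delete, so each batch can
--    seek on the key instead of re-evaluating the client filter.
SELECT ActivityID
INTO   #KeysToDelete
FROM   dbo.Activity
WHERE  ClientID = @ClientID;

CREATE CLUSTERED INDEX IX_KeysToDelete ON #KeysToDelete (ActivityID);

-- 2) Work through the staged keys in small batches, pausing between loops.
DECLARE @Batch TABLE (ActivityID int PRIMARY KEY);

WHILE 1 = 1
BEGIN
    DELETE FROM @Batch;

    -- Take the next 500 keys off the staging table.
    DELETE TOP (500) FROM #KeysToDelete
    OUTPUT deleted.ActivityID INTO @Batch;

    IF @@ROWCOUNT = 0 BREAK;                      -- nothing left to delete

    -- Delete the matching rows from the real table.
    DELETE a
    FROM dbo.Activity AS a
    JOIN @Batch AS b ON b.ActivityID = a.ActivityID;

    WAITFOR DELAY '00:00:02';                     -- throttle; tune or schedule off-peak
END;
```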


You should cluster the tables on client as the leading key, assuming you always do, or can, use the client id as part of the WHERE clause to read data. That allows all processing to be faster, including deletes when necessary.

[Don't believe the stupid myth that you should "always" cluster a table on identity. That's just wrong and can cause you a huge performance loss.]
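
(An illustration of what clustering on client might look like, assuming a hypothetical dbo.Activity table that is always filtered by ClientID; this is a sketch, not a definitive recommendation.)

```sql
-- Illustrative only: hypothetical table and column names.
-- Cluster on ClientID first so reads and deletes for one client touch a
-- contiguous range of pages; ActivityID is appended to keep the key unique.
-- (If the table already has a clustered PK, it would need to be rebuilt as
-- nonclustered first, which is a sizeable operation on large tables.)
CREATE UNIQUE CLUSTERED INDEX CIX_Activity_ClientID_ActivityID
    ON dbo.Activity (ClientID, ActivityID);
```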

Not using Enterprise so can't do the partition thing :frowning:

But I'll try these other suggestions. Thanks everyone.

Is it not possible to create a database per customer?

Possibly, if we decide to jump to Enterprise Edition, but right now we are using AGs with Standard Edition, which means allowing two IP addresses per AG, and you can only have one database per AG (in Standard Edition). Since we are using AWS, we are limited in the number of IP addresses per instance, which means we can only set up a limited number of AGs. So we do have some of our super huge clients in their own databases, but we can't do that for all clients right now.

Note: we have 1000 clients (for context).

:flushed::face_with_monocle::scream: That is a lot.

Or one database with each customer having their own table(s), with a prefix or suffix indicating which ones they own, and another set of tables that manages the metadata.

Just wild guesses here.

FYI, partitioning is available in all SQL Server editions from version 2016 with Service Pack 1 onward. If you are on that version, or can upgrade to it, it will be available to you.

That being said, partitioning for 1000+ clients is going to take some forethought; it won't be quite as easy as described. You'll have multiple tables that will all need to maintain the same, or structurally similar, partitioning keys in order to be able to switch out partitions. Also note that you can TRUNCATE a table by partition in SQL Server 2016 and higher, as long as you don't have foreign keys referencing the table being truncated (see the sketch below).
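
(A hedged sketch of partitioning by client and truncating a single client's partition. The partition function, scheme, table, and boundary values are all hypothetical and abbreviated; a real design for 1000+ clients would need a much more deliberate boundary strategy.)

```sql
-- Sketch only: hypothetical partition function, scheme, and table.
-- One partition per client (boundary list abbreviated here).
CREATE PARTITION FUNCTION pfClient (int)
    AS RANGE RIGHT FOR VALUES (1, 2, 3, 4, 5 /* ...one boundary per client... */);

CREATE PARTITION SCHEME psClient
    AS PARTITION pfClient ALL TO ([PRIMARY]);          -- single filegroup for simplicity

CREATE TABLE dbo.Activity
(
    ClientID   int          NOT NULL,
    ActivityID bigint       NOT NULL,
    CreatedAt  datetime2(0) NOT NULL,
    CONSTRAINT PK_Activity PRIMARY KEY CLUSTERED (ClientID, ActivityID)
) ON psClient (ClientID);

-- SQL Server 2016+: empty a departed client's slice without row-by-row deletes.
-- Requires that no foreign keys reference dbo.Activity.
SELECT $PARTITION.pfClient(3) AS PartitionNumber;      -- which partition holds ClientID = 3?

TRUNCATE TABLE dbo.Activity WITH (PARTITIONS (4));     -- literal partition number found above
```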

You should also not believe in the stupid myth that the clustered index should be based on such things as client. There's hell to pay in the world of fragmentation if you do and there's hell to pay on the non-clustered indexes if your clustered index isn't unique.

  1. How long is it taking to delete the 500 "records"?
  2. Also, are you waiting between iterations of the DELETE loop or are you just looping around the DELETE?
  3. How many total "records" are in the table and how many are you trying to delete?
  4. You say the databases are "huge". What is the definition of "huge" here and how many tables are you actually talking about?
  5. Also, do the tables you're deleting from have DRI in place in the form of FKs?

The most common problem in actual practice is that people slap an identity on a table and create a covering index for every major query and assume they have the best performance. The majority of the time you don't.

That is the single most damaging "design" myth. The majority of tables are best clustered by something other than identity.

Again, I'd like to see an actual example of the type of fragmentation you're talking about. I've almost never seen it when the index parameters are correct.

  1. Not sure. It has been a while since we ran it, and we have only recently been considering starting it back up (or looking at a different method, thus this post). I don't recall it taking too long for 500 records (10-20 seconds, maybe). Note: since we haven't run it in a while, the script (which is huge) needs to be modified to reflect more recent schema changes.

  2. We've experimented with different pauses between transactions. Too much of a pause and the delete process is way too slow.

  3. It varies tremendously by client, but a fairly large client could have 10 million records in the audit table, 5 million in the activity table, and 3 million in 5 other tables. Our schema, in total, has about 120 tables.

  4. All of our databases combined are about 20TB in size.

  5. We have FKs but not constraints (not sure if that answers your question - see the sketch below for one way to check what's actually declared).
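
(For what it's worth, a generic way to check which foreign key constraints are actually declared, and whether they are enabled and trusted, is to query sys.foreign_keys; this sketch is not specific to the schema in the thread.)

```sql
-- Generic check: which FK constraints exist, and are they enabled/trusted?
SELECT fk.name                              AS fk_name,
       OBJECT_NAME(fk.parent_object_id)     AS referencing_table,
       OBJECT_NAME(fk.referenced_object_id) AS referenced_table,
       fk.is_disabled,
       fk.is_not_trusted
FROM   sys.foreign_keys AS fk
ORDER  BY referencing_table, fk_name;
```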

Thanks!

You keep saying that IDENTITY columns are stupid, and yet you've not provided a coded example of how replacing the IDENTITY column with something else solved the problem. In our previous contact, I also suggested that you provide a table and tell me how you'd like me to populate it for test purposes, and that I'd be happy to do so for both of us. Then you can demonstrate with code what you're talking about, and so can I.

The best solution is still likely to cluster on client first.

You could even consider partitioning to isolate very large clients (we do that here). Partitioning isn't available in some versions of SQL Server, though, so that option may not be open to you.

I never said they were "stupid". I said automatically clustering tables on them, without considering better alternatives, is stupid. As is completely ignoring the design step and automatically slapping identities on in a so-called "design" process.

You can't do a logical design with identities, because an identity doesn't exist in the logical design phase. And it makes it impossible to properly normalize. How do you determine if non-key attributes relate only to the full key if the key is just a meaningless number? You can't, period.

I'll admit, in a sense that's not an issue at all for developers, since they skip logical design completely and go straight to a physical "design", most often automatically starting with identities on every table. With that start, the design will be bad, and overly-wide tables will be unavoidable.

Most often when I ask developers later what the data-based key to a table with an identity is, they have no clue. It's never been thought about. "The identity is the key, that's all we need."

No, an identity is a physical implementation tool of a logical design, it's not part of a design, and shouldn't be used nearly as often as it is.

No problem. Post the code to create a "Client" table (since that's what's being used on this post), along with your definition of the Clustered Index and what the population pattern is (single row by many, small batches, large batches, whatever), and I'll see if I can demonstrate what I mean by it being subject to unnecessary fragmentation. I'd be happy to build a table for this, but I want to avoid the back'n'forth of "no... that's not what I meant for the table design".

Also... let's not talk about "Logical Design", because we're talking about the actual design of the table and indexes, which is "Physical Design/Implementation" - especially when it comes to things like the impact it will have on fragmentation.

As far as "normalization" goes, be sure to indicate the parts of the Client table you'll provide that are key to "normalization". :wink:

Yes, although ignoring the logical design is what gets us into the mess in the first place. No design = terrible, inflated "table" structures, need for gazillions of covering indexes so queries can still run within any reasonable time frame.

I would need to see existing index usage and missing index stats, along perhaps with cardinality of some columns, to determine the best clustered index for that specific table. I'd be happy to do that here, or with any reasonable number of tables, if needed.
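
(For anyone wanting to gather that kind of information, the usual starting points are the index usage and missing-index DMVs. This is a generic sketch; the table name is hypothetical.)

```sql
-- How the existing indexes on a given table are actually being used.
SELECT i.name AS index_name,
       us.user_seeks, us.user_scans, us.user_lookups, us.user_updates
FROM   sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS us
       ON us.object_id   = i.object_id
      AND us.index_id    = i.index_id
      AND us.database_id = DB_ID()
WHERE  i.object_id = OBJECT_ID('dbo.Activity');   -- hypothetical table name

-- Indexes the optimizer wished it had for this database.
SELECT d.statement AS table_name,
       d.equality_columns, d.inequality_columns, d.included_columns,
       s.user_seeks, s.avg_user_impact
FROM   sys.dm_db_missing_index_details AS d
JOIN   sys.dm_db_missing_index_groups AS g
       ON g.index_handle = d.index_handle
JOIN   sys.dm_db_missing_index_group_stats AS s
       ON s.group_handle = g.index_group_handle
WHERE  d.database_id = DB_ID();
```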

The key thing is to not just assume that slapping an identity on a table and automatically clustering on it is (nearly) always right. It's not, it's stupid and can hurt performance a lot. There is no such thing as a "default" clustering index. It's far too important a choice for that.

The "narrow, ever-increasing, etc." cluster key advice should not be so slavishly adhered to. Yes, it's a good general guideline, but it's not one of the hundreds of commandments Moses brought down from the mountain.

That's what experts are: people who know when it is right to vary from generic guidelines to match specific cases.


This is, to me, one of the most confusing and frustrating subjects in my career. There are as many opinions on it as there are SQL developers. I once attended a SQL conference where the gentleman presenting explained it so well I forgot to take notes!

The same is true about changing to something else. As with all else in SQL Server, "It Depends". Heh... and, yeah... let's see you put a Clustered Index on a "Logical Design". :wink: While I agree that it's usually a sorely lacking step, we are NOT talking about "Logical Design" here.

For those interested and watching this thread as to why so many people push the "narrow, ever-increasing, etc." thing, please take the time to watch the following YouTube video by none other than Microsoft Certified Master Kimberly Tripp. She worked at Microsoft on this very thing.

While I DO agree that even "world renowned" experts can be incorrect, watch the video and learn things that you might not know about indexes, and Clustered Indexes in particular. Keep in mind that the other things to consider for Clustered Indexes are "unique, immutable, and non-nullable". Also consider that the keys of every Clustered Index are silently added to the leaf level of all Non-Clustered Indexes, and how much extra space and memory that can require if the Clustered Index keys aren't "narrow". Also consider the extra column data that will be created in the non-clustered indexes if the Clustered Index is not unique.

Then also watch the second video, where Paul Randal talks about page splits and the devastation they can cause in the areas of logical fragmentation, physical fragmentation, extended waits caused by blocking during the page splits, and the huge number of log file entries (which extend the waits even more) that page splits can generate.
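
(For anyone who wants to measure fragmentation on their own tables, a common approach is sys.dm_db_index_physical_stats; this is a generic sketch with a hypothetical table name.)

```sql
-- Check logical fragmentation and page fullness for one table's indexes.
-- LIMITED mode is cheapest; SAMPLED or DETAILED is needed for page-density figures.
SELECT i.name AS index_name,
       ps.index_type_desc,
       ps.avg_fragmentation_in_percent,
       ps.avg_page_space_used_in_percent,
       ps.page_count
FROM   sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Activity'),
                                      NULL, NULL, 'SAMPLED') AS ps
JOIN   sys.indexes AS i
       ON i.object_id = ps.object_id
      AND i.index_id  = ps.index_id
ORDER  BY ps.avg_fragmentation_in_percent DESC;
```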


It's simply false that not using a "sacred" identity column as the clustering key will necessarily cause page splits.

It's also false that using an identity always prevents page splits. It doesn't, particularly if varchar columns grow.

Moral of the story: never just lazily slap an identity on (nearly) every table to use as the clustering key because of a catchy saying. The clustered index is the most critical element for best performance, and it shouldn't ever be defaulted.

Kalen Delaney quite rightly points out that you should very carefully consider the choice of a clustering key and that just defaulting to an identity, say, is often the wrong choice.
