Strategies for deleting large amounts of data from a shared database

Correct... by itself, using an IDENTITY column doesn't always prevent fragmentation. I actually have examples that I teach where I use random GUIDs as both the PK and the CI and then demonstrate how to go for months of INSERTs with not only no bad page splits, but no good ones either, and with NO fragmentation created either logically or physically.
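
A rough sketch of the kind of setup I'm talking about (the table, column names, and the fill factor here are just placeholders for illustration; the maintenance regimen that keeps it fragmentation-free is a presentation of its own):

 CREATE TABLE dbo.GuidKeyDemo
        (
         RowGuid        UNIQUEIDENTIFIER NOT NULL CONSTRAINT DF_GuidKeyDemo_RowGuid DEFAULT (NEWID())
        ,OtherColumns   CHAR(200)        NOT NULL CONSTRAINT DF_GuidKeyDemo_Other   DEFAULT 'X' --Simulates other columns.
        ,CONSTRAINT PK_GuidKeyDemo PRIMARY KEY CLUSTERED (RowGuid) WITH (FILLFACTOR = 90) --Fill factor is illustrative only.
        )
;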

You're also correct that using something else as the Clustered Index key doesn't necessarily cause fragmentation but, I will say, that if you've not given it the very consideration that Kalen suggests, it will. Just as you say about "Logical Design", I find that a lot of people don't consider "Physical Design" either, and both of those lapses are what get people into trouble. In other words, I totally agree with you that understanding what needs to be done frequently defines how it needs to be done, and a lot of people either don't take the time or, usually, don't know what is actually possible. That also means that IDENTITY should neither be the default nor avoided like the plague. "It Depends".

To make it clear, I don't think that IDENTITY is "sacred" but I also think that when someone gets in a huff by calling it "sacred" and recommending something else, they also need to explain ALL of the ramifications of avoiding it and using something else. You've done a great job of explaining the advantages of not using it but you've not even come close to explaining the possible disadvantages of avoiding it.

And, to be equally clear, I agree that way too many people just slap an IDENTITY column into a table and use it as a Clustered Index by default. That's certainly not a panacea but neither is avoiding it. "It Depends" but it also depends on a whole lot more than what most people think.


I've done this for thousands of tables and almost never seen the problem you insist almost automatically causes "massive" fragmentation.

This example should provide a good test. Assuming data is (almost) always looked up by ( client, some_date ), then the table should be clustered by ( client, some_ascending_date, identity /* to ensure uniqueness */ ). SQL should be able to detect that inserts are sequential and should avoid page splitting when adding lots of new rows, exactly as it does for a lone identity key.

Let's cluster it that way and see if any other indexing approach can get the same fast, consistent results. Especially when a new column is added to the query, as will inevitably happen over time.

Moreover, the argument that the extra key cols go into every non-clus index is moot too, since if you almost always search by them, they will be put into covering indexes anyway, to match the WHERE conditions.

What other disadvantages are there, then, if there's not lots of fragmentation and those columns would be needed in covering indexes anyway? After all, we're clustering it that way on the premise that (almost) all queries / all critical queries search by those columns anyway.
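
To be concrete (a rough sketch, with made-up table and column names along the lines used above): if the table is clustered on an identity instead, the common lookup typically ends up needing a covering index like this anyway:

 CREATE NONCLUSTERED INDEX IX_SomeTable_Client_SomeDate
     ON dbo.SomeTable (Client, SomeAscendingDate)
     INCLUDE (SomeOtherColumn)  --Whatever the query actually selects.
;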

Certainly it needs to be tested. But tables can also be reorganized if needed. Not too often, but if needed. Exactly as you do for fragmented tables clustered by identity.

I'm in a "huff" as you so snarkily put it, because this over-dependence on identity does destroy any hope of logical design and any possibility of normalization. Normalization rules depend on business data keys; thus, it's literally impossible to even attempt 2NF, 3NF or BCNF without a known data key. And it can terribly hurt performance in the final tables, particularly of common joins. I dislike it so much because the "always use identity" myth is so very damaging.

One also needs to consider all the disadvantages of clustering by identity, esp. for child tables. You're basically forced to create all sorts of covering indexes, since parent-to-child joins are naturally based on the parent key, not on the child's identity (which the parent doesn't know).
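
A rough sketch of what I mean (made-up names): cluster the child on its own identity and the parent-key join immediately needs its own, preferably covering, index:

 CREATE TABLE dbo.OrderDetail
        (
         OrderDetailID  INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_OrderDetail PRIMARY KEY CLUSTERED
        ,OrderID        INT NOT NULL   --The parent key the joins actually use.
        ,Quantity       INT NOT NULL
        ,LineTotal      MONEY NOT NULL
        )
;
 CREATE NONCLUSTERED INDEX IX_OrderDetail_OrderID
     ON dbo.OrderDetail (OrderID)
     INCLUDE (Quantity, LineTotal)  --Covering, to avoid key lookups back into the identity-clustered table.
;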

It wasn't a snark... you're the one that used the word "stupid" to start with. :wink: Just following through in kind.

What is the data type for the Client column, please? Also, what do you anticipate for an average row width? Once I know those things, I'll be able to come back with a self populating working example that you can run/test on your own. One of the things that you're forgetting, though, is that you don't need page splits to get some really nasty logical fragmentation in the form of what I call "Sequential Silos".

Once you've provided the information, it may take a bit for me to get back to you. I just arrived in Cincinnati for SQL Saturday and won't get back home until Sunday evening.

Ok, Scott... here you go. This creates a Clustered Index as described by you. Neither you nor the OP described what the datatype or content of the "Client" column should be so I made a guess. Let me know if I did it wrong according to what you were expecting.

The end result is >98% fragmentation at the end of the second day even though the inserts are done in exactly the same order as the Clustered Index keys on a daily basis.

Even if you insert for 1000 days (meaning plenty of rows, so that fragmentation shouldn't be quite as bad as time wears on) and then do an index rebuild, you're still going to hit 6% fragmentation on the next day of inserts. If you're one of the poor folks that believes you should REORGANIZE the index when you hit 5%, then you'll be REORGANIZING the Clustered Index EVERY TIME you run index maintenance. If you believe that you should wait until 10%, then it will "only" take 22 days until you need to REORGANIZE.
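
As a side note, the kind of threshold-driven maintenance being referred to is just this (a sketch, not a recommendation, using this test's table and index names and the 10% cutoff debated above):

 IF EXISTS (SELECT 1
              FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.TestTable'), NULL, NULL, 'LIMITED')
             WHERE avg_fragmentation_in_percent >= 10)
    ALTER INDEX IXC_TestTable ON dbo.TestTable REORGANIZE;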

--=====================================================================================================================
--      Create and populate a table to support the test.
--      It simply contains a list of "Clients" that will be used to populate the test table.
--=====================================================================================================================
--===== Create the support table and populate it with a unique list of 1,000 clients.
 WITH 
 E1(N)       AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))E0(N))
,cteClients  AS (
                 SELECT TOP (1000)
                        Client = 'Client'+RIGHT(CONVERT(VARCHAR(10),ROW_NUMBER() OVER (ORDER BY (SELECT NULL))+10000),4)
                   FROM E1 a, E1 b, E1 c, E1 d, E1 e
                )
 SELECT Client
   INTO dbo.UniqueClientList
   FROM cteClients
;
--=====================================================================================================================
--      Create a table to store logical fragmentation values for each iteration of the "day test"
--=====================================================================================================================
 CREATE TABLE dbo.FragLevel
        (
         Counter        INT
        ,FragPercent    FLOAT
        )
;
--=====================================================================================================================
--      Setup the test table and the clustered index on the test table
--=====================================================================================================================
--===== Create the test table
 CREATE TABLE dbo.TestTable
        (
         Client             VARCHAR(10)
        ,SomeAscendingDate  DATE
        ,RowNum             INT IDENTITY(1,1)
        ,OtherColumns       CHAR(200) DEFAULT 'X' --Simulates other columns in the table.
        )
;
--===== Create the Clustered Index that Scott described.
 CREATE CLUSTERED INDEX IXC_TestTable
     ON dbo.TestTable (Client,SomeAscendingDate,RowNum)
;
--=====================================================================================================================
--      Run the test.  For each day in a given range of dates, insert the Client and the DATE.
--      Not only will the DATE be in ascending order, but we'll also make sure the CLIENT is inserted
--      in ascending order each day just to try to eliminate fragmentation and to prove that it's all
--      for naught.
--=====================================================================================================================
--===== Make sure the test and frag measurement tables are empty at the beginning for repeated runs
        TRUNCATE TABLE dbo.TestTable;
        TRUNCATE TABLE dbo.FragLevel;

--===== Local variables and presets
DECLARE  @Counter   INT = 1
        ,@Days      INT = 7
;
  WHILE @Counter <= @Days
  BEGIN 
        --===== Add just 1 entry per Client per day
         INSERT INTO dbo.TestTable
                (Client,SomeAscendingDate)
         SELECT  Client
                ,SomeAscendingDate = DATEADD(dd,@Counter,'Jan 2019') --Inherently sorted by the day loop.
           FROM dbo.UniqueClientList
          ORDER BY Client --Note that we're even inserting in the same order as the leading column of the CI.
        ;
        --===== Measure and save the fragmentation level for this day (counter).
         INSERT INTO dbo.FragLevel
                (Counter,FragPercent)
         SELECT  Counter = @Counter
                ,FragPercent = avg_fragmentation_in_percent
           FROM sys.dm_db_index_physical_stats(DB_ID(),OBJECT_ID('dbo.TestTable'),NULL,NULL,'DETAILED')
          WHERE index_level = 0 --Limit the return to just the most important level, the Leaf Level
        ;
        --===== Bump the counter
         SELECT @Counter += 1
        ;
    END
;
--=====================================================================================================================
--      Let's look at the fragmentation.  Notice that we not only got fragmentation on the first day
--      but the second day went right straight to a massively fragmented state (> 98%).
--      This is because this CI is what is known as a "Sequential Silo" (my name for it)
--=====================================================================================================================
 SELECT *
   FROM dbo.FragLevel
;
--===== For test cleanup when done.
-- DROP TABLE UniqueClientList,FragLevel,TestTable

Here are the results...
[image: per-day fragmentation results from dbo.FragLevel]

The reason for the massive fragmentation is because there are daily inserts and, although we're inserting in the same order as the Clustered Index keys on a daily basis, we're not inserting in the Clustered Index order on an overall basis.

I'm working on a larger example to demonstrate why you may have missed the fragmentation in the past.

Jeff - what is the fill factor on the clustered index you created for this test table? If the fill factor does not allow space for inserts/updates - then yes, absolutely - you will encounter fragmentation immediately that will get worse over time.

However - if you set the fill factor to 95 and rebuild the index with that fill factor every 7 days, then after the initial load of data you will see the table normalize to around 10% fragmentation over time.

Modify your test to run for 120 days with the fill factor at 95 and a rebuild every 7 days and you should see something like this:

[image: fragmentation over 120 days with fill factor 95 and weekly rebuilds]

You cannot just look at the first 7 days either... since the table will be in mixed extents and will show very high fragmentation until it gets large enough.
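
A rough sketch of that modification, assuming the loop from the earlier script stays otherwise the same:

--===== Sketch only: same daily INSERT and fragmentation measurement as the original loop,
--      but run for 120 days and rebuild at fill factor 95 every 7th day.
DECLARE  @Counter   INT = 1
        ,@Days      INT = 120
;
  WHILE @Counter <= @Days
  BEGIN
        --(... the same daily INSERT and fragmentation capture as in the original script ...)

        --===== Every 7th day, rebuild the clustered index with the suggested fill factor.
         IF @Counter % 7 = 0
            ALTER INDEX IXC_TestTable ON dbo.TestTable REBUILD WITH (FILLFACTOR = 95);

         SELECT @Counter += 1;
    END
;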

The situation in your script is vastly different from the question as stated, which was about huge existing tables that are being queried. That's quite different from an empty table being loaded. For example, if a table holds 3 years' worth of data, then 1 day's load represents less than 0.1% of the table, a trivial amount of fragmentation overall.

I'll look at it, something else may be missing. Perhaps key first by date if that turns out to be more useful in this case. And/or separate partitions for large clients' data. Nothing here yet proves that clus by identity is the best method.

Also, you didn't address the lookup queries themselves, which were the focus of the original questions. It's the total I/O that's important, with both methods, including one clustered on identity (as you presumably prefer). What covering index would you need? You added 200 bytes to every row, which will help your lookup ratio; with proper normalization, page compression, and even forcing things off-page as required, that row size is somewhat bigger than average, though not extremely so.

Even then, you have the issue of the "tipping point", where unless the non-clus index is fully covering, SQL may suddenly fall back on a full table scan. It's endless futzing over non-clus indexes if the clus index is not the best, certainly every time a new column gets added to a result query (since the hand-crafted covering index is obviously no longer covering).
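
To illustrate (a rough sketch with made-up names): the hand-crafted index covers the original query, but the moment the query grows a column it stops covering, and past the tipping point the optimizer gives up on key lookups and scans the whole clustered index:

--Sketch only, made-up names: identity-clustered table with a hand-crafted covering index.
 CREATE NONCLUSTERED INDEX IX_Orders_Client_OrderDate
     ON dbo.Orders (Client, OrderDate)
     INCLUDE (Amount)
;
--Covered: a seek on the nonclustered index satisfies this by itself.
 SELECT Client, OrderDate, Amount
   FROM dbo.Orders
  WHERE Client = 'Client0042'
    AND OrderDate >= '20190101'
;
--Not covered: the new Status column forces key lookups and, past the tipping point of
--estimated rows, the optimizer switches to scanning the clustered index instead.
 SELECT Client, OrderDate, Amount, Status
   FROM dbo.Orders
  WHERE Client = 'Client0042'
    AND OrderDate >= '20190101'
;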

I don't have any more time now but I'll follow up later.

To be sure, I wasn't trying to match what the OP was using (and he didn't describe it much, anyway). I was trying to match what you were talking about (even the column names). :wink:

And, yes... I absolutely agree on the idea of a bigger table and that's why I said I'm working on a larger example. I just need more details, preferably in the form of a CREATE TABLE statement as I've asked for on this and other threads. The testing that I've done does show fragmentation slowing down after more than a year of inserts (it finally drops to less than 30% fragmentation for weekly maintenance) but it's still substantial for a table that's supposed to be, or become, large. As you know, large clustered indexes are a real pain to defrag, especially when they have a lot of other columns, which is the only reason why I padded each row with 200 bytes to simulate other columns in the CI.

As to the testing of queries, I agree. I've not tested that for this thread because I don't yet know what the table should actually consist of in terms of schema or data. One good example of that is that I'm currently inserting 1000 rows per day, one for each Client. Real life doesn't work that way and some clients will have more daily entries than others. I'm just waiting for someone to further define what the table and data should look like.

Funny that you bring up "forcing things off-page as required"... I've almost just finished a presentation on that very subject.

Way ahead of you, Jeff. I have a 4 year test completed. It does use a Fill Factor of 100 and you might think that's a problem, but there are no expansive updates here and the code you're looking at actually does things in the same order as the clustered index keys. There aren't going to be any page splits (it's a "Sequential Silo" pattern that I've previously documented that causes no page splits) and a 100% Fill Factor is actually appropriate. The fragmentation that occurs will be purely from out-of-order pages being created.

That depends. If you ever UPDATE those rows after you've added a trigger to the table, regardless of any specific column being UPDATEd or not, or set SNAPSHOT isolation on, then you could see a lot of page splits if the pages are 100% full.
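
The reason is the row-versioning tag: once a trigger exists or snapshot isolation is allowed, every UPDATE has to add roughly 14 bytes of versioning info to each row it touches, so 100%-full pages can split even when no column gets any wider. A rough sketch (the database name is a placeholder):

--===== Sketch only: once snapshot isolation is allowed (or a trigger exists), an UPDATE has
--      to stamp each modified row with a 14-byte versioning tag.  On pages built at a 100%
--      fill factor, that growth alone can split pages even though no column gets any wider.
 ALTER DATABASE [YourTestDB] SET ALLOW_SNAPSHOT_ISOLATION ON;

 UPDATE dbo.TestTable
    SET OtherColumns = OtherColumns  --A "no change" update, but each row still grows by the tag.
;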

Any results yet?

Yes and no... I started on it but was overtaken by work and actually lost track of this post. I'll see if I can resurrect what I've tested sometime this weekend.

Actually, for you, Scott, I'm still waiting on you to provide a CREATE TABLE statement along with the definition of the clustered index you're talking about, plus a couple of lines of data, so that I can try to match what you're talking about. This is important because you took exception to me padding the index with what I believe a reasonable amount of "other" data would be. Rather than trying to second guess you, it would be helpful if you'd post the information I've outlined above for your example.

CHAR(100) rather than CHAR(200) would be fine. Even at CHAR(200), for a huge existing table, as in the actual OP, the best clustering will match the query patterns, not with an identity column.

Again I ask, please provide the CREATE TABLE, the code for the index, and a couple of lines of data so that I don't have to go back and forth with you on something that I may have misunderstood. For example, on the 22nd post of this thread, you mentioned that you would include an IDENTITY column for "uniqueness". I included such a column and now you're saying not to.

It won't take you that long, Scott, and it will be absolute clarification. Thanks.