How to make many big inserts while preventing a lot of unused space

ymoses · January 26, 2023, 7:34pm

I have huge table 'Positions' with a lot of unused space. I want to export it to Positions_New.
And to prevent huge growth of Log, I insert it day by day.

Insert into Positions_New (Vehicle,PosTime,PosLat,PosLon,House,Street,City)
Select Vehicle,PosTime,PosLat,PosLon,House,Street,City
From Positions Where PosTime >= '20190101' And PosTime < '20190102'
GO

Then next day and so on. Custered Primary Key is: Vehicle, PosTime (Vehicle 1st then PosTime) and there is also an index on PosTime, so the Insert is fast. But the order of Insert doesn't fit the Primary Key.
Does it causes a lot of unused space in the new table?
Thank you very very much.

jeffw8713 · January 26, 2023, 8:58pm

What do you mean by unused space? There is no reason to copy the data into a new table to recover 'unused' space. Perform an index rebuild instead.

If you don't have enough log space available to perform an index rebuild - then you need to extend that drive so you do have enough space available.

ymoses · January 26, 2023, 9:16pm

Thank you very much.
I made rebuild. The NDF file grew twice its size, and the LOG file not so much. It took 5 hours, Users could not access the table. The NDF grew so much, and after the process a lot of space became free, but only in the file, not in disk.
The old table contained a huge size of unused space, about third of the table size. The rebuild was very good.
But yet, my question is very interesting me.
Thanks.

ScottPletcher · January 26, 2023, 9:20pm

You could see a lot of page splits.

Probably best would be to pick some number of days that can be INSERTed without causing undue delays or excessive log growth(*) and reducing page splits. For example, say a week at a time rather than 1 day.

(*) Of course if the db is not in simple log mode, you will have to do a log backup to prevent the SQL log file from growing larger and larger.

JeffModen · January 31, 2023, 12:21am

@ymoses ,

You say you have a HUGE table. I've found that the definition of "huge table" varies a whole lot from person to person, so let's get a little more definitive here. Could you post the results from the following code, please?

DECLARE @pTableName SYSNAME = PutYour2PartTableNameHere --I recommend you use the 2 part naming convention of schema_name.table_name here
;
--===== Return the summarized physical stats information in table format for the given table.
 SELECT TOP 2147483647 --TOP is necessary for a sort within a function.
         ObjectName     = CONCAT(OBJECT_SCHEMA_NAME(sta.object_id)+'.',OBJECT_NAME(sta.object_id)) --Rev 01
        ,IndexName      = idx.name          --Rev 01
        ,Fill_Factor    = idx.fill_factor   --Rev 01
        ,SizeMB         = CONVERT(DECIMAL(9,1),sta.page_count/128.0)
        ,IdxID          = sta.index_id
        ,PageType       = sta.alloc_unit_type_desc
        ,IdxLvl         = sta.index_level
        ,FragPct        = CONVERT(DECIMAL(9,4),sta.avg_fragmentation_in_percent)
        ,PageCnt        = sta.page_count
        ,PageDensity    = CONVERT(DECIMAL(9,4),sta.avg_page_space_used_in_percent)
        ,MinRecSize     = sta.min_record_size_in_bytes --This does NOT include the Slot Array of 2 bytes per row.
        ,AvgRecSize     = sta.avg_record_size_in_bytes --This does NOT include the Slot Array of 2 bytes per row.
        ,MaxRecSize     = sta.max_record_size_in_bytes --This does NOT include the Slot Array of 2 bytes per row.
        ,AvgRowsPerPage = ISNULL(8096/NULLIF(CONVERT(INT,sta.avg_record_size_in_bytes+2),0),0) --See Programmer Note #5
        ,RecCnt         = sta.record_count
   FROM sys.dm_db_index_physical_stats(DB_ID(),OBJECT_ID(@pTableName),NULL,NULL,'DETAILED') sta
   JOIN sys.indexes idx                 --Rev 01
     ON idx.object_id   = sta.object_id
    AND idx.index_id    = sta.index_id
  WHERE OBJECT_ID(@pTableName) > 0      --Rev 01
  ORDER BY ObjectName, PageType, IdxID, IdxLvl --ORDER BY not normally done in a function but is real handy for demo purposes.
;

ymoses · February 1, 2023, 8:22am

Thank you so much.
After 10 minutes, I stopped the run of the query.
Instead I ran sp_SpaceUsed:
SizeGB, Table, ROWS, RESERVED, DATA, INDEX_SIZE, UNUSED
417, dbo.Position, 2,555,883,381, 417 GB, 412 GB, 2 GB, 3 GB

The only index is Primary Key on (DT datetime, Vehicle varchar (50))
Does this teach us enough?
Thank you very much.
Yossi Moses

JeffModen · February 1, 2023, 3:01pm

Your table is nearly half a TeraByte. The code I posted runs sys.dm_db_index_physical_stats() for both indexes on the table. That would take some time. No use now... you said you did a rebuild on the index.

I need to know the datatypes for the Vehicle and PosTime columns and I need an example of, say, 10 different vehicle identifiers so that I can confirm what I'm thinking. It would also be helpful if you could tell me how often a row is inserted for each vehicle in a day.

Also, stop rebuilding indexes on that table using the method(s) that you're currently using. I can show you a method that will leave zero wasted space in the NDF... I just don't have time to write it up right now because I'm "at work".

While you're waiting for that, I need the info on the Vehicle and PosTime columns I ask for above, please.

ymoses · February 1, 2023, 3:43pm

Thank you, I wrote that the only index is the Primary Key and the 1st field is DT - datatype - datetime, 2nd field is Vehicle - datatype - varchar (50).

ymoses · February 1, 2023, 3:45pm

Fix: Not DT but PosTime, anyway, same idea.

JeffModen · February 1, 2023, 5:20pm

Your post made it sound (see the bold above) like you have more than one index.
You also stated the key order differently than in your original post.

Which is actually the correct key order of the clustered index?

Can I have a handful of examples for Vehicle column, as a Iasked?

ymoses · February 4, 2023, 7:56pm

Dear JeffModen, You wrote:
"Stop rebuilding indexes on that table using the method(s) that you're currently using. I can show you a method that will leave zero wasted space in the NDF... "
Please teach ... Thanks so much. Yossi Moses

JeffModen · February 4, 2023, 11:17pm

SUMMARY OF WHAT WE KNOW SO FAR:

In your first post, you wrote...

In reply to one of my questions (and you didn't let my code finish so that it could tell us more)...

... and that certainly doesn't have a lot of unused space BUT... it turns out that you did a rebuild on it and it doubled the disk space used (the reason is because REBUILDs create an entirely new copy of the Clustered Index, which is the full table, before dropping the old one and it does NOT release that space. (Don't bother running my code now because your REBUILD destroyed all the "evidence" )

You also say that the Positions_New table has Clustered Index for the Positions table is as follows...

You also say that the Clustered Index on the Positions_New table is by Vehicle and then Postime. You also have an index on the Positions_New table for the Postime column.

And, the Positions table is, in fact, a bit of a monster weighing in at a little over 4/10ths of a TeraByte.

Step 1:
Get rid of the Positions_New table. Instead of a nice ascending, ever-increasing key based on the Postime column, you're changing it to the Vehicle column. That's going to cause a HUGE amount of logical fragmentation even if it does form "Sequential Silos", which should never have low page densities.

The trouble is, you don't know what caused the fragmentation (and the unused space) in the original table. What makes you think changing the Clustered Index is actually going to fix that? What makes you think that changing the Clustered Index is going to improve performance? Have you checked ALL of the reporting queries to find out what order they need to get the key info as? Do you know the actual performance of those queries on the old table and have discovered exactly WHY you have performance issues or even if you actually have any performance issues?

Or is this just because the table "had a lot of free" space in it?

Step 2:
Figure out the reason WHY the original table has a lot of "free space" in it. For example, is it because you're doing INSERTs and then later doing UPDATEs on the table, which causes rows to grow which causes both logical fragmentation AND low page densities? Creating a new table with a different Clustered Index is NOT going to fix that!

Summary of What Needs to be Done Before Proceeding:
The bottom line here is that I believe that your making a huge mistake by creating that new table, especially with the Clustered Index you've chosen. Until you figure out the things that I've listed in Step 2 above, I believe that you're just wasting time and resources. Figure those things out and come back when you have answers, especially on the reason(s) WHY the table had a bunch of free space in it.

And, no... I'm not being mean or condescending here. I'm trying to help (as are the others) but you have to stop and identify what's happening because you don't actually know what you need to fix first and neither do any of the rest of us.

ymoses · February 16, 2023, 2:59pm

Thank you so much, I will try, but these days I can't , So it waits.
Somehow I missed your answer, I'm sorry.
I just now passed a new question, What's better in creating new index, and the order of Inserts doesn't fit the order of the Index, to create it with fillfactor (80 ?) or it's not necessary (I mean for the fragmetation)

robert_volk · February 16, 2023, 3:40pm

Generally speaking, there's no reason to use any fill factor except 100%.

ScottPletcher · February 16, 2023, 3:45pm

That seems very extreme. That would mean that anytime a single varchar column expands the page could split. Yikes!

ymoses · February 16, 2023, 4:16pm

Hi, First, thank you very much.
But can you explain me more?
I think it is the most normal thing in life. Having index on person names. Then, of course, each insert is another person in no order, Tom, David, John... and so on. what's wrong?

ymoses · February 16, 2023, 4:39pm

Hi, Before SQL Server 2017, there was no option of rebuild online , So 5 hours I had to prevent using the table. By creating it newly, the table can be accessed all the time.

JeffModen · February 16, 2023, 6:28pm

I'll have to take extreme exception to that bit of advice. If you rebuild and index, it's most likely because it was fragmenting. With some exceptions, if you rebuild it to 100% you are guaranteeing that it will fragment almost immediately and massively. That's caused by some serious page splits, which are typically 43 or or more worse for the log file, cpu, duration, etc.

I'll also state that it's better to do no index maintenance than it is doing it wrong and rebuilding fragmented indexes to 100% in a willy-nilly fashion is doing it wrong.k

And, yes... I'm speaking from experience where my production system had massive performance and blocking issues on the proverbial morning after such things were done because I was ignorant about them back then.

JeffModen · February 16, 2023, 6:33pm

I think you have your dates mixed up quite a bit. I've not checked the full history but Online Index Rebuilds have been available since 2005.

RESUMABLE online index rebuilds became available in 2017 but ONLINE index rebuilds were available since 2005.