I have a table in a database (it's a data warehouse) which is populated once a day by a truncate table and is then repopulated from another database. The table being populated has a primary key of an integer field which is the identity field from the source table. The insert statement has an order by statement as part of the code so the data is definitely going into the table in the order of the primary key of the destination table. However after population the fragmentation on the primary key index is over 96 percent. How can that be ?
Is the primary key column also the clustering key column?
The table only has one index and that is on that one field and is a primary key clustered.
Multiple insert threads maybe? Set MAXDOP=1 ?
Can you use Trace Flag 610 maybe? Might improve insert performance.
The MS DOCs suggest :
If the table is empty and has a clustered index and the data in the data file is ordered to match the clustered index key columns, additionally do the following:
- Bulk-import the data with the clustered index already in place.
- Specify the ORDER hint, as well as the TABLOCK hint.
INSERT queries that use SELECT with ORDER BY to populate rows guarantees how identity values are computed but not the order in which the rows are inserted. +
so the insert order could be anything SQL thinks is a good idea at the time.
If you are using SSIS to load the data - then insure the query that extracts the data from the source system extracts the data in the order you want.
Also make sure you check the box for 'Table lock' as well as setting the rows per batch and maximum insert commit size (I use 200000 as a starting point).
Of course - you can just eliminate this as an issue by rebuilding all indexes on the table after the data has been loaded. Accept the fragmentation is going to happen regardless of what you do to prevent it - and just rebuild with 100% fill factor after every load as part of the normal process.
How wide is the table? Is the identity column used in joins or criteria? Clustered Indexes should be on the columns that would boost performance the most and not necessarily on the PK.