Creating clustered index on already sorted, very large, table

ranvir · August 11, 2015, 2:00pm

Hi all,

I have a heap which is already sorted on a certain column and I want to put a clustered primary key on this same column. Will this be relatively quick because it is already sorted?
Or is it the same as putting the clustered index on unsorted data?

This is a very large table with almost 1 Billion rows.

Thanks

ScottPletcher · August 11, 2015, 2:18pm

It will be quicker, but I believe SQL will still have to read all the input rows before it can start writing the output rows. If tempdb is large enough, I suggest using SORT_IN_TEMPDB = ON for the build, particularly if tempdb is on faster storage.

Kristen · August 11, 2015, 3:14pm

How did it come to be sorted? Originally imported in ascending proposed Clustered Index Key order? or all records added in that order?

I don't know if that makes the creation of a clustered index "quick" ... or how you could be confident / double-check that the physical organisation of the records actually matches the clustered index key(s)

My thoughts:

If you have Enterprise Version (forgotten when it was introduced) there is an WITH (ONLINE=ON) option, but you might still run out of TEMPDB space ... I think you are Hosed if you have Blob data(**) in the table?? If you have any other (non clustered) indexes on the table then I would drop them and re-create them afterwards - changing the row identifier in the secondary indexes to Cluster Index Keys would make a hard job much harder for SQL.

(**) I think that includes any narrow varchar columns that have got pushed over into ROW_OVERFLOW_DATA allocation unit by larger varchar columns.

Insert existing rows into a new table, with a Clustered Index already defined. Do this in batches

You'll need a trigger on the underlying table so that any changes, after you start the batch job, running will be mirrored. (Haven't checked but Change Tracking might work instead of a Trigger)

Downside: needs as-much-space-again as the existing table, plus loads of log space, and probably loads of TEMPDB too. Maybe you have loads of space?!!

Without a pre-existing Clustered Index you need a smart way to handle the batches otherwise the "What's the next 100,000 rows that I haven't done yet" query could cripple SQL all by itself!

You could export a BCP file sorted in clustered key order, then import that back in to a temporary table (with the "Import file is in Clustered Key order" hint). You then have to figure out which rows in the main table have been changed since you started ... maybe you only ever INSERT rows to this table ... if not you'd, again, need a Trigger on the main table which just dumps changes into a sequential table (Add columns for IDENTITY and Action=INSERT/UPDATE/DELETE and then all original columns). After the BCP has finished importing the table you can then apply everything from your Audit Table to the new temporary table, then then (inside a transaction, or scheduled downtime) drop the original table, rename the temporary table across, and apply any final audit changes that were made "just before you started the locked-period)

Probably worth thinking about pre-expending the files (rather than letting them Auto-Grow)

Hopefully you will be creating a UNIQUE Clustered Index? Might be worth checking that the Key Columns are indeed unique - saves running a process for hours/days and then having it error out AND rollback (more hours/days) due to a DUPE. Perhaps create it UNIQUE with IGNORE DUPs ? (Belt & Braces )

Take the opportunity to partition the table - should improve performance (SQL Queries will only actually reference the physical underlying table(s) that the Range of the query actually impacts)

SORT_IN_TEMPDB will help, as Scott said - but your TEMPDB needs to be on different physical media [i..e not the same as the Application DB]