Hello, I'm trying to copy 70 million rows into a snapshot table. The data comes from multiple tables, and the process takes about four hours to generate the roughly 70 million records.
The table receiving the data has indexes, including a large composite key, so the insert takes several hours. If I remove the indexes, the insert takes under an hour, but rebuilding the indexes afterwards takes almost two hours.
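Roughly, the current process is the equivalent of this (simplified sketch with made-up object names; one statement per index):

-- remove each index before the load, e.g. by disabling it
ALTER INDEX <index_name> ON dbo.snapshot_table DISABLE

-- ... the big INSERT INTO dbo.snapshot_table SELECT ... FROM <multiple_tables> runs here ...

-- rebuild afterwards - this is the step that takes almost two hours
ALTER INDEX <index_name> ON dbo.snapshot_table REBUILD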
I know there must be a better way, but I have a mental block about it.
Are you also "removing" and recreating the Clustered index? If so, it might be worth trying the load with the clustered index in place, and then adding the secondary indexes afterwards.
My only other thought is to add the rows in clustered index key order. I don't know if this is possible, or even necessary, with INSERT (e.g. if you just add an ORDER BY to the SELECT part of the INSERT); I am basing the thought on the fact that BCP etc. have a hint to indicate that the data being bulk-imported is pre-sorted in clustered index order.
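For example, something along these lines (placeholder names; I haven't verified that a plain INSERT ... SELECT actually honours the ORDER BY for insert order, so treat it as a sketch):

INSERT INTO dbo.snapshot_table
SELECT ...
FROM <multiple_tables>
ORDER BY <clustered_index_key_column(s)>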
I suggest first bulk-loading the 70M rows into a staging table with exactly the same columns, except no identity column. Also exactly the same clustered index, but do NOT create any nonclustered indexes on the staging table. For max speed, you could put the staging table in tempdb.
Then load the staging table into the main table in fairly large batches, say 1M or 2M rows.
Sorry, I didn't write it clearly enough. You do not need an identity column. Create a table that is identical to the existing one to do the initial load into:
SELECT TOP (0) *
INTO tempdb.dbo.snapshot_table_staging
FROM maindb.dbo.snapshot_table
CREATE CLUSTERED INDEX snapshot_table_staging__CL ON tempdb.dbo.snapshot_table_staging ( <same_key_column(s)_as_on_main_table>)
INSERT INTO tempdb.dbo.snapshot_table_staging
SELECT ...
FROM <multiple_tables>
Then copy rows in chunks from staging to the main table, in clustered key order.
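As a rough sketch of that chunked copy (assuming a single-column int clustered key for illustration; substitute your real key column(s) and adjust the batch size):

DECLARE @StartPoint int, @EndPoint int

SELECT @StartPoint = MIN(<key_column>) FROM tempdb.dbo.snapshot_table_staging

WHILE @StartPoint IS NOT NULL
BEGIN
    -- find the key roughly 1M rows further on; that marks the end of this batch
    SELECT @EndPoint = MAX(<key_column>)
    FROM ( SELECT TOP (1000000) <key_column>
           FROM   tempdb.dbo.snapshot_table_staging
           WHERE  <key_column> >= @StartPoint
           ORDER BY <key_column> ) AS X

    INSERT INTO maindb.dbo.snapshot_table
    SELECT *
    FROM   tempdb.dbo.snapshot_table_staging
    WHERE  <key_column> >= @StartPoint
      AND  <key_column> <= @EndPoint
    ORDER BY <key_column>

    -- next batch starts just beyond the rows copied above
    SELECT @StartPoint = MIN(<key_column>)
    FROM   tempdb.dbo.snapshot_table_staging
    WHERE  <key_column> > @EndPoint
END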
Thanks. What I did was create a loop copying 1M rows at a time, then deleting them from the temp table using the OUTPUT clause of the INSERT and the clustered index columns.
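In case it helps anyone following along, the shape of that loop is roughly this (simplified, hypothetical names, assuming a unique single-column clustered key):

DECLARE @CopiedKeys TABLE ( <key_column> int PRIMARY KEY )
DECLARE @Rows int

SELECT @Rows = 1

WHILE @Rows > 0
BEGIN
    DELETE FROM @CopiedKeys

    -- copy the next 1M rows, capturing their keys via OUTPUT
    INSERT INTO maindb.dbo.snapshot_table
    OUTPUT inserted.<key_column> INTO @CopiedKeys
    SELECT TOP (1000000) *
    FROM   tempdb.dbo.snapshot_table_staging
    ORDER BY <key_column>

    SELECT @Rows = @@ROWCOUNT

    -- remove the rows just copied so the next pass picks up where this one left off
    DELETE S
    FROM   tempdb.dbo.snapshot_table_staging AS S
           JOIN @CopiedKeys AS K
                ON K.<key_column> = S.<key_column>
END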
Yes. For performance, the critical thing is to do it in clustered key sequence. It's much easier if the clustering key is unique, but it's doable even if it's not.
Why is this two-step process (all to Staging, then in batches to Actual Table) more performant than "all to Actual Table"?
Would selecting from the "multiple tables" with a sliding window / paging offset achieve the same thing as "inserting in batches" (without having to populate the staging table first)?
Are you using a bulk insert like @Kristen's example, or are you inserting row by row with a cursor, e.g. INSERT INTO TargetTable (Col1, Col2, ...) VALUES (val1, val2)?
You must use a bulk insert.
Let us see an example of your insert query.
Thanks
How are you handling the LOOP? For me it would be critical that there was almost-zero-effort in "finding" the rows for the next batch. So I aim for this:
FROM MySourceTable AS S
WHERE S.MyClusteredIndexKey >= @StartPoint
AND S.MyClusteredIndexKey < @EndPoint
because this is really easy for SQL to do. I definitely don't have anything in the WHERE clause that does a NOT EXISTS or anything like that (to find the start point based on what has already been copied).
When we bulk-delete stale rows I do something like this:
SELECT [T_ID] = IDENTITY(int, 1, 1),
       MyClusteredIndexKey
INTO #TEMP
FROM MyRealTable
WHERE ... any selection criteria I need ...
ORDER BY MyClusteredIndexKey
My #TEMP table is then in clustered key order and I can use the IDENTITY column to "page" the #TEMP table.
I can then easily do
DELETE D
FROM #TEMP AS T
JOIN MyRealTable AS D
  ON D.MyClusteredIndexKey = T.MyClusteredIndexKey
WHERE T_ID >= @StartPoint
AND T_ID < @EndPoint
and then I can "slide" @StartPoint and @EndPoint to the next "page". All the deletes, in each batch, are tightly grouped in Clustered Index Order - so they should be more efficient than "jumping around all over the place"!!
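So the driving loop ends up being something like this (rough sketch; batch size and variable names are just for illustration):

DECLARE @BatchSize int, @StartPoint int, @EndPoint int, @MaxID int

SELECT @BatchSize = 1000000, @StartPoint = 1
SELECT @MaxID = MAX(T_ID) FROM #TEMP

WHILE @StartPoint <= @MaxID
BEGIN
    SELECT @EndPoint = @StartPoint + @BatchSize

    DELETE D
    FROM #TEMP AS T
         JOIN MyRealTable AS D
              ON D.MyClusteredIndexKey = T.MyClusteredIndexKey
    WHERE T.T_ID >= @StartPoint
      AND T.T_ID < @EndPoint

    -- slide the window to the next page
    SELECT @StartPoint = @EndPoint
END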
I'm sure you know, and are doing, all this, so I'm just mentioning it Belt & Braces.
Is this a full load? If yes, and if the schema of the staging table is exactly the same as the destination table's, maybe you can try switching the data from source to destination.
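If "switching" means an ALTER TABLE ... SWITCH, a minimal sketch would be something like this (made-up names; it assumes both tables are in the same database and filegroup, have identical columns and indexes, and the destination is empty):

-- the destination must be empty before the switch
TRUNCATE TABLE dbo.snapshot_table

-- metadata-only operation: moves all rows from staging into the destination
ALTER TABLE dbo.snapshot_table_staging SWITCH TO dbo.snapshot_table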