SQLTeam.com | Weblogs | Forums

Copy large table to table


#1

Hello, I'm trying to copy 70 million rows into a snapshot table. The data comes from multiple tables, and the process takes about four hours to generate around 70 million records.

The table receiving the data has indexes, including a large composite key, so inserting takes several hours. If I remove the indexes the insert takes under an hour but to rebuild the indexes takes almost two hours.

I know there must be a better way, but I have a mind block about it. :slightly_smiling:


#2

Are you also "removing" and recreating the Clustered index? If so might be worth trying with clustered index in place, and then adding the secondary indexes afterwards.

My only other thought is to add the rows in clustered key order. I don't know whether that is possible, or even necessary, with INSERT (e.g. if you just add an ORDER BY to the SELECT part of the INSERT); I'm basing the idea on the fact that BCP etc. has a hint to indicate that the data being bulk-imported is pre-sorted in clustered index order.
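As a sketch of the ordered-insert idea (table and column names here are placeholders, not from the thread):

```sql
-- Sketch: insert in the target's clustered key order, so SQL Server can
-- append to the end of the clustered index instead of splitting pages.
-- TABLOCK can additionally enable minimal logging under the simple or
-- bulk-logged recovery model (subject to the usual preconditions).
INSERT INTO dbo.TargetTable WITH (TABLOCK) (KeyCol1, KeyCol2, Col3)
SELECT s.KeyCol1, s.KeyCol2, s.Col3
FROM dbo.SourceTable AS s
ORDER BY s.KeyCol1, s.KeyCol2;   -- clustered index key order of TargetTable
```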


#3

Thank you @Kristen. I will let you know in a couple of hours how the order by worked.


#4

"Rather your server, than mine" :slightly_smiling:


#5

:slightly_smiling:
I have a test system I try it on before going to my development system before being reviewed for production. :yak:


#6

That sounds great. Perhaps you can run my tests as well as your own?!!


#7

I suggest first bulk-loading the 70M rows into a staging table with exactly the same columns, except no identity column. Also exactly the same clustered index, but do NOT create any nonclustered indexes on the staging table. For max speed, you could put the staging table in tempdb.

Then load the staging table into the main table in fairly large batches, say 1M or 2M rows.


#8

Would that stage be sorted by Clustered Index, or should that not matter?


#9

Forgot about doing it that way. How would you suggest grouping if you do not have an identity column?


#10

Sorry, I didn't write it clearly enough. You do not need an identity column. Create a table identical to the existing one and do the initial load into it:

SELECT TOP (0) *
INTO tempdb.dbo.snapshot_table_staging
FROM maindb.dbo.snapshot_table
CREATE CLUSTERED INDEX snapshot_table_staging__CL ON tempdb.dbo.snapshot_table_staging ( <same_key_column(s)_as_on_main_table>)

INSERT INTO tempdb.dbo.snapshot_table_staging
SELECT ...
FROM <multiple_tables>

Then copy rows in chunks from staging to the main table, in clustered key order.
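A minimal sketch of the chunked copy, assuming (for illustration only) a single integer clustered key column `KeyCol`; the column names are placeholders and the main table is assumed to accept direct inserts into its key column:

```sql
-- Sketch: keyset-paged copy from staging to main, 1M rows per batch,
-- always advancing in clustered key order.
DECLARE @BatchSize int = 1000000;
DECLARE @LastKey   int = 0;    -- highest key copied so far
DECLARE @Rows      int = 1;
DECLARE @Copied TABLE (KeyCol int);

WHILE @Rows > 0
BEGIN
    DELETE @Copied;

    INSERT INTO maindb.dbo.snapshot_table (KeyCol, Col1, Col2)
    OUTPUT inserted.KeyCol INTO @Copied (KeyCol)
    SELECT TOP (@BatchSize) KeyCol, Col1, Col2
    FROM tempdb.dbo.snapshot_table_staging
    WHERE KeyCol > @LastKey
    ORDER BY KeyCol;

    SET @Rows = @@ROWCOUNT;

    -- advance the window past the rows just copied
    SELECT @LastKey = ISNULL(MAX(KeyCol), @LastKey) FROM @Copied;
END
```

The WHERE on the clustered key means each batch is a cheap range seek on the staging table, rather than a scan-and-skip.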


#11

Thanks. What I did was create a loop copying 1M rows at a time, then delete them from the temp table using the OUTPUT clause and the clustered index columns.
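One way to combine the copy and the delete into a single statement is the composable-DML pattern (a sketch only; the column list is abbreviated from the table shown later in the thread, and batch size is illustrative):

```sql
-- Sketch: move a batch from the temp table into the target in one statement.
-- Note: DELETE TOP (n) does not guarantee clustered-key order, so this
-- trades the ordered-insert benefit for simplicity.
INSERT INTO SNAP_ASC0 (App, RefNum, PortfolioCode)
SELECT d.App, d.RefNum, d.PortfolioCode
FROM (
    DELETE TOP (1000000) FROM #Snap_1
    OUTPUT deleted.App, deleted.RefNum, deleted.PortfolioCode
) AS d;
```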


#12

Yes. For performance, the critical thing is to do it in clustered key sequence. It's much easier if the clustering key is unique, but it's doable even if it's not.


#13

I'm probably being thick ... as usual!

Why is this two-step process (all to Staging, then in batches to Actual Table) more performant than "all to Actual Table"?

Would selecting from the "multiple tables" with a sliding window / paging offset achieve the same thing as "inserting in batches" (without having to populate the staging table first)?


#14

The thing I am not, yet!, understanding is why doing the following is not enough (on its own):

INSERT INTO TargetTable(Col1, Col2, ...)
SELECT Col1, Col2, ...
FROM <multiple_tables>
ORDER BY TargetClusteredIndexKey1, TargetClusteredIndexKey2

i.e. without staging table and without doing it "in batches" (although batches would be possible with sliding window offset)


#15

Are you using a bulk insert like @Kristen's example, or inserting row by row with a cursor, like:
INSERT INTO TargetTable(Col1, Col2, ...) VALUES (val1, val2)
?
You should use a bulk insert.
Let us see an example of your insert query.
Thanks


#16

@MohamedAliSalim, it is like @Kristen's

INSERT INTO SNAP_ASC0
(ClientMasterShortname, App, RefNum, AcctSearchKey, PortfolioCode, ContactNum,
ScrubType, ScrubSentDate, ScrubReturnDate, ScrubSource, ScrubResponse)
SELECT TOP (@TCount) ClientMasterShortname, App, RefNum, AcctSearchKey, PortfolioCode, ContactNum,
ScrubType, ScrubSentDate, ScrubReturnDate, ScrubSource, ScrubResponse
FROM #Snap_1
ORDER BY App, RefNum, PortfolioCode, AcctSearchKey, ContactNum, ScrubType, ScrubSentDate;

Which did not do well in a loop. I will try another approach and let you know.


#17

How are you handling the LOOP? For me it would be critical that there was almost-zero-effort in "finding" the rows for the next batch. So I aim for this:

FROM MySourceTable AS S
WHERE     S.MyClusteredIndexKey >= @StartPoint
      AND S.MyClusteredIndexKey <  @EndPoint 

because this is really easy for SQL to do :slightly_smiling: I definitely don't have anything in the WHERE clause that does a NOT EXISTS or anything like that (to find the start point, based on what has already copied)

When we bulk-delete stale rows I do something like

SELECT [T_ID] = IDENTITY(int, 1, 1),
       MyClusteredIndexKey
INTO #TEMP
FROM MyRealTable
WHERE ... any selection criteria I need ...
ORDER BY MyClusteredIndexKey

My #TEMP table is then in clustered key order and I can use the IDENTITY column to "page" the #TEMP table.

I can then easily do

DELETE D
FROM #TEMP AS T
     JOIN MyRealTable AS D
          ON D.MyClusteredIndexKey = T.MyClusteredIndexKey
WHERE     T.T_ID >= @StartPoint
      AND T.T_ID <  @EndPoint

and then I can "slide" the @Start and @EndPoint to the next "page". All the deletes, in each batch, are tightly-grouped in Clustered Index Order - so should be more efficient than "jumping around all over the place"!!
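A minimal driver for sliding the page might look like this (a sketch; the batch size and variable names are illustrative, and the delete is the one from the snippet above):

```sql
-- Sketch of the paging loop; assumes #TEMP from above with T_ID = 1..@MaxID.
DECLARE @BatchSize  int = 1000000;
DECLARE @StartPoint int = 1;
DECLARE @EndPoint   int;
DECLARE @MaxID      int;

SELECT @MaxID = MAX(T_ID) FROM #TEMP;

WHILE @StartPoint <= @MaxID
BEGIN
    SET @EndPoint = @StartPoint + @BatchSize;

    DELETE D
    FROM #TEMP AS T
         JOIN MyRealTable AS D
              ON D.MyClusteredIndexKey = T.MyClusteredIndexKey
    WHERE T.T_ID >= @StartPoint
      AND T.T_ID <  @EndPoint;

    SET @StartPoint = @EndPoint;   -- slide to the next page
END
```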

I'm sure you know, and are doing, all this so just mentioning it Belt & Braces


#18

Is this a full load? If so, and if the schema of the staging table is exactly the same as the destination table, maybe you can try switching the data from source to destination.
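For two identically structured tables, such a switch is a metadata-only operation (a sketch; the table names follow the earlier examples, and note that both tables must be in the same database and filegroup, so a staging table in tempdb would not qualify):

```sql
-- Sketch: metadata-only switch of all rows from staging into the target.
-- Requires identical schema, indexes and constraints, the same filegroup,
-- and an empty target table (or target partition).
ALTER TABLE dbo.snapshot_table_staging
    SWITCH TO dbo.snapshot_table;
```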