I'm getting "The transaction log for database 'xxx' is full due to 'ACTIVE_TRANSACTION'" when I'm running a process to insert 309 million records into a table. We're using minimal-logging methods, and the INSERT statement only has a couple of CASE statements and one function.
I'm able to run the query when I break it down into service date years (2014, 2015, etc.) within SSIS, but I'm hoping to find a way to have it run for all years (2014 - present).
Here's what my insert looks like (this is 309 million records):
select bunch of fields,
some case statements,
one function (the function is just another query - no updates or inserts)
from MedClaims c
JOIN ClaimHeader h
ON C.ID = H.ID
where c.Date >= '01/01/2014'
Is 309 million records just too many, even with minimal-logging techniques? When I break it out to run in chunks, it takes about 20-30 hours to complete.
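For reference, the year-by-year chunking looks roughly like this in T-SQL. This is a sketch only; the table/column names are placeholders standing in for my real ones:

```sql
-- Hypothetical sketch: insert one service-date year per batch so each
-- transaction commits separately and the log can be reused between years.
DECLARE @yr int = 2014;
WHILE @yr <= YEAR(GETDATE())
BEGIN
    INSERT INTO dbo.TargetClaims WITH (TABLOCK)
    SELECT c.*   -- plus the CASE expressions and the function call
    FROM dbo.MedClaims AS c
    JOIN dbo.ClaimHeader AS h
        ON c.ID = h.ID
    WHERE c.Date >= DATEFROMPARTS(@yr, 1, 1)
      AND c.Date <  DATEFROMPARTS(@yr + 1, 1, 1);

    CHECKPOINT;  -- under SIMPLE recovery, lets log space be reused
    SET @yr += 1;
END;
```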
How did you verify that you're using "minimal logging"?
Keep in mind that most often you'll need a "WITH ( TABLOCK )" hint to allow minimal logging. "SELECT ... INTO dbo.new_table" also does minimal logging but you don't want to do that for the actual inserts, only to create the table structure, because of the internal system table locks it holds.
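To illustrate, on SIMPLE or BULK_LOGGED recovery an insert shaped like this can qualify for minimal logging (table and column names here are placeholders, not the poster's actual schema):

```sql
-- Minimal-logging pattern: TABLOCK hint on the target table.
-- Easiest case is a heap or an empty clustered-index table;
-- recovery model must be SIMPLE or BULK_LOGGED.
INSERT INTO dbo.TargetClaims WITH (TABLOCK)
    (ID, ServiceDate, PaidAmount)
SELECT c.ID, c.Date, c.Paid
FROM dbo.MedClaims AS c
JOIN dbo.ClaimHeader AS h
    ON c.ID = h.ID
WHERE c.Date >= '20140101';
```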
There's no cross-server or cross-database traffic. The recovery model is set to SIMPLE, as far as I know.
The stored procedure is actually on Server B, while the inserts are all done within Server A. There's no traffic crossing over to any other database or server, though.
Oops, sorry. I meant the stored procedure is on Database B, but the INSERT statement is all performed on Database A. So there's no data going from one DB to another. We don't think that would eliminate minimal logging though, right?
And why is this huge dataset being transferred to another database? Performance reasons? Separation of concerns?
Have you looked at the option of creating a proper data lake or data store specific to this BI endeavor?
Dimensions and facts, etc.
Data isn't being transferred from database to database, just tables. I recreated the stored procedure in Database_A just to try out if things will run better tonight.
We're a pretty small analytics shop but we are investing in tools to help get us to building out a proper data lake.
Before:
use database_B
GO
select bunch of fields,
some case statements,
one function (the function is just another query - no updates or inserts)
from database_A.dbo.MedClaims c
JOIN database_A.dbo.ClaimHeader h
ON c.ID = h.ID
where c.Date >= '01/01/2014'
Now Running with this tonight:
use database_A
GO
select bunch of fields,
some case statements,
one function (the function is just another query - no updates or inserts)
from database_A.dbo.MedClaims c
JOIN database_A.dbo.ClaimHeader h
ON c.ID = h.ID
where c.Date >= '01/01/2014'
Oh man, I forgot to ask the simple question (which is usually the best answer):
Do you have separate files for the .mdf and .ldf? Are they being saved to local disk? And what is the state of the drive(s) on which they are saved?
How big is the LDF transaction log file when it fails? It would be helpful to know what size of problem this is.
I wonder if it would be worth using the query to output to a BCP file (in native format), sorted by the clustered index of the target table, and then doing a bulk import on it?
Seems like a sledgehammer to crack a nut, though ...
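Roughly, the export/re-import idea would look like this. A sketch only; the server name, file path, and table/column names are all placeholders:

```sql
-- Export in native format, sorted to match the target's clustered index.
-- (bcp is run from a command prompt, not inside SSMS.)
-- bcp "SELECT ... FROM database_A.dbo.MedClaims ORDER BY ID" queryout C:\load\claims.dat -n -S ServerA -T

-- Then bulk-import, telling SQL Server the data is pre-sorted so it can
-- skip the sort, with TABLOCK to allow minimal logging.
BULK INSERT dbo.TargetClaims
FROM 'C:\load\claims.dat'
WITH (DATAFILETYPE = 'native',
      ORDER (ID),            -- must match the clustered index
      TABLOCK,
      BATCHSIZE = 1000000);  -- commit every million rows
```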
We need to see the actual query code, not just a very brief summary of what the code contains, as well as full table and index definitions. You wrote: "Here's what the INSERT looks like", but that code doesn't even include an INSERT!
The recovery model of the source db is irrelevant.
The destination db log file should be expanded to the size needed for the load before the load begins. This prevents dynamic expansion of the log, which can be a very slow operation.
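For example, pre-sizing the log might look like this. The logical file name comes from the sp_helpfile output mentioned below, and the sizes here are placeholders; pick something near the expected peak for the load:

```sql
-- Pre-grow the destination log before the load so no autogrow
-- (a slow, blocking operation) happens mid-load.
ALTER DATABASE destination_db
MODIFY FILE (NAME = destination_db_log,  -- logical name from sp_helpfile
             SIZE = 64GB,
             FILEGROWTH = 8GB);
```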
It would also be very helpful to have the result of this command:
EXEC [destination_db].sys.sp_helpfile
And, to make sure the log file is not too fragmented, the number of rows of output from this command (NOT the results themselves, just the total number of rows listed):
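The command itself got cut off above; it is presumably the standard VLF count check, where each output row represents one virtual log file and hundreds of rows suggest a fragmented log:

```sql
-- Each row of output = one virtual log file (VLF).
DBCC LOGINFO ('destination_db');

-- Or, on SQL Server 2016 SP2 and later, just the count:
SELECT COUNT(*) AS vlf_count
FROM sys.dm_db_log_info(DB_ID('destination_db'));
```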
Good point. Also worth checking what the autogrowth increment is set to - it might be 1MB ... or 10%, which, on a large database, can be a huge single growth increment.
We fiddle with all our large DBs to pre-configure the log file either to a large size (up to 8GB) or to extend it in 8GB chunks if we need it larger than that (but not an exact multiple of 8GB - although I don't know if that is still an issue).
I see that you are using SSIS ... so a couple of notes on how to improve performance with that system:
On the OLEDB Source - make sure you are using a SQL command to extract the data. This can either be the query itself - or preferably a stored procedure. Using a table or view directly can be extremely slow...
On the OLEDB Destination - make sure you use the option 'Table or view - fast load', check the options 'Table lock' and 'Check constraints', and set the rows per batch to a reasonable size (200000 works well on my systems). Finally, make sure the maximum insert commit size is also set to a reasonable size (again, 200000 works well for my systems).
The maximum insert commit size will issue a commit at that size - so the transaction log can then be marked as reusable and not grow to fill the drive and run out of space.
It may be possible to use the SQL Server Destination instead of the OLEDB Destination - which may (repeat: may) perform better, but there are restrictions. You definitely have to set parameters using the advanced editor - and ensure you set a MaxInsertCommitSize (and other properties as needed) - and you have to ensure that the package is running on the destination node.