TempDB

Hi,

I am running an INSERT statement that loads a table of about 1 billion records. The problem is that TempDB grows to over 600 GB and fills the disk before the query finishes executing.

I checked tempDB, and the details are as follows:

Recovery model: Simple

Logical name: tempdev
Initial size: 8MB
Autogrowth: By 10 percent, unrestricted growth

Logical name: templog
Initial size: 1MB
Autogrowth: By 10 percent, unrestricted growth

Please advise whether I need to do anything to stop tempDB from growing out of control and utilising the entire disk space.

Thanks

Do the insert in batches.

... as I advised in the O/P's previous post, which was about running the Buffer Pool out of memory ...

and then there is another post by the O/P about a "Large Log File" problem ("as a result of large insert statements. Millions of records at one time")

which would also be helped by batch processing of the inserts (either coupled with frequent log backups, or using the Simple recovery model [if that is appropriate])
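The batching pattern being suggested might be sketched like this (table and column names are illustrative, not from the O/P's schema; the key assumption is an ascending key column `ID` to act as a watermark):

```sql
-- Minimal sketch of batched inserts, assuming an ascending key column ID.
DECLARE @LastID INT = 0, @MaxID INT, @Rows INT = 1, @BatchSize INT = 10000;

WHILE @Rows > 0
BEGIN
    -- Find the upper key bound for this batch (last assignment wins)
    SELECT TOP (@BatchSize) @MaxID = ID
    FROM dbo.Source
    WHERE ID > @LastID
    ORDER BY ID;
    SELECT @Rows = @@ROWCOUNT;

    IF @Rows > 0
        INSERT INTO dbo.Target (ID, Payload)
        SELECT ID, Payload
        FROM dbo.Source
        WHERE ID > @LastID AND ID <= @MaxID;

    SELECT @LastID = @MaxID;

    CHECKPOINT;  -- in SIMPLE recovery, lets log space be reused
    -- In FULL recovery, take a log backup here instead, e.g.:
    -- BACKUP LOG MyDb TO DISK = N'D:\Backups\MyDb_log.trn';
END
```

Because each iteration is its own transaction, the log (and any tempdb spill) only ever has to accommodate one batch at a time rather than the whole billion-row insert.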

Thanks Kristen - will doing it in batches reduce the growth of tempDB?

Based on the script provided earlier:

DECLARE @intMinPatientID INT = 0, @intMaxPatientID INT = 0, @intRowCount = 1 -- 1=Force first iteration
	, @intBatchSize int = 100000000	-- Iteration batch size (set to a size that works OK!)
--
WHILE @intRowCount > 0
BEGIN
--
SELECT TOP @intBatchSize
	@intMaxPatientID = PatientID
FROM	dbo.EmisEventExtraction
WHERE	PatientID > @intMinPatientID	-- See note #2
ORDER BY PatientID
SELECT	@intRowCount = @@ROWCOUNT	-- Number of rows found
--
INSERT INTO dbo.EmisEvent (EventGuidDigest,PatientID,ConsultationID,StaffID,EventDate,EventType,MedicalCodeID,Value,Unit,Abnormal)
 SELECT EventGuidDigest,PatientID,ConsultationID,StaffID,EventDate,EventType,MedicalCodeID,Value,Unit,Abnormal 
 FROM dbo.EmisEventExtraction WITH (tablock)
--
WHERE	PatientID > @intMinPatientID AND PatientID <= @intMaxPatientID
ORDER BY PatientID	-- See note #1
--
SELECT	@intMinPatientID = @intMaxPatientID
--
CHECKPOINT
--
END

I receive the following errors at run time;

Msg 102, Level 15, State 1, Line 10
Incorrect syntax near '@intBatchSize'.

Please help

Thanks

Are you running SQL Server 2005 or earlier? Then you need to separate your variable declarations and set statements:

declare @foo int;
set @foo = 42

I am working with SQL Server 2008. Any help?

Thanks

We'll need to see the full script.

You are missing the type for @intRowCount. You need to write:

DECLARE @intMinPatientID INT = 0, @intMaxPatientID INT = 0, @intRowCount INT = 1 -- 1=Force first iteration
	, @intBatchSize int = 100000000	-- Iteration batch size (set to a size that works OK!)
I updated the script as suggested, but I still receive an error:

DECLARE @intMinPatientID INT = 0, @intMaxPatientID INT = 0, @intRowCount INT = 1
	, @intBatchSize INT = 100000000	
--
WHILE @intRowCount > 0
BEGIN
--
SELECT TOP @intBatchSize
	@intMaxPatientID = PatientID
FROM	dbo.EmisEventExtraction
WHERE	PatientID > @intMinPatientID	-- See note #2
ORDER BY PatientID
SELECT	@intRowCount = @@ROWCOUNT	-- Number of rows found
--
INSERT INTO dbo.EmisEvent (EventGuidDigest,PatientID,ConsultationID,StaffID,EventDate,EventType,MedicalCodeID,Value,Unit,Abnormal)
 SELECT EventGuidDigest,PatientID,ConsultationID,StaffID,EventDate,EventType,MedicalCodeID,Value,Unit,Abnormal 
 FROM dbo.EmisEventExtraction WITH (tablock)
--
WHERE	PatientID > @intMinPatientID AND PatientID <= @intMaxPatientID
ORDER BY PatientID	-- See note #1
--
SELECT	@intMinPatientID = @intMaxPatientID
--
CHECKPOINT
--
END

Msg 102, Level 15, State 1, Line 11
Incorrect syntax near '@intBatchSize'.

Two comments, see below:

DECLARE @intMinPatientID INT = 0, @intMaxPatientID INT = 0, @intRowCount INT = 1
	, @intBatchSize INT = 10000
	-- USE A MUCH SMALLER BATCH SIZE. USING A BILLION AS THE BATCHSIZE DEFEATS
	-- WHOLE POINT OF HAVING THE LOOP THAT INSERTS IN BATCHES
--
WHILE @intRowCount > 0
BEGIN
--
-- WHEN YOU USE TOP N YOU NEED BRACKETS IF N IS NOT A LITERAL CONSTANT
SELECT TOP ( @intBatchSize )
	@intMaxPatientID = PatientID
FROM	dbo.EmisEventExtraction
.....

It's because of this part:

SELECT TOP @intBatchSize
@intMaxPatientID = PatientID

Get @intMaxPatientID only. Use TOP @intBatchSize in the INSERT/SELECT.

Thanks Tara.

Please ignore my comment about using TOP N in the first select. You are only trying to find the total number of rows in that select.

Will running this in batches sort out my tempDB space issues?

And secondly, is this OK?

SELECT TOP (@intBatchSize)
@intMaxPatientID = PatientID

The only reason I can think of for tempdb to grow that big is that you have snapshot isolation enabled on the database you are loading the billion+ rows into. In that case, wow, performance is not going to be good, since every insert must effectively be done twice (once into the table, once into the version store in tempdb).
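If you want to check whether either flavour of row versioning is on for the target database, something like this should tell you (the database name is a placeholder):

```sql
-- Check snapshot-isolation settings; versioning writes to tempdb's version store
SELECT name,
       snapshot_isolation_state_desc,   -- ON means ALLOW_SNAPSHOT_ISOLATION is enabled
       is_read_committed_snapshot_on    -- 1 means READ_COMMITTED_SNAPSHOT is enabled
FROM sys.databases
WHERE name = N'MyDb';  -- substitute your database name
```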

At the very least, you should pre-allocate enough log space in your main db and in tempdb to handle two batches. You should also dramatically increase the base size and autogrowth amount for tempdb (their sizes now are ridiculously small). And make sure that IFI (instant file initialization) is on.

--------- main db commands ------------
ALTER DATABASE db_name MODIFY FILE ( NAME = <logical_log_file_name>, SIZE = 2GB, FILEGROWTH = 40MB )
ALTER DATABASE db_name MODIFY FILE ( NAME = <logical_log_file_name>, SIZE = 4GB )
ALTER DATABASE db_name MODIFY FILE ( NAME = <logical_log_file_name>, SIZE = 8GB )
--------- tempdb commands -------------
ALTER DATABASE tempdb MODIFY FILE ( NAME = templog, SIZE = 2GB, FILEGROWTH = 40MB )
ALTER DATABASE tempdb MODIFY FILE ( NAME = templog, SIZE = 4GB )
ALTER DATABASE tempdb MODIFY FILE ( NAME = templog, SIZE = 6GB )
--make sure IFI is on before running these commands!
ALTER DATABASE tempdb MODIFY FILE ( NAME = tempdev, SIZE = 8GB, FILEGROWTH = 100MB )
ALTER DATABASE tempdb MODIFY FILE ( NAME = tempdev, SIZE = 16GB )
ALTER DATABASE tempdb MODIFY FILE ( NAME = tempdev, SIZE = 24GB )
ALTER DATABASE tempdb MODIFY FILE ( NAME = tempdev, SIZE = 32GB )
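On SQL 2008 there is no setting you can query directly for IFI; one common (sysadmin-only) way to verify it is trace flags 3004 and 3605, which make any non-instant file initialization write a "Zeroing" message to the error log. A sketch:

```sql
-- Sketch: with these trace flags on, trigger a small DATA-file growth.
-- If IFI is OFF, a "Zeroing ..." message for the data file appears in the log.
-- (Log files are always zeroed, so a templog message proves nothing.)
DBCC TRACEON (3004, 3605, -1);
-- ...grow a data file slightly here, then search the error log...
EXEC sys.xp_readerrorlog 0, 1, N'Zeroing';
DBCC TRACEOFF (3004, 3605, -1);
```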

No, remove @intBatchSize from that. You want to ONLY get @intMaxPatientID and instead do the TOP stuff in the INSERT/SELECT.

SELECT @intMaxPatientID = PatientID
...

INSERT ...
SELECT TOP @intBatchSize...

I respectfully :slightly_smiling: disagree (although in this particular example it may be moot)

My experience has been that having a potentially complex WHERE clause on the INSERT / SELECT can be slower than putting that logic somewhere else.

In this case there is no obvious unique single-column value, and I have no idea what the clustered index definition is, but normally I would know that of course.

What I have found (in general terms) is:

Preparation step is to SELECT into a #TEMP table the "ranges" for each batch iteration. If the INSERT batch size is, say, 10,000 rows then I want to know what every 10,000th value is. In Step Two I can then insert with a SELECT which has a WHERE MyColumn BETWEEN MyStart AND MyEnd, based on the range values I prepared earlier.

It might be that the criteria for selecting rows to be inserted are complex, and the query might not even have a good index. So rather than repeating that query on every pass through the INSERT loop, I prefer to run it once, outside the loop, into a #TEMP table. That SELECT uses the appropriate WHERE clause and stores only the Clustered Index Key(s) in #TEMP. Ideally it will store only every 10,000th row, ordered by Clustered Index key values, but if that cannot be done easily (or the user does not have the skills to do it) then storing all of the keys, with an IDENTITY, is also OK. Sorted by Clustered Index Key and also having an IDENTITY makes it easy to find Row #1, Row #10,001, etc. to get the limits for each Insert loop iteration.

Then iterate the #TEMP table using that for Start / End values. The INSERT / SELECT will then be processing 10,000 rows, per iteration, and they will be in INSERT Clustered Index Key order.
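A sketch of that approach, using this thread's PatientID as the batching key (ROW_NUMBER is my choice for picking the ranges; Kristen's IDENTITY variant works too). Note that if PatientID is not unique, adjacent ranges can share a boundary value, so in practice you would batch on a unique key or tighten the bounds:

```sql
-- Step 1: single scan to assign each key to a batch, in clustered-key order
SELECT PatientID,
       BatchNo = (ROW_NUMBER() OVER (ORDER BY PatientID) - 1) / 10000
INTO #BatchKeys
FROM dbo.EmisEventExtraction;

-- Step 2: collapse to one Start/End range per batch
SELECT BatchNo,
       StartID = MIN(PatientID),
       EndID   = MAX(PatientID)
INTO #Ranges
FROM #BatchKeys
GROUP BY BatchNo;

-- Step 3: iterate the ranges; each INSERT touches ~10,000 rows in key order
DECLARE @Batch INT = 0, @Start INT, @End INT;
WHILE EXISTS (SELECT 1 FROM #Ranges WHERE BatchNo = @Batch)
BEGIN
    SELECT @Start = StartID, @End = EndID
    FROM #Ranges WHERE BatchNo = @Batch;

    INSERT INTO dbo.EmisEvent (EventGuidDigest, PatientID, ConsultationID, StaffID,
        EventDate, EventType, MedicalCodeID, Value, Unit, Abnormal)
    SELECT EventGuidDigest, PatientID, ConsultationID, StaffID,
        EventDate, EventType, MedicalCodeID, Value, Unit, Abnormal
    FROM dbo.EmisEventExtraction
    WHERE PatientID BETWEEN @Start AND @End;

    SET @Batch += 1;
    CHECKPOINT;
END
```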

That's my theory anyway!

I wasn't suggesting that it's the best solution. I was helping with the syntax error.

OK, I'm confused ... as I thought you were also suggesting moving it to the INSERT statement?

Yes, but to fix the syntax error. I have not looked at the code very closely. I don't normally get the MAX value inside the loop. I get it before the loop and then use math to increment/decrement the variable that's doing the tracking. I think the code that was provided is from another thread.
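The pattern Tara describes might be sketched like this (column names are from this thread; the arithmetic increment is the point - there is no per-iteration scan to find the next bound, though sparse keys mean some batches may insert fewer than @BatchSize rows, or none):

```sql
-- Sketch: get the key bound once, then advance by arithmetic each iteration
DECLARE @MinID INT = 0, @MaxID INT, @BatchSize INT = 10000;
SELECT @MaxID = MAX(PatientID) FROM dbo.EmisEventExtraction;

WHILE @MinID < @MaxID
BEGIN
    INSERT INTO dbo.EmisEvent (EventGuidDigest, PatientID, ConsultationID, StaffID,
        EventDate, EventType, MedicalCodeID, Value, Unit, Abnormal)
    SELECT EventGuidDigest, PatientID, ConsultationID, StaffID,
        EventDate, EventType, MedicalCodeID, Value, Unit, Abnormal
    FROM dbo.EmisEventExtraction
    WHERE PatientID >  @MinID
      AND PatientID <= @MinID + @BatchSize;

    SET @MinID += @BatchSize;  -- advance by math; no MAX scan inside the loop
    CHECKPOINT;
END
```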


I ran the code:

DECLARE @intMinPatientID INT = 0, @intMaxPatientID INT = 0, @intRowCount INT = 1
	, @intBatchSize INT = 100000000	
--
WHILE @intRowCount > 0
BEGIN
--
SELECT TOP (@intBatchSize)
	@intMaxPatientID = PatientID
FROM	dbo.EmisEventExtraction
WHERE	PatientID > @intMinPatientID	-- See note #2
ORDER BY PatientID
SELECT	@intRowCount = @@ROWCOUNT	-- Number of rows found
--
INSERT INTO dbo.EmisEvent (EventGuidDigest,PatientID,ConsultationID,StaffID,EventDate,EventType,MedicalCodeID,Value,Unit,Abnormal)
 SELECT EventGuidDigest,PatientID,ConsultationID,StaffID,EventDate,EventType,MedicalCodeID,Value,Unit,Abnormal 
 FROM dbo.EmisEventExtraction WITH (tablock)
--
WHERE	PatientID > @intMinPatientID AND PatientID <= @intMaxPatientID
ORDER BY PatientID	-- See note #1
--
SELECT	@intMinPatientID = @intMaxPatientID
--
CHECKPOINT
--
END

It was executing for 16 hours with nothing happening - I had to terminate the execution. Is the issue with the script?

Please note that the fields in the query are all normal keys.

EventID is the PK; it's an auto-incrementing value that gets populated when the records are inserted.

Also, the SELECT statement is taking a very considerable time to extract the records:

 SELECT EventGuidDigest,PatientID,ConsultationID,StaffID,EventDate,EventType,MedicalCodeID,Value,Unit,Abnormal 
 FROM dbo.EmisEventExtraction WITH (tablock)

and dbo.EmisEventExtraction is defined as shown below:

 SELECT     EMISNov15DB.dbo.Event.EventGuidDigest, dbo.EmisPatient.PatientID, dbo.EmisConsultation.ConsultationID, dbo.EmisStaff.StaffID, 
                      EMISNov15DB.dbo.Event.EffectiveDateTime AS EventDate, EMISNov15DB.dbo.Event.EventType, EMISNov15DB.dbo.Event.SnomedCTConceptId, 
                      EMISNov15DB.dbo.Event.EmisCode, EMISNov15DB.dbo.Event.ReadCode, EMISNov15DB.dbo.Event.Term, dbo.CPRDLkupMedical.MedicalCodeID, 
                      EMISNov15DB.dbo.Event.NumericValue AS Value, EMISNov15DB.dbo.Event.NumericUnits AS Unit, EMISNov15DB.dbo.Event.IsAbnormal AS Abnormal
FROM         EMISNov15DB.dbo.Event WITH (tablock) INNER JOIN
                      dbo.EmisPatient ON EMISNov15DB.dbo.Event.PatientGuidDigest = dbo.EmisPatient.PatientGuidDigest INNER JOIN
                      dbo.EmisStaff ON EMISNov15DB.dbo.Event.AuthorisingUserRoleGuidDigest = dbo.EmisStaff.StaffGuidDigest LEFT OUTER JOIN
                      dbo.EmisConsultation ON dbo.EmisPatient.PatientID = dbo.EmisConsultation.PatientID AND 
                      EMISNov15DB.dbo.Event.ConsultationGuidDigest = dbo.EmisConsultation.ConsultationGuidDigest LEFT OUTER JOIN
                      dbo.CPRDLkupMedical ON (EMISNov15DB.dbo.Event.SnomedCTConceptId = dbo.CPRDLkupMedical.Snomed OR
                      EMISNov15DB.dbo.Event.SnomedCTConceptId IS NULL AND dbo.CPRDLkupMedical.Snomed IS NULL) AND 
                      ISNULL(CASE WHEN LEFT(EMISNov15DB.dbo.Event.EmisCode, 7) = 'EMISATT' THEN 'EMISATT' WHEN LEFT(EMISNov15DB.dbo.Event.EmisCode, 5) 
                      = 'PCSDT' THEN 'PCSDT' ELSE EMISNov15DB.dbo.Event.EmisCode END, N'(novalue)') COLLATE SQL_Latin1_General_CP1_CS_AS = ISNULL(dbo.CPRDLkupMedical.Emiscode,
                       N'(novalue)') COLLATE SQL_Latin1_General_CP1_CS_AS AND (EMISNov15DB.dbo.Event.SnomedCTDescriptionId = dbo.CPRDLkupMedical.SnomedDescription OR
                      EMISNov15DB.dbo.Event.SnomedCTDescriptionId IS NULL AND dbo.CPRDLkupMedical.SnomedDescription IS NULL) AND ISNULL(EMISNov15DB.dbo.Event.Term, 
                      N'(novalue)') = ISNULL(dbo.CPRDLkupMedical.Term, N'(novalue)') AND ISNULL(CASE WHEN LEFT(EMISNov15DB.dbo.Event.ReadCode, 5) 
                      = 'PCSDT' THEN 'PCSDT' WHEN LEFT(EMISNov15DB.dbo.Event.ReadCode, 7) = 'EMISATT' THEN 'EMISATT' ELSE EMISNov15DB.dbo.Event.ReadCode END, 
                      N'(novalue)') COLLATE SQL_Latin1_General_CP1_CS_AS = ISNULL(dbo.CPRDLkupMedical.Readcode, N'(novalue)') COLLATE SQL_Latin1_General_CP1_CS_AS

Can I update any of the code above to improve performance?

Any help please

Thanks