Disastrously slow SProc .. sometimes

Kristen · April 19, 2017, 7:15am

This is part of an ETL Data Import

Step 1 [does a TRUNCATE and then] pulls a BCP file into an IMPORT table.

Step 2 does a row-by-row comparison of IMPORT table and STAGING table to find any rows that have a change in any column and then deleted the rows. Then an INSERT for any "non existent" rows (which also get current date/time in a LogTimestamp column). (Subsequent ETL SProc then processes "new since last time" rows

The SProc for Step 2 sometimes (maybe 30% of the time) takes a long time. That Sproc is one of many that do the same type of operation. There are about 4 or 5 SProcs that basically have very similar sized IMPORT and STAGING tables and also the number of "changed rows" is very similar. All were mechanically generated so should, to all intents and purposes, be identical. Its only the one SProc that is ever slow.

Normal running of this SProc:

About 2,000,000 rows in the IMPORT table
Typically deletes 1000-2000 rows
Typically inserts 1000-5000 rows

Run time:

On a good day 200 seconds, on a bad day 100,000 !! seconds. There is no "middle ground", either it takes a couple of minutes or it takes more than a day.

I changed the SProc to REORGANISE the indexes on IMPORT before it starts, and to REORGANISE on STAGING after it finishes (ready for next time, and so that downstream ETL have a better time )

Ran in the short time for three days and the last two nights has gone back to 100,000 seconds

Stupidly I forgot the add an Update Statistics after the INDEX REORG , so I'll add that - but can the IMPORT table get so badly out of shape from a TRUNCATE TABLE and then BCP in replacement data (and then REORG the indexes)?

None of the other, similar, processes that handle similar amounts of data has ever run slow, and I can't see that there is anything special about this one.

The process is:

INSERT all rows from IMPORT into ##TEMP

DELETE from ##TEMP any row that is same (all column comparison) as STAGING based on a PKEY JOIN

INSERT INTO STAGING from IMPORT where NOT EXISTS in STAGING

Apart from also freshening up the Stats I could also add a RECOMPILE to the SProc ... although the whole thing is just PKey (clustered) based

Only other thought I have had is to put some "elapsed" time logging in so I can then figure out which sub-query is the slow part.

But is there anything obvious that would explain the change from 200 seconds to 100,000 seconds runtime?

I wonder if something else is going on on the server at that time, its around 02:00, but it seems improbably that THIS Sproc is ALWAYS the slow one - a recurring server-load would be as likely to hit a different Sproc if there was a bit more data one night, and thus each Sproc ran a few seconds later / earlier than normal.

Ifor · April 19, 2017, 9:56am

Is the PK a string? Could the collation of the import table and the main table sometimes be different?

Kristen · April 19, 2017, 10:34am

Thanks for your help

Yup, they are. I pull into ##TEMP without changing the collation - maybe I should fix all the collation at that point so that everything thereafter is same-collation. (That said, ALL the varchar comparisons are forced collation to BINARY in order to catch a change which is only Upper/Lower Case difference, so there is a force COLLATE on every column comparison, so maybe changing the collation from Latin1_General_CP1_CI_AS to SQL_Latin1_General_CP1_CI_AS when pulling the IMPORT data into ##TEMP would not serve much useful purpose. Couldn't hurt ... the data is all going to get re-COLLATEd before it gets into STAGING anyway.

The JOIN is this:

FROM ##TEMP AS S -- This has clustered PKey as does the STAGING Table
		JOIN dbo.STAGING_GP_POP10110 AS D
			 ON D.PONUMBER = S.PONUMBER COLLATE SQL_Latin1_General_CP1_CI_AS
			AND D.ORD = S.ORD --Numeric
			AND D.OtherColumn = S.OtherColumn --Numeric

But ... if this is related to Collation why would it run "fast" 2/3rd of the time, and mega-slow 1/3rd ?

I've put INSERT INTO MyLogTable of time spent in each significant SQL statement, so next time it runs slow I will be able to see which statement took the time.

I've also added WITH RECOMPILE to the Sproc (and added the missing UPDATE STATS after the index REORGs), so those things may actually cure the problem.

Dunno if I can, then, retrospectively see the Query Plan that was used at the time?

I've never been able to reproduce it - perhaps because the EXEC MySProc run from my PC does not use the cached query plan, or somesuch.

Ifor · April 19, 2017, 11:40am

I have noticed that one has to be very careful with data types.

Personally, I would explicitly CREATE ##temp making the datatype, length and collation of PONUMBER exactly the same as PONUMBER in STAGING_GP_POP10110.

My join would then be:

FROM ##TEMP AS S -- This has clustered PKey as does the STAGING Table
	JOIN dbo.STAGING_GP_POP10110 AS D
		ON D.PONUMBER = S.PONUMBER
			AND D.ORD = S.ORD --Numeric
			AND D.OtherColumn = S.OtherColumn --Numeric
			AND D.PONUMBER COLLATE SQL_Latin1_General_CP1_CI_AS = S.PONUMBER COLLATE SQL_Latin1_General_CP1_CI_AS

Kristen · April 19, 2017, 9:41pm

Normally when I CREATE any #TEMP table I explicitly include COLLATE for all varchar ... but not done in this instance. But that said it will be default for the DB and it so i think that should be the same as STAGING (in which case the COLLATE statement in the JOIN is redundant

But I'll double check all that and fix it so that it is correct, thanks.

Still bothered about why this is sometimes fast and sometimes slow - if the COLLATE was a problem surely it would be a problem every time? (and for the other SProcs that basically do the same thing ...)

Ifor · April 20, 2017, 10:25am

I have noticed that sometimes you can get away with type definitions not being quite right and sometimes you cannot. The classic case is declaring a varchar without a length; sometimes it will work out a reasonable length and sometimes it just goes for a length of about 30. We are probably back to your lint idea!

yosiasz · April 20, 2017, 5:20pm

is this a VM. Is it sitting in the same world as other VMS
Is this a physical machine if so have you checked networking? (as in loose ethernet, bad setting on ethernet etc)
Where are the ldf, mdf in relation to other resource intensive processes
What have your network folks said about this? Can they do some networking tracing for you
Is this db sitting with other databases that are hoggers

thinking outside of the db here.
(How is that book coming along?)

Kristen · April 24, 2017, 1:55pm

Yeah, bothered me from Day One how it is set up ... never had SQL on a VM (set up my clients' small-ish-time IT folk) that worked well for SQL ... but I'm an Old Dog and prefer what I am used too ... so may be unfounded criticism. Yes, its in same-world as other VMs

In theory there are no other resource intensive processes at 2AM. Can't speak for the other VMs, but I think it would be astonishing if the timing of other actives managed to hit this particular SProc every day, even though it runs at marginally different times (depending on data volume being transferred that night) ... and change it from a few minutes to 27 hours run time. But there again I have no idea what the actual culprit is ... yet.

There is some (recently added) fancy hardware replication to an offsite DR Server. We were asked to stop backing up anything to the server's own disks because the volume of data transfers was prohibitive - so we now have a network backup which I expect will turn out to be more fragile than local-disks.

They think its our problem!! I think, before network tracing, I ought to get a Query Plan of a 27-hour run and compare it with a query-plan from a 5-minute run and see what is different. IF the query plan (and actual) on a 27-hours run is a disaster then that's what I intend to fix.

Trouble is, I'm struggling to see how a bad plan could arise. Why not a bad plan on the other, very similar, SProcs - why always this one? why not every time? That part seems odd, to me.

I've been away for a few days so not had time to look at this further, but during that time it has not run-slow, the last change I made was to rebuild Stats before the Sproc runs, so maybe that was the issue.

I'm inclined to wait a few days to see if it runs slow again before looking into it further (as in "Its currently working, Client therefore not chasing me ...")

Only our DBs on this server, so I can see the total picture (well ... within this VM).

I'm pleased to hear its going to be a collaboration

Side question: Is 5 minutes to 27 hours runtime due to bad plan, based on poor statistics, a likely scenario?

jeffw8713 · April 25, 2017, 5:58pm

Possible - one thing I would check is the statistics and verify whether or not those statistics were auto updated prior to the 27 hour bad plan. Auto update will use a sampling rate - which could be rather bad if your data is not evenly distributed.

yosiasz · April 27, 2017, 2:27pm

Sorry, I did not mean some other process hitting that specific sproc. I mean some other process hogging up disk and memory at that specific time ==> affects performance of other VMs in that pool depending the way memory and disks are setup. (ignorance in this case is not bliss)

I would lean towards a "not efficiently setup VM server". So in that case a 27-hour period or 20000 Leagues Under the Sea would not reveal an issue with the sproc because the issue would be outside of the sproc, environment specific.

How about this. Any chance you could ask for a dedicated physical sql server with similar setup as the VM, port over all of the data and stuff and setup a test environment to test this out? Setup jobs etc similar to the production system and see what happens?

Jules Verne:

Science, my lad, is made up of mistakes, but they are mistakes which it is useful to make, because they lead little by little to the truth.

Kristen · April 27, 2017, 4:12pm

I put the UPDATE STATS in the start of this specific, slow running, SProc (with FULL SCAN) and it has not run slowly since.

Hadn't considered that, that might well be the answer. Now I have an "operational fix" I'm not in a position to go back and check what the cause was though ...