SQLTeam.com | Weblogs | Forums

Stored Procedure Infinite Loop or Bad Code?

sql2012

#1

Hello,
I have a stored procedure that, when executed, runs for days and does not complete until I stop it. The intent is to update a field in batches of 5000 due to the size and structure of the database.

When I cancel, it shows me that several 5000-row batches completed and asks me to commit. I get no errors or blocking.

Running the update statement by itself works fine.

Please help me determine what the issue is or give me suggestions on a better way to tackle this request.

Here is my code for the SP:
CREATE PROCEDURE [dbo].[UPDATE_INDATE_PARENT_EMAILS_AND_EDOCS_revised]
-- Add the parameters for the stored procedure here
@FieldName VARCHAR(16), @StartDate varchar(15), @EndDate varchar(15), @CustodianIDs varchar(500)
AS

PRINT ('NUMBER OF CUSTODIANS:')
DECLARE @XML AS XML
DECLARE @Delimiter AS CHAR(1) =','
SET @XML = CAST(('<X>' + REPLACE(@CustodianIDs, @Delimiter, '</X><X>') + '</X>') AS XML)
DECLARE @cust_tbl TABLE (custodianid INT)
INSERT INTO @cust_tbl
SELECT N.value('.', 'INT') AS custodianid FROM @XML.nodes('X') AS T(N)

Declare @cust_id varchar(20)
Declare @update_sql varchar(max)

declare cust_curs cursor for
SELECT * FROM @cust_tbl
ORDER BY 1
open cust_curs
fetch next from cust_curs into @cust_id
while @@FETCH_STATUS = 0

BEGIN -- custodian loop

set @update_sql = '

SET NOCOUNT ON;

Select ID into #DOCS from tbldoc (nolock) where custodianid = ' + @CUST_ID + '

SET NOCOUNT OFF;

WHILE (1=1)
BEGIN

BEGIN TRANSACTION

update TOP(5000) tbldoc set ' + @fieldname + ' = 1
from dbo.tbldoc d (nolock)
join #docs d2 on d2.id = d.id
where (
(d.AttachPID in
(select d3.AttachPID from tblDoc d3 (nolock)
join #docs d4 on d4.id = d3.id
where isnull(d3.Exclude,0) = 0 and d3.AttachPID > 0 and d3.AttachLvl = 0 and
(d3.datesent between ''' + @StartDate + ''' and ''' + @EndDate + ''' or
d3.datelastmod between ''' + @StartDate + ''' and ''' + @EndDate + ''' or
d3.datecreated between ''' + @StartDate + ''' and ''' + @EndDate + ''')))
or
(isnull(d.Exclude,0) = 0 and d.AttachPID = 0 and d.AttachLvl = 0 and
(d.datesent between ''' + @StartDate + ''' and ''' + @EndDate + ''' or
d.datelastmod between ''' + @StartDate + ''' and ''' + @EndDate + ''' or
d.datecreated between ''' + @StartDate + ''' and ''' + @EndDate + ''')))

   IF @@ROWCOUNT = 0 
        BEGIN 
        COMMIT TRANSACTION 
        BREAK 
    END 
COMMIT TRANSACTION 

END
'

PRINT ('UPDATED INDATE FOR CUSTODIANID ' + @CUST_ID)

exec (@update_sql)

IF OBJECT_ID('tempdb..#DOCS', 'U') IS NOT NULL
DROP TABLE #DOCS;

FETCH NEXT FROM cust_curs INTO @cust_id
END

CLOSE cust_curs
DEALLOCATE cust_curs

GO
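(For reference, here is a standalone sketch of the XML-based comma split used at the top of the procedure. The `<X>` element tags are what `.nodes('X')` looks for; if the REPLACE does not wrap each value in them, the split returns no rows.)

```sql
-- Standalone sketch of the XML-based comma split (example input value).
DECLARE @CustodianIDs varchar(500) = '101,102,103';
DECLARE @XML xml =
    CAST('<X>' + REPLACE(@CustodianIDs, ',', '</X><X>') + '</X>' AS xml);

SELECT N.value('.', 'INT') AS custodianid
FROM @XML.nodes('X') AS T(N);
-- Returns three rows: 101, 102, 103
```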


#2

A few observations:

  1. Since you really only have the one data statement in the WHILE loop, you don't need an explicit transaction. Also, you never roll back a failed transaction, so what's the point?
  2. The WHILE loop itself is infinite; you are perhaps assuming that it will walk through the "next 5000" every time, but it won't. You'll need to add logic to the WHERE clause so it looks at a different data set on each iteration. Which 5000 records get operated on is undefined, but it is most likely the same ones every time.
  3. Having the ability to use a dynamic column name for the UPDATE sounds like a good idea, but it really isn't. In the long run you'd be better off having a separate procedure for each column you want operated on and letting the caller decide which procedure to invoke.

HTH


#3

Thank you @stephen_hendricks for your reply. Much appreciated.

This is not a clear-cut scenario where I have complete control over all of the... to use the term loosely... variables. Here is what I mean.

  1. The database this reads from and updates is the backend of a poorly designed application. To add to that (I know it does not sound like much), it has 70 million rows and growing...
  2. Using a dynamic column here is my only option: on the front end, users create a new field every time this update needs to be performed, so with every execution of this stored procedure the field name will be different.
  3. Most importantly, because of the size and poor structure of the database, the field being updated is not indexed, since it takes about two hours to index new fields that are added. So I am trying to avoid using that field in a SELECT statement.

With that said, do you, or does anyone, have suggestions on how to get around this infinite loop problem and still accomplish the updates without repeatedly going back to this huge table (tbldoc) to check?
P.S. The update of the field is conditional, so there will always be records where the field is NULL.
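(Since the dynamic column name apparently can't be avoided, one defensive sketch, assuming the column name is caller-supplied, is to validate and quote it before splicing it into dynamic SQL. COL_LENGTH, QUOTENAME, and sp_executesql are standard SQL Server; the column name and the WHERE clause below are placeholders.)

```sql
-- Hypothetical hardening sketch for a caller-supplied column name.
DECLARE @FieldName sysname = N'SomeNewField';  -- example value
DECLARE @sql nvarchar(max);

-- Reject names that are not actual columns of dbo.tbldoc
IF COL_LENGTH('dbo.tbldoc', @FieldName) IS NULL
    THROW 50000, 'Column does not exist on dbo.tbldoc', 1;

-- QUOTENAME brackets the name so it cannot inject extra SQL
SET @sql = N'UPDATE dbo.tbldoc SET ' + QUOTENAME(@FieldName) + N' = 1 WHERE ...';
-- EXEC sp_executesql @sql;  -- run once the WHERE clause is filled in
```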


#4

As you are already aware, this design is not to be desired. I'll take your word that someone thought having the users add fields to the table via the front end was a good idea. You are now tasked with taking what is and making it work.
Having said that, it only addresses my third point; the first two are still valid, in particular the second. The UPDATE inside the WHILE (1=1) loop is never going to come back with zero records, as far as I can see, so it's an infinite loop. It may be updating the same records, but it's always going to update something. There is nothing in the code that will make it walk through your table and eventually run out of records to update. I'm suggesting that you find some attribute in your table to use for this purpose. Maybe it's a date, maybe it's the primary key, maybe it's some other value. Find the minimum value of that attribute and update in chunks. Something along the lines of:

declare @MyPK_Min int = 1, @MyPK_Max int; -- Outside the loop

select @MyPK_Min = min(MyPK), @MyPK_Max = max(MyPK)
from tbldoc;

while @MyPK_Min <= @MyPK_Max
begin
    Update (delete the "top 5000") (all your stuff)
    where (your stuff)
    AND MyPK between @MyPK_Min and (@MyPK_Min + 5000)

    -- Do not test for @@Rowcount = 0 since some ranges may not cause records to be updated

    set @MyPK_Min = @MyPK_Min + 5000;
end

Also, you might play around with the value of 5000. I suspect that a larger value will perform better. Perhaps 100000? But let's get it working first.


#5

@stephen_hendricks
I took your brilliant recommendation and did the following, which not only resolved my issue but also completed in record time. Thank you very much!!

DECLARE @Min_ID INT,
        @Max_ID INT;

SELECT @Min_ID = MIN(id),
       @Max_ID = MAX(id)
FROM tbldoc (NOLOCK)
WHERE custodianid = @cust_id
    AND ISNULL(exclude, 0) = 0

WHILE @Min_ID <= @Max_ID
BEGIN
    UPDATE TOP(100000) tbldoc
    SET @fieldname = 1
    FROM dbo.tbldoc d
    WHERE d.id BETWEEN @Min_ID AND (@Min_ID + 100000)
        AND d.custodianid = @cust_id
        AND ( ( d.attachpid IN (SELECT d3.attachpid
                                FROM tbldoc d3
                                WHERE d3.id BETWEEN @Min_ID AND (@Min_ID + 100000)
                                    AND d3.custodianid = @cust_id
                                    AND ISNULL(d3.exclude, 0) = 0
                                    AND d3.attachpid > 0
                                    AND d3.attachlvl = 0
                                    AND ( d3.datesent BETWEEN @StartDate AND @EndDate
                                          OR d3.datelastmod BETWEEN @StartDate AND @EndDate
                                          OR d3.datecreated BETWEEN @StartDate AND @EndDate )) )
              OR ( ISNULL(d.exclude, 0) = 0
                   AND d.attachpid = 0
                   AND d.attachlvl = 0
                   AND ( d.datesent BETWEEN @StartDate AND @EndDate
                         OR d.datelastmod BETWEEN @StartDate AND @EndDate
                         OR d.datecreated BETWEEN @StartDate AND @EndDate ) ) )

    SET @Min_ID = @Min_ID + 100000;

END


#6

Kick that NOLOCK habit ... it's going to bite you in the behind ...


#7

Wow....You're welcome.

(You can still remove that "TOP(100000)" from the UPDATE statement. It's not necessary and it will only slow things down.)

Glad to have been of service...


#8

If it's running fast enough then there's no need for a change, but with a complex WHERE clause we've generally found it faster to first capture the PKeys of all the rows matching the complex WHERE clause into a temp table with an IDENTITY column, and then use a @Parameter to step through IDENTITY ranges of that temp table, JOINing to the actual table on the PKey. Re-querying the main table repeatedly for the next range (i.e. re-evaluating all the complex WHERE conditions each time) can take significant time to process; for example, a table scan may be required on every loop.

Actually, instead of the PKey the clustered index keys need to be used, in the event that they are different from the PKey.


#9

Something like this is what I had in mind:

CREATE TABLE #TEMP
(
	T_ID int IDENTITY(1,1) NOT NULL,
	ID int NOT NULL, -- The Clustered Index Key(s)
	PRIMARY KEY
	(
		T_ID
	)
)

INSERT INTO #TEMP(ID)
SELECT d.ID
-- Original WHERE clause here:
FROM dbo.tbldoc d 
WHERE d.custodianid = @cust_id 
	AND (
		( d.attachpid IN 
			(SELECT d3.attachpid 
			FROM tbldoc d3 
			WHERE d3.custodianid = @cust_id 
				AND Isnull(d3.exclude, 0) = 0 
				AND d3.attachpid > 0 
				AND d3.attachlvl = 0 
				AND ( d3.datesent BETWEEN 
						@StartDate AND @EndDate 
					OR d3.datelastmod BETWEEN 
						@StartDate AND @EndDate 
					OR d3.datecreated BETWEEN 
						@StartDate AND @EndDate 
					)
			) 
		) 
		OR ( Isnull(d.exclude, 0) = 0 
			AND d.attachpid = 0 
			AND d.attachlvl = 0 
			AND ( d.datesent BETWEEN @StartDate AND @EndDate 
				OR d.datelastmod BETWEEN @StartDate AND @EndDate 
				OR d.datecreated BETWEEN @StartDate AND @EndDate 
				) 
		)
	)
--
ORDER BY d.ID -- Sort by the Clustered Index Key(s)


DECLARE @LoopID int = 1,
	@LoopSize int = 100000,
	@RowCount int = 1 -- Force first loop
WHILE @RowCount > 0
BEGIN 
	UPDATE U
	SET @fieldname = 1 
	FROM #TEMP AS T
		JOIN tbldoc AS U
			ON U.ID = T.ID
	WHERE	    T_ID >= @LoopID
		AND T_ID <  @LoopID + @LoopSize
	SELECT @RowCount = @@ROWCOUNT, @LoopID = @LoopID + @LoopSize
END