SQL 2012 Updates/Joins pulling incorrect data

ITTrucker · February 23, 2017, 9:02pm

Terrible title, but we have had some instances of various SQL stored procs/cursors/SSIS packages that do something wrong that we cannot replicate. Is there any known data integrity issue that could potentially arise with SQL 2012 running on a VMWare machine with an old slow crummy SAN (V-Motioning between resource groups), and Veeam backups running their Explorer for Microsoft SQL Server?

Here are a few examples:

We have a SSIS package that takes a file and copies it to two servers and then imports into two separate databases and then runs a bunch of stored procedures to update the data. There is different amount of data in the affected tables between the two servers/databases, but the schemas and stored procs are the same. A few times, one of the records in one database was updated incorrectly (records matched on criteria that was not matching), but it was fine in the other database. The job runs overnight and the discrepancy was caught by another job right after so it was not a user issue.
There was a stored procedure that had a cursor that looked something like:
DECLARE @Groups VARCHAR(5);
DECLARE curG CURSOR FOR
SELECT ID FROM Group where SomeBitField = 1
open curG
FETCH NEXT FROM curG INTO @Groups
WHILE @@FETCH_STATUS = 0 BEGIN
--Do stuff WHERE Group = @Groups

And it proceeded to update a group where SomeBitField = 0. These fields only exist in the database, there is no front end for them, SomeBitField for that group has always been 0.

We have an SSIS job that runs after business hours that has a series of SQL commands. One of them creates some records if they match a criteria and then sends an email if there were records created. The job ran and created the records and sent the email, but the records shouldn't have been created. Our IT team has a backup of the database from an hour before the job ran and the job ran against that and didn't create records, I have a backup file that I restored to a training environment and it didn't create the records, and when I run the queries on the server where the records were created, it doesn't flag any new records.

So it's a mix of SSIS jobs, cursors, stored procedures, and while it's rare, it's concerning. Any time that we have seen these, if the job/process is re-run in a backup or training environment, it never fails so we can't repeat the issue.

It's not the same tables affected and it's not every time, it seems sporadic. I'm concerned that there might be some issue with slow congested disk while Veeam is trying to "quiesce" the database for a backup and VMWare is VMotioning the server, and there's some sort of indexing issue or other thing that's happening and data is changing out of the normal running database?

Kristen · February 25, 2017, 10:57am

Is there anything in the code you are referring to, or ANY OTHER CODE that might be running at the same time, that is using NOLOCK? (If you are not sure you could run SQL Profiler during the overnight time interval and (I think) a FILTER for NOLOCK should be 100% accurate in finding any occurrences within SQL statement [inside SProcs] (except perhaps if WITH ENCRYPTION is used on any SProcs). Would need testing though to be sure that all SQL statements, either adhoc or within SProcs, get "monitored"

Are you using Read Committed Snapshot and your query is referring to a copy, of the same row, that was just updated / selected / whatever in the same session? In such instances you are likely to need a LOCK between the first and last references to that row.

ITTrucker · February 27, 2017, 4:48pm

There's only one stored procedure in the database with a NOLOCK on it it's a aspnet proc which is hitting 3 aspnet_ tables and looks like it's never been run (at least not since last reboot a few weeks ago).

I don't see any stored procedures with the text '%ISOLATION%' or '%COMMITTED%' in them, the database is set to the default Is Read Commited Snapshot On : True

The problem doesn't occur every night so it's tough to pinpoint. Everything that runs overnight is scheduled at a specific time, no folder pollers or anything that is trigger based, and we try to stagger jobs as much as possible to prevent any issues so I would think we would be able to replicate the issue, but we can't.

Kristen · February 28, 2017, 9:22am

If you, say, update a Row and then use that row in a subsequent statement the original value (at the start of the batch) will be used (unless you provide a locking hint). So that could be a cause. It would be reproducible though.

Could that be the problem? (Seems slightly improbable as it would be consistent, and therefore more likely to occur every night).

NOLOCK is much more likely to cause "unexplained & unrepeatable" outcomes. I have an alert here on 601 "Data movement during NOLOCK". My experience is that this will ONLY trigger on a small number of NOLOCK conflicts (the rest are the dangerous ones where a query sees some index entries twice, or misses some altogether), but if you put an Alert on that AND you get any hits then NOLOCK is definitely present AND causing side-effect problems. Silence tells you nothing though, unfortunately

If you restore a backup, from before an update that did have an error-outcome, onto a test server and run the batch update there do you get the same wrong-data? That would rule out/in the effects of concurrent processes, so would isolate whether it is interference (hard to reproduce) or something reliably reproducible and thus fixable in coding.