HELP - Consistency Errors on Database - SQL 2014

Hi Team SQLTeam

We have a database that started getting consistency errors over the last week.
After checking that all the hardware was okay, we ran a DBCC CHECKDB
and got about 130 consistency errors across multiple tables.

We loaded the data into a test database, ran DBCC CHECKDB ('DatabaseName', REPAIR_ALLOW_DATA_LOSS) and then tested the application. This seemed to work: reports that would previously not run were working again, and when we ran DBCC CHECKDB afterwards it came back with zero allocation errors and zero consistency errors.

As this had worked in test, we got the okay to do the same in the production environment over the weekend.
We put the production DB in single-user mode, ran the repair, did a CHECKDB and everything came up clear.
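For reference, the commands involved were roughly as follows (database name changed, and the exact options may have differed slightly):

-- Kick out other connections and take the database to single-user mode
ALTER DATABASE [DatabaseName] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

-- Run the repair (this repair level is allowed to discard data to fix the structures)
DBCC CHECKDB ('DatabaseName', REPAIR_ALLOW_DATA_LOSS);

-- Put the database back into normal use
ALTER DATABASE [DatabaseName] SET MULTI_USER;

-- Re-check
DBCC CHECKDB ('DatabaseName') WITH NO_INFOMSGS, ALL_ERRORMSGS;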

Back in the office today, after spending a Sunday wondering what Monday was going to be like...
I received a few calls regarding SQL error messages from the application:
SQL Error Code: 37000

I ran CHECKDB again and noted that there were errors in 2 tables [one table had the same number of errors as before the repair; the other had 6 error messages, down from 62 pre-repair]. The other tables that previously had errors were no longer reported.

The error messages relate to consistency errors and read as follows:

BARTAG TABLE
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:1307959) allocated to object ID 2116918613, index ID 1, partition ID 72057620090847232, alloc unit ID 72057621783576576 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:1307960) allocated to object ID 2116918613, index ID 1, partition ID 72057620090847232, alloc unit ID 72057621783576576 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:1307961) allocated to object ID 2116918613, index ID 1, partition ID 72057620090847232, alloc unit ID 72057621783576576 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:1307962) allocated to object ID 2116918613, index ID 1, partition ID 72057620090847232, alloc unit ID 72057621783576576 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:1307963) allocated to object ID 2116918613, index ID 1, partition ID 72057620090847232, alloc unit ID 72057621783576576 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:1308036) allocated to object ID 2116918613, index ID 1, partition ID 72057620090847232, alloc unit ID 72057621783576576 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:1308037) allocated to object ID 2116918613, index ID 1, partition ID 72057620090847232, alloc unit ID 72057621783576576 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:1308038) allocated to object ID 2116918613, index ID 1, partition ID 72057620090847232, alloc unit ID 72057621783576576 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 2116918613, index ID 1, partition ID 72057620090847232, alloc unit ID 72057621783576576 (type In-row data). Page (1:1307959) was not seen in the scan although its parent (1:1298359) and previous (1:1307958) refer to it. Check any previous errors.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 2116918613, index ID 1, partition ID 72057620090847232, alloc unit ID 72057621783576576 (type In-row data). Index node page (1:1298359), slot 290 refers to child page (1:1307960) and previous child (1:1307959), but they were not encountered.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 2116918613, index ID 1, partition ID 72057620090847232, alloc unit ID 72057621783576576 (type In-row data). Index node page (1:1298359), slot 291 refers to child page (1:1307961) and previous child (1:1307960), but they were not encountered.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 2116918613, index ID 1, partition ID 72057620090847232, alloc unit ID 72057621783576576 (type In-row data). Index node page (1:1298359), slot 292 refers to child page (1:1307962) and previous child (1:1307961), but they were not encountered.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 2116918613, index ID 1, partition ID 72057620090847232, alloc unit ID 72057621783576576 (type In-row data). Index node page (1:1298359), slot 293 refers to child page (1:1307963) and previous child (1:1307962), but they were not encountered.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 2116918613, index ID 1, partition ID 72057620090847232, alloc unit ID 72057621783576576 (type In-row data). Page (1:1307964) is missing a reference from previous page (1:1307963). Possible chain linkage problem.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 2116918613, index ID 1, partition ID 72057620090847232, alloc unit ID 72057621783576576 (type In-row data). Page (1:1308036) was not seen in the scan although its parent (1:1298359) and previous (1:1308035) refer to it. Check any previous errors.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 2116918613, index ID 1, partition ID 72057620090847232, alloc unit ID 72057621783576576 (type In-row data). Index node page (1:1298359), slot 367 refers to child page (1:1308037) and previous child (1:1308036), but they were not encountered.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 2116918613, index ID 1, partition ID 72057620090847232, alloc unit ID 72057621783576576 (type In-row data). Index node page (1:1298359), slot 368 refers to child page (1:1308038) and previous child (1:1308037), but they were not encountered.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 2116918613, index ID 1, partition ID 72057620090847232, alloc unit ID 72057621783576576 (type In-row data). Page (1:1308039) is missing a reference from previous page (1:1308038). Possible chain linkage problem.
DBCC results for 'BARTAG'.
There are 925546 rows in 102961 pages for object "BARTAG".
CHECKTABLE found 0 allocation errors and 18 consistency errors in table 'BARTAG' (object ID 2116918613).
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKTABLE (DBNAME.dbo.BARTAG).
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

MVTSTK TABLE
Msg 8939, Level 16, State 98, Line 1
Table error: Object ID 70291310, index ID 4, partition ID 72057620278804480, alloc unit ID 72057621984509952 (type In-row data), page (1:3636015). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 133129 and -4.
Msg 8928, Level 16, State 1, Line 1
Object ID 70291310, index ID 4, partition ID 72057620278804480, alloc unit ID 72057621984509952 (type In-row data): Page (1:3636015) could not be processed. See other errors for details.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:3636016) allocated to object ID 70291310, index ID 4, partition ID 72057620278804480, alloc unit ID 72057621984509952 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 70291310, index ID 4, partition ID 72057620278804480, alloc unit ID 72057621984509952 (type In-row data). Page (1:3636015) was not seen in the scan although its parent (1:3634670) and previous (1:3636014) refer to it. Check any previous errors.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 70291310, index ID 4, partition ID 72057620278804480, alloc unit ID 72057621984509952 (type In-row data). Index node page (1:3634670), slot 8 refers to child page (1:3636016) and previous child (1:3636015), but they were not encountered.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 70291310, index ID 4, partition ID 72057620278804480, alloc unit ID 72057621984509952 (type In-row data). Page (1:3636017) is missing a reference from previous page (1:3636016). Possible chain linkage problem.
DBCC results for 'MVTSTK'.
There are 9338432 rows in 491502 pages for object "MVTSTK".
CHECKTABLE found 0 allocation errors and 6 consistency errors in table 'MVTSTK' (object ID 70291310).
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKTABLE (NEUMANN.dbo.MVTSTK).
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

I have read a bit about these error messages online, but have become overwhelmed by the amount of information available.
I am essentially a greenhorn in the world of SQL and inherited the database as part of the multifaceted role I am in.
The database has been running for a year without any SQL errors like the above, and I am a bit lost on what to do.

As the initial repair has not worked that well, I do not want to keep using REPAIR_ALLOW_DATA_LOSS as a band-aid for the rest of the life of the DB. There have also been a lot of late nights troubleshooting, reading and trying to get my head around the issue.
One site recommended rebuilding the index with the index ID mentioned in the error message; I tried this and had no luck.
I restored the backup [to a test instance] that was taken an hour after the REPAIR_ALLOW_DATA_LOSS, ran CHECKDB again and noted that the errors were there.
Did I mention that immediately after the repair I ran a CHECKDB and got no errors?

I am hoping to be able to fix what we have, as the production DB runs 24/6.5, so I am not keen to tell management that we might have to go back in time almost 10 days or more!

One other thing to note:
I ran DBCC CHECKDB ([TESTINGDB]) WITH NO_INFOMSGS, ALL_ERRORMSGS on the test database that was repaired (and has been used for training since the repair) and got no error messages.

All up, I am confused and overwhelmed, and would appreciate it if someone could assist with this issue or point me in the right direction.

Kind Regards
Dan

When was the last clean backup taken?

Were you able to figure out what data was "lost"? Is it a problem that some data was lost, or are the users able to recreate it? If it is just indexes, for example, then a Drop/Create of the index might be enough of a fix. If it is actual user data records then it would be necessary to figure out which records were lost [IME that would be No Mean Feat :frowning: ] and have the users recreate them.

If User Data Records have been lost that might mean that Referential integrity, between tables, has become invalid. SQL won't be checking for that - it did the checks when the records were Inserted/Updated/etc. and won't be expecting that something just "disappeared", so that sort of issue would be liable to make any APPs fall over - they too won't be expecting some associated data to go missing (e.g. you've got a Client Invoice but no Client Name record any more - I expect you get the idea :slight_smile: )

Is the database running in FULL Recovery Model? I would hope so for a 24/7 DB. If so it would have been better to:

Restore last FULL backup onto your TEST machine. Check consistency on that. If THAT is broken then repeat with an earlier Full backup UNTIL you find one that IS clean.

Apply EVERY Log backup IN ORDER since that FULL backup was taken.

(If you also have differential backups you can apply the ones taken after the FULL backup (i.e. up to the next Full backup was taken) to see if a later Differential backup is still clean. If you can restore a "much later" differential backup it reduces the number of LOG backups that you have to restore)

Repeat the DBCC. If that had been clean (and IME there is a better than 90% chance that it would have been) then you could have arranged to do that on LIVE to repair it with zero data loss (but some short downtime).

NOTE: When doing this for real, on Live, you can take one last "tail" log backup and include that in your restore, if that all restores clean then you will have zero data loss.
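As a rough sketch (all the file names, paths and logical names below are made up), the sequence on the Test server would be something like:

-- Restore the last known-good FULL backup, leaving the database in the restoring state
RESTORE DATABASE [TestRestore]
    FROM DISK = N'D:\Backups\DBNAME_Full.bak'
    WITH MOVE N'DBNAME_Data' TO N'D:\Data\TestRestore.mdf',
         MOVE N'DBNAME_Log'  TO N'D:\Data\TestRestore_log.ldf',
         NORECOVERY, STATS = 10;

-- Optionally restore the latest DIFFERENTIAL taken after that FULL
RESTORE DATABASE [TestRestore]
    FROM DISK = N'D:\Backups\DBNAME_Diff.bak'
    WITH NORECOVERY;

-- Apply every LOG backup since then, in order
RESTORE LOG [TestRestore] FROM DISK = N'D:\Backups\DBNAME_Log_01.trn' WITH NORECOVERY;
RESTORE LOG [TestRestore] FROM DISK = N'D:\Backups\DBNAME_Log_02.trn' WITH NORECOVERY;
-- ...one per log backup, in sequence...

-- Bring the database online once everything has been applied
RESTORE DATABASE [TestRestore] WITH RECOVERY;

-- And check it
DBCC CHECKDB ([TestRestore]) WITH NO_INFOMSGS, ALL_ERRORMSGS;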

Has DBCC CHECKDB been run regularly in that time? If not the errors may have been there all along and just not surfaced :frowning:

My money is on a hardware fault on the primary machine. If that is the case then, sadly, it will reliably corrupt the database every time the hardware does its faulty-thing.

There's good news, bad news and semi-good news:

The good news is that the MVTSTK table errors will be easy to correct.
The bad news is that the BARTAG table errors will not be. The semi-good news is that the suspect_pages table can give you some idea of when the errors were last detected.

/* for MVTSTK */
SELECT name FROM sys.indexes WHERE object_id = 70291310 AND index_id = 4;
/* Use that index name to script out the index, drop it and recreate it.
   That should fix that index/table. */

/* for BARTAG */
SELECT * FROM msdb.dbo.suspect_pages WHERE database_id = DB_ID('DatabaseName')

Before you move forward with trying to fix this, you need to track down the hardware issues that are causing the problems. You stated you ran the integrity check with REPAIR_ALLOW_DATA_LOSS - and then ran it again with no errors.

Now you have errors again - which tells me you definitely have an IO/Hardware issue that is causing these problems.

If you do not address the hardware issues you will continue to run into issues with corruption and lost data. I highly recommend that whatever you do - once you have a clean database with no errors you need to get a clean backup immediately. Without that you will continue to be at risk of losing data because you cannot restore to a known clean backup of the database.
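For example (path and name below are placeholders), a verified full backup could be taken like this:

-- Full backup with a backup checksum; any existing page checksums are verified as the pages are read
BACKUP DATABASE [DatabaseName]
    TO DISK = N'D:\Backups\DatabaseName_CleanFull.bak'
    WITH CHECKSUM, INIT, STATS = 10;

-- Confirm the backup file itself is readable and internally consistent
RESTORE VERIFYONLY
    FROM DISK = N'D:\Backups\DatabaseName_CleanFull.bak'
    WITH CHECKSUM;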

Hi Scott

Thanks so much for your response.

Just heading back into work, so will try this as soon as I get in

Will let you know how it goes.

Very, very much appreciated.

Kind Regards

Dan

Thanks Kristen

Much appreciated.
I'll have to get our hardware guy to confirm again that all is okay on the hardware side.
DBCC CHECKDB was run every Tuesday, Thursday and Sunday as a SQL maintenance task.
Looking through the job history, the first error appears on the 26th of Jan, which is in line with the last known good backup.

I was not able to work out what data was lost; the report after the repair only mentioned repairs, not losses.
Yes, the DB is in FULL recovery.
I will start reading up on how to restore the differentials...

I note that our test DB, which is on a different server and was repaired, is still not getting any errors a week after the repair. I am waiting to see what a new review of the production hardware yields.

Thanks for the quick response
Much appreciated
This land of SQL is a big learning curve.

Hi Scott

I have run the scripts.
The MVTSTK query gives the following:
1 DBNAME_MVCLE03

I have found this index on the table,
and am looking up how to script it out, drop it and recreate it.

The BARTAG query gives the information below.

The hardware team have gone through the SAN logs and noted that there was a power issue on the SAN around the dates in question, and they have moved the data to a different storage location.

Off to Test to do some testing...

Hi Scott
Sorry, feeling very needy here
I looked up on Google how to script out, drop and recreate an index and am a bit perplexed.
I did the following in SSMS:
Navigated to the MVTSTK table, Indexes - found MVCLE03
Right-clicked it and selected the option to script it out

and then ran the script:
USE [ASTOCKTEST]
GO

/****** Object: Index [DBNAME_MVCLE03] Script Date: 07/02/17 2:28:31 PM ******/
DROP INDEX [DBNAME_MVCLE03] ON [dbo].[MVTSTK]
GO

/****** Object: Index [NEUMANN_MVCLE03] Script Date: 07/02/17 2:28:31 PM ******/
CREATE UNIQUE NONCLUSTERED INDEX [DBNAME_MVCLE03] ON [dbo].[MVTSTK]
(
[PREV_DATE] ASC,
[ITEM_CODE] ASC,
[CREAT_NUM] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

This cleared the consistency errors when running DBCC CHECKTABLE (MVTSTK) afterwards.
I am just verifying that this process was correct.

Thanks in advance.

Dan

You don't need differentials, but if you have them they would reduce the number of LOG backups you have to restore.

Here's how I see it:

  1. You could, retrospectively, figure out what data was lost. Then decide if/how that data should be re-created (you might, for example, just INSERT the missing records from a clean, restored test database into the Production database, or you might print them out and have Data Entry folk recreate them)

  2. You should get a clean parallel database running on Test, and keep it in sync with Production, so that if/when Production fails, or once the hardware is known to be fixed, you can use that to reinstate a clean copy on Production. With luck you will be able to do that with zero data loss.

We need a date/time for when the Repair_Allow_Data_Loss was done. I'll assume 04-Jan-2017 10:00 for the purposes of this explanation.

  1. Finding the lost data from Repair_Allow_Data_Loss

Find the first known-good Full backup before that date, and all LOG backups after that date (If you have a DIFF backup then great, if not [or if you don't "get" DIFF backups] forget about that bit).

Restore the FULL (then the DIFF, if you have one) to your Test server. Run DBCC CHECKDB to check that the restored Full backup is clean. If not, go back to an earlier one. Once you find a clean one, restore it again but using NORECOVERY so that you can restore more files, and this time also restore all subsequent LOG backups. On the final LOG restore you may have to use STOPAT = '20170104 09:59'. (If you have continued to use NORECOVERY on all Log restores you will have to do one last Restore using RECOVERY to make the database "active".)
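That final STOPAT restore looks roughly like this (file name is a placeholder, and [TEST1] stands for whatever you call the restored copy):

-- Final log restore, stopped just before the repair was run on Production
RESTORE LOG [TEST1]
    FROM DISK = N'D:\Backups\DBNAME_Log_Final.trn'
    WITH STOPAT = '20170104 09:59', NORECOVERY;

-- Then bring the database online
RESTORE DATABASE [TEST1] WITH RECOVERY;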

Run a DBCC CHECKDB on the Test DB to see if it is clean. It may be that the hardware error has trashed the LOG backups too; if that's the case then, sorry, all bets are off.

If there are no errors at that point you have a clean database as-of 04-Jan-2017 10:00

Also restore a second database, under a different name, from AFTER you did the Repair_Allow_Data_Loss (as soon after 10:00 as possible would be good). I am not sure whether Log backups will restore across a Repair_Allow_Data_Loss; if not, that will have to be the next Full backup after that point. If log backups are possible then use the above procedure to restore (to your second test database) with STOPAT = '20170104 10:05' (i.e. a time when DBCC had finished repairing).

You now have two test databases from about the same point in time: TEST1 is clean and TEST2 has the data loss. You can construct some queries to find all the missing data records and figure out a way to re-enter them, or just SELECT & INSERT them back into Production.
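For example, something like this (run on the Test instance, assuming both restored databases are on the same instance and the table's columns are all of comparable types) would list the BARTAG rows present in TEST1 but missing from TEST2:

-- Rows that exist in the clean restore but not in the post-repair copy
SELECT * FROM TEST1.dbo.BARTAG
EXCEPT
SELECT * FROM TEST2.dbo.BARTAG;
-- If any columns are of types EXCEPT cannot compare (text, image, xml),
-- list the comparable columns explicitly instead of using *

Repeat per affected table; the results can then be INSERTed back into Production or handed to the users to re-key.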

If you cannot do this straight away I suggest you "secure" all the backup files (before they are deleted / purged / etc.) that you would need in order to be able to do this later. In order to do that you may have to restore FULL Backups (to test) to find which one is definitely the most recent "known good" full backup, then secure all the Log backups after that, plus the first Full Backup after the Repair_Allow_Data_Loss (it would help if that was clean, but it will still be of some use even if not).

OK ... that, although rather tedious, process will enable you to find all lost data records - at some future point - if it turns out that you need to, etc.

  2. Ongoing hardware faults on Production Server

I would suggest that you restore the last known good full backup AFTER the Repair_Allow_Data_Loss to your test server (and use DBCC to check that it was clean).

If that is OK then I suggest you restore (to test) all LOG backups after that, using STANDBY. This will leave the database in "restoring" state but let you interrogate it. My thought is that you could continue to restore ALL Log Backups from Production into that TEST database, check that you continue to have a "clean" database, and then if there comes a point-in-time where you need to sort out Production (once the hardware is fixed) you can take a final Tail log backup on Production, restore that to Test, and then take a full backup of Test and restore that to Production. If that is all "clean" there will be zero data loss. (There are a few other operational issues - such as preventing all external access to Production before taking your Tail log backup, and so on, but we can help with those if & when you get to the point of actually needing to do this)

I do think it would be worthwhile getting a Restore on Test running, if only to prove that your Log Backups are NOT corrupted, because if they are clean you can suffer some hassle on Production whilst the hardware is getting fixed, in the knowledge that you have a zero-data-loss solution available to you. Do NOT run a Repair_Allow_Data_Loss on Production again. If you get to the point where APPs are breaking because of database errors, then you will need to take the Tail log backup on Production, restore it to Test, use DBCC to check that Test is clean, and if so back up Test and restore that over Production; that will get Production clean again (but run DBCC on Production at that point to make sure that the restore hasn't immediately been corrupted).
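A sketch of the STANDBY restores and the eventual tail-log backup (names and paths are made up):

-- On Test: apply each Production log backup, leaving the database read-only between restores
RESTORE LOG [TestStandby]
    FROM DISK = N'D:\Backups\DBNAME_Log_nnn.trn'
    WITH STANDBY = N'D:\Data\TestStandby_undo.dat';

-- On Production, when the time comes to fail over: back up the tail of the log
-- (NORECOVERY leaves Production in the RESTORING state; add NO_TRUNCATE if the data files are damaged)
BACKUP LOG [DBNAME]
    TO DISK = N'D:\Backups\DBNAME_TailLog.trn'
    WITH NORECOVERY;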

Yes, that's fine. All it has done is throw away the index and then make a brand new one in its place. I can't think of anything else that might be dependent on this, or related to it, that you would also have had to do - i.e. this is fine as a completely standalone task with no consequences. No doubt other good folk here will speak up if I'm wrong on that!

There could be several reasons behind consistency errors in a database; a lot of the time bad hardware is the cause. Take a look over here: https://sqlpowershell.wordpress.com/2016/07/11/fix-database-consistency-errors-reported-by-dbcc-checkdb/

It might be helpful for you.

Good point in that article about setting the page verify option to CHECKSUM; particularly on old databases which have been upgraded through newer versions of SQL Server, it may well still be set to TORN_PAGE_DETECTION or :scream: NONE.

CHECKSUM will provide the earliest warning that something has gone belly-up, possibly at the risk of some user-experience disruption.
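Checking and changing that setting is a one-liner each way (database name is a placeholder):

-- See what each database currently uses (NONE, TORN_PAGE_DETECTION or CHECKSUM)
SELECT name, page_verify_option_desc FROM sys.databases;

-- Switch to CHECKSUM; note that only pages written after the change get a checksum added
ALTER DATABASE [DatabaseName] SET PAGE_VERIFY CHECKSUM;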

Hi Kristen and Everyone
Thanks so much for all your help so far
I have someone assisting with the issue at the moment.
I am going to try your suggestions above in the test environment.
Thanks again
Dan