Db integrity

Mahmood · December 20, 2020, 2:33pm

Hi All,
The recovery model for all of the db including system and users are set to simple. There is sql agent job which kicks off evey Sunday and perform database integrity check. It fails every time. Here is the sql agent error message:

Message
Executed as user: NT AUTHORITY\SYSTEM. ...e 50000) Server: xxxxxxx [SQLSTATE 01000] (Message 50000) Version: 11.0.7001.0 [SQLSTATE 01000] (Message 50000) Edition: Standard Edition (64-bit) [SQLSTATE 01000] (Message 50000) Platform: Windows [SQLSTATE 01000] (Message 50000) Procedure: [master].[dbo].[DatabaseIntegrityCheck] [SQLSTATE 01000] (Message 50000) Parameters: @Databases = 'USER_DATABASES', @CheckCommands = 'CHECKDB', @PhysicalOnly = 'N', @NoIndex = 'N', @ExtendedLogicalChecks = 'N', @TabLock = 'N', @FileGroups = NULL, @Objects = NULL, @MaxDOP = NULL, @AvailabilityGroups = NULL, @AvailabilityGroupReplicas = 'ALL', @Updateability = 'ALL', @TimeLimit = NULL, @LockTimeout = NULL, @LockMessageSeverity = 16, @DatabaseOrder = NULL, @DatabasesInParallel = 'N', @LogToTable = 'Y', @Execute = 'Y' [SQLSTATE 01000] (Message 50000) Version: 2019-02-10 10:40:47 [SQLSTATE 01000] (Message 50000) [SQLSTATE 01000] (Message 50000) [SQLSTATE 01000] (Message 50000) Date and time: 2020-12-20 04:00:00 [SQLSTATE 01000] (Message 50000) Database: [BILicense] [SQLSTATE 01000] (Message 50000) Status: ONLINE [SQLSTATE 01000] (Message 50000) Standby: No [SQLSTATE 01000] (Message 50000) Updateability: READ_WRITE [SQLSTATE 01000] (Message 50000) User access: MULTI_USER [SQLSTATE 01000] (Message 50000) Is accessible: Yes [SQLSTATE 01000] (Message 50000) Recovery model: SIMPLE [SQLSTATE 01000] (Message 50000) Date and time: 2020-12-20 04:00:00 [SQLSTATE 01000] (Message 50000) Command: DBCC CHECKDB ([BILicense]) WITH NO_INFOMSGS, ALL_ERRORMSGS, DATA_PURITY [SQLSTATE 01000] (Message 50000) Outcome: Succeeded [SQLSTATE 01000] (Message 50000) Duration: 00:00:02 [SQLSTATE 01000] (Message 50000) Date and time: 2020-12-20 04:00:02 [SQLSTATE 01000] (Message 50000) [SQLSTATE 01000] (Message 50000) Date and time: 2020-12-20 04:00:02 [SQLSTATE 01000] (Message 50000) Database: [Nectari] [SQLSTATE 01000] (Message 50000) Status: ONLINE [SQLSTATE 01000] (Message 50000) Standby: No [SQLSTATE 01000] (Message 50000) Updateability: READ_WRITE [SQLSTATE 01000] (Message 50000) User access: MULTI_USER [SQLSTATE 01000] (Message 50000) Is accessible: Yes [SQLSTATE 01000] (Message 50000) Recovery model: SIMPLE [SQLSTATE 01000] (Message 50000) Date and time: 2020-12-20 04:00:02 [SQLSTATE 01000] (Message 50000) Command: DBCC CHECKDB ([AliasXXX]) WITH NO_INFOMSGS, ALL_ERRORMSGS, DATA_PURITY [SQLSTATE 01000] (Message 50000) Outcome: Succeeded [SQLSTATE 01000] (Message 50000) Duration: 00:00:05 [SQLSTATE 01000] (Message 50000) Date and time: 2020-12-20 04:00:07 [SQLSTATE 01000] (Message 50000) [SQLSTATE 01000] (Message 50000) Date and time: 2020-12-20 04:00:07 [SQLSTATE 01000] (Message 50000) Database: [NectariCubeXFTY] [SQLSTATE 01000] (Message 50000) Status: ONLINE [SQLSTATE 01000] (Message 50000) Standby: No [SQLSTATE 01000] (Message 50000) Updateability: READ_WRITE [SQLSTATE 01000] (Message 50000) User access: MULTI_USER [SQLSTATE 01000] (Message 50000) Is accessible: Yes [SQLSTATE 01000] (Message 50000) Recovery model: SIMPLE [SQLSTATE 01000] (Message 50000) Date and time: 2020-12-20 04:00:07 [SQLSTATE 01000] (Message 50000) Command: DBCC CHECKDB ([NectariCube]) WITH NO_INFOMSGS, ALL_ERRORMSGS, DATA_PURITY [SQLSTATE 01000] (Message 50000) Outcome: Succeeded [SQLSTATE 01000] (Message 50000) Duration: 00:31:33 [SQLSTATE 01000] (Message 50000) Date and time: 2020-12-20 04:31:40 [SQLSTATE 01000] (Message 50000) [SQLSTATE 01000] (Message 50000) Date and time: 2020-12-20 04:31:40 [SQLSTATE 01000] (Message 50000) Database: [NectariV8] [SQLSTATE 01000] (Message 50000) Status: ONLINE [SQLSTATE 01000] (Message 50000) Standby: No [SQLSTATE 01000] (Message 50000) Updateability: READ_WRITE [SQLSTATE 01000] (Message 50000) User access: MULTI_USER [SQLSTATE 01000] (Message 50000) Is accessible: Yes [SQLSTATE 01000] (Message 50000) Recovery model: SIMPLE [SQLSTATE 01000] ... The step failed.

Will setting the database recovery model to full solve this issue ?

Thanks
Mahmood

jeffw8713 · December 20, 2020, 4:02pm

No - changing the recovery model will not fix a corrupted database. You need to get the full details - or perform a manual integrity check for every database. Execute the following for every database:

DBCC CHECKDB({your database here}) WITH no_infomsgs, all_errormsgs;

If you post the messages from the database(s) that fail - we can help you with the next steps. With that said - the best way to recover from integrity issues is to restore from a known good backup, however I have a feeling you don't have that available as this has been going on for weeks.

I will say that running an integrity check once a week is not optimal...it should be run every day if possible and when it fails you need to address the issue immediately. Integrity issues are almost always due to some type of hardware issue - which must be addressed before you do anything else or it will cause an issue again.

You want to check your application/system event logs on the server - look for any IO related errors and have your server/storage team correct those problems.

Mahmood · December 21, 2020, 5:06pm

Thank you for your replay. When I executed DBCC CheckDB on one of the server, I got the following error:

Msg 8921, Level 16, State 1, Line 1
Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.

The drive that has Tempdb has 5 G of disk space. I suspect we will need to add more disk space.

thanks
Mahmood

Mahmood · December 21, 2020, 5:16pm

Received the following error when I executed sp_blitz:

2020-12-09 23:14:38.12 spid63 Error: 824, Severity: 24, State: 7. 2020-12-09 23:14:38.12 spid63 SQL Server detected a logical consistency-based I/O error: stale page (a page read returned a log sequence number (LSN) (283:277936:1) that is older than the last one that was written (0:0:0)). It occurred during a read of page (7:207976) in database ID 2 at offset 0x000000658d0000 in file 'G:\SQLData\tempdb6.ndf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
2020-12-20 06:27:11.94 spid64 Error: 823, Severity: 24, State: 7. 2020-12-20 06:27:11.94 spid64 The operating system returned error stale page (a page read returned a log sequence number (LSN) (289:633856:1) that is older than the last one that was written (0:0:0)) to SQL Server during a read at offset 0x0000006acec000 in file 'G:\SQLData\tempdb1.mdf'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
2020-12-21 10:47:29.15 spid55 Error: 823, Severity: 24, State: 7. 2020-12-21 10:47:29.15 spid55 The operating system returned error stale page (a page read returned a log sequence number (LSN) (291:133336:1) that is older than the last one that was written (0:0:0)) to SQL Server during a read at offset 0x00000007f6c000 in file 'G:\SQLData\tempdb5.mdf'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
2020-12-21 11:51:55.10 spid67 Error: 823, Severity: 24, State: 7. 2020-12-21 11:51:55.10 spid67 The operating system returned error stale page (a page read returned a log sequence number (LSN) (291:197056:1) that is older than the last one that was written (0:0:0)) to SQL Server during a read at offset 0x0000003a00c000 in file 'G:\SQLData\tempdb8.ndf'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

jeffw8713 · December 21, 2020, 5:59pm

I would add more space to the drive holding tempdb - and see if that fixes the issues. If that doesn't work and you still cannot get past the issues - you may be stuck with rebuilding the system and restoring the databases from backups.

Do you have backups you can recover from?

Mahmood · December 21, 2020, 7:28pm

Thanks for the suggestions. I am afraid that we may not have the backup, however, I will look into this. I just join the teams about 1 week ago.

Mahmood

yosiasz · December 21, 2020, 7:50pm

start the back up yesterday

jeffw8713 · December 21, 2020, 8:20pm

If the corruption is on the drive where tempdb resides - you could try to recreate the files. This is risky as we cannot guarantee that SQL Server will restart if the system databases are corrupted...

To rebuild the files - shut down SQL Server. Remove the tempdb data and log files (or move to a different location - just in case). Restart SQL Server...SQL Server will rebuild the tempdb files at their defined initial sizes.

If your tempdb drive is out of space - you can attempt to shrink the files using DBCC SHRINKFILE. Use the second parameter to define the size you want the file to shrink - and make all files for tempdb the same size, leaving at least 10% of the drive free.

Again...this could be risky - the system may not start back up if the corruption has affected your system databases. However...if that has happened you already have a problem where you are going to need to rebuild.

I would validate that I at least have backups that can be restored for all databases - and if not, then back them up now (if possible).

Mahmood · December 21, 2020, 8:27pm

Thank you