Error in SQL

spanjokus · February 15, 2022, 1:01pm

Good day! I have a SQL Server cluster, and roughly once every two weeks the instance restarts. In the logs I see these errors:

SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [J:\MountPoints\DB_Data_2\DB\Name_DB.mdf] in database id 23. The OS file handle is 0x0000000000002240. The offset of the latest long I/O is: 0x0000b26e132000

A component on the server did not respond in a timely fashion. This caused the cluster resource 'SQL Server (Name)' (resource type 'SQL Server', DLL 'sqsrvres.dll') to exceed its time-out threshold. As part of cluster health detection, recovery actions will be taken. The cluster will try to automatically recover by terminating and restarting the Resource Hosting Subsystem (RHS) process that is running this resource. Verify that the underlying infrastructure (such as storage, networking, or services) that are associated with the resource are functioning correctly.

Cluster resource 'SQL Server (Name)' of type 'SQL Server' in clustered role 'SQL Server (Name)' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
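The long-I/O warning quoted above packs several useful fields into one line: the occurrence count, the file, the database id, and a hexadecimal byte offset. The following is a minimal sketch, not an official parser, that pulls those fields out with a regular expression and converts the offset to a byte count and an approximate page number. It assumes the message wording matches the one quoted in this post and that data pages are 8 KB; the names `parse_long_io` and `PAGE_SIZE` are made up for this illustration.

```python
import re

# Matches the body of the long-I/O warning. The start of the message is not
# anchored, so a line with a timestamp or truncated prefix still parses.
LONG_IO = re.compile(
    r"encountered (?P<count>\d+) occurrence\(s\) of I/O requests taking longer "
    r"than (?P<seconds>\d+) seconds to complete on file \[(?P<file>[^\]]+)\] "
    r"in database id (?P<dbid>\d+)\..*?"
    r"offset of the latest long I/O is: (?P<offset>0x[0-9a-fA-F]+)",
    re.DOTALL,
)

PAGE_SIZE = 8192  # assumption: 8 KB SQL Server data pages


def parse_long_io(message):
    """Return the fields of a long-I/O warning as a dict, or None if no match."""
    match = LONG_IO.search(message)
    if match is None:
        return None
    offset = int(match.group("offset"), 16)  # hex string -> bytes from file start
    return {
        "count": int(match.group("count")),
        "seconds": int(match.group("seconds")),
        "file": match.group("file"),
        "database_id": int(match.group("dbid")),
        "offset_bytes": offset,
        "page_number": offset // PAGE_SIZE,
    }


if __name__ == "__main__":
    # Raw string so the backslashes in the Windows path are kept literally.
    sample = (
        r"SQL Server has encountered 1 occurrence(s) of I/O requests taking "
        r"longer than 15 seconds to complete on file "
        r"[J:\MountPoints\DB_Data_2\DB\Name_DB.mdf] in database id 23. "
        r"The OS file handle is 0x0000000000002240. "
        r"The offset of the latest long I/O is: 0x0000b26e132000"
    )
    print(parse_long_io(sample))
```

Running this over every warning in the SQL error log and grouping by `file` would show whether the slow I/O is concentrated on one mount point or spread across the storage.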

It seems to me that this is caused by high load on the storage system. I am trying to follow the recommendations from the Microsoft website, but I would be glad of any help.

I also ran the ResetWUEng.cmd script to check the system's integrity and look for damage; no problems were found.

robert_volk · February 15, 2022, 1:49pm

Also check your Windows event log for Warnings, Errors or Verbose entries. Check both System and Application. Look for any error on disk or I/O operations. If you can identify the component, like a driver or HBA (depending on your storage architecture), filter for all entries on that source.

I recently dealt with a problem just like yours. It turned out to be bad SAN hardware; we ultimately lost several disks and had to migrate to new equipment. We would see multiple 15-second I/O warnings in the SQL log just before the cluster went offline. The cluster log may also provide more detail, but it won't necessarily pinpoint a hardware I/O problem.

Look up the Get-ClusterLog PowerShell cmdlet. It writes the cluster log out to text files, which are easier to search than the log viewer in Failover Cluster Manager.
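Once the cluster log has been exported to text files, searching them can be scripted. The sketch below is only an illustration: it assumes the files were already written by Get-ClusterLog into a folder, that they end in `.log`, and that they can be read as UTF-8 (change `encoding` if yours differ). The search terms are taken from messages quoted in this thread, and the function names and the folder path are hypothetical.

```python
from pathlib import Path

# Illustrative search terms taken from messages quoted in this thread.
DEFAULT_TERMS = ("sqsrvres", "ERROR_IO_PENDING", "IOCTL_DISK_ARE_VOLUMES_READY")


def find_matches(lines, terms=DEFAULT_TERMS):
    """Yield (line_number, text) for lines containing any term, ignoring case."""
    lowered = [term.lower() for term in terms]
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        haystack = text.lower()
        if any(term in haystack for term in lowered):
            yield number, text


def scan_folder(folder, terms=DEFAULT_TERMS, encoding="utf-8"):
    """Search every *.log file in folder; the encoding is an assumption."""
    for path in sorted(Path(folder).glob("*.log")):
        with path.open(encoding=encoding, errors="replace") as handle:
            for number, text in find_matches(handle, terms):
                yield path.name, number, text


if __name__ == "__main__":
    # Placeholder path: point this at the folder holding the exported logs.
    folder = Path(r"C:\Temp\ClusterLogs")
    if folder.is_dir():
        for name, number, text in scan_folder(folder):
            print(f"{name}:{number}: {text}")
```

Printing the file name and line number for each hit makes it easier to compare entries across nodes around the time of a failover.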

spanjokus · February 15, 2022, 7:13pm

Thanks, I'll try the cmdlet and search its output.

spanjokus · February 16, 2022, 6:36am

Good day! Please tell me what this message means; I also saw it in the logs: HardDiskpWaitForPartitionsToArrive: Status ERROR_IO_PENDING from IOCTL_DISK_ARE_VOLUMES_READY

robert_volk · February 16, 2022, 1:55pm

I don't know; I'm not a storage expert. I suggest searching Google/Bing for those items and, failing that, contacting your storage vendor or checking their online support library/forum.