AlwaysOn replicas lose connection every 15 minutes but with no negative effect

RogerMack · December 20, 2018, 1:09am

I was running an Extended Events session filtered on "severity > 10" for something unrelated when I noticed these messages appearing every 15 minutes, around the clock.
No issues show up in the SSMS AG Dashboard, nor in a separate Extended Events session that captures the "Availability Group" events.
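
For reference, a session along these lines would capture the same kind of events. This is only a sketch: the session name `errors_sev_gt_10`, the ring buffer target, and the choice of actions are assumptions, not necessarily what was actually running. Collecting the client application and host name may help show which connection is raising the messages.

```sql
-- Hypothetical session name; captures every error with severity above 10.
CREATE EVENT SESSION [errors_sev_gt_10] ON SERVER
ADD EVENT sqlserver.error_reported (
    ACTION (
        sqlserver.client_app_name,   -- which application raised the error
        sqlserver.client_hostname,   -- which machine it connected from
        sqlserver.session_id,
        sqlserver.username,
        sqlserver.sql_text
    )
    WHERE ([severity] > 10)
)
ADD TARGET package0.ring_buffer      -- in-memory target, no file path needed
WITH (STARTUP_STATE = OFF);
GO

-- Start collecting.
ALTER EVENT SESSION [errors_sev_gt_10] ON SERVER STATE = START;
GO
```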

This is in production. We are not seeing any AG failovers. The users seem totally unaffected, but it concerns me.

Over a 2-second interval I receive about 307 errors (28 unique, repeating), with error numbers covering the full range from 41401 through 41428.
All of the messages have:

Severity 16
State 1
Category 2

Example messages:

Availability group is not ready for automatic failover.
The availability group is not ready for automatic failover. The primary replica and a secondary replica are configured for automatic failover, however, the secondary replica is not ready for an automatic failover. Possibly the secondary replica is unavailable, or its data synchronization state is currently not in the SYNCHRONIZED synchronization state.
Availability group is offline.
Some availability replicas are disconnected.
WSFC service is offline
The WSFC cluster is offline, and this availability group is not available. This issue can be caused by a cluster service issue or by loss of quorum in the cluster.
and a bunch of others related to these.

We have:
Two nodes in the AlwaysOn cluster with matching versions of OS and SQL Server, using synchronous commit

SQL Server 2012 Enterprise SP4
(VM) Windows Server 2016 (applied all latest Windows updates 5 days ago)

Any thoughts?

ahmeds08 · December 20, 2018, 5:57am

Anything in the cluster events?

jeffw8713 · December 20, 2018, 7:07pm

How is cluster quorum configured? Is it node majority or file share?
What messages are you seeing in the SQL Server error log?
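
The quorum question can also be answered from inside SQL Server. A minimal sketch, assuming it is run on one of the AG nodes by a login with VIEW SERVER STATE permission:

```sql
-- Quorum model and current quorum state as seen by this instance.
SELECT cluster_name,
       quorum_type_desc,    -- e.g. node majority, node and file share majority
       quorum_state_desc
FROM sys.dm_hadr_cluster;

-- Each cluster member (nodes and any witness), its state, and its vote count.
SELECT member_name,
       member_type_desc,
       member_state_desc,
       number_of_quorum_votes
FROM sys.dm_hadr_cluster_members;
```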

Some of those messages point to a network latency issue: once the session falls far enough behind, SQL Server will switch from synchronous to asynchronous commit and will not allow automatic failover. It looks like there are intermittent connectivity issues between the servers that cause disconnects, which in turn cause the send/redo queues to back up.
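
One way to check this is to query the AlwaysOn DMVs on the primary while the messages are firing. The queries below are a sketch, not a definitive diagnostic. They use only catalog views and DMVs that ship with SQL Server 2012, but what counts as a "backed up" queue depends on your workload.

```sql
-- Replica-level view: connection state, health, and the last connect error.
SELECT ar.replica_server_name,
       ar.availability_mode_desc,
       ar.failover_mode_desc,
       ar.session_timeout,                 -- seconds
       ars.role_desc,
       ars.connected_state_desc,
       ars.synchronization_health_desc,
       ars.last_connect_error_number,
       ars.last_connect_error_timestamp
FROM sys.availability_replicas AS ar
JOIN sys.dm_hadr_availability_replica_states AS ars
  ON ars.replica_id = ar.replica_id;

-- Database-level view: synchronization state and queue sizes per replica.
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.synchronization_state_desc,     -- SYNCHRONIZED expected for sync commit
       drs.log_send_queue_size,            -- KB of log not yet sent to the secondary
       drs.redo_queue_size,                -- KB of log not yet redone on the secondary
       drs.last_commit_time
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = drs.replica_id
ORDER BY ar.replica_server_name, database_name;
```

If the replicas stay CONNECTED and SYNCHRONIZED with near-empty queues during one of the 15-minute bursts, that would suggest the messages are not reflecting a real disconnect.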