SQLTeam.com | Weblogs | Forums

SQL 2012 AlwaysOn... doubts


#1

We have (or I'd better use the past tense: "we had") a master SQL 2012 SP4 Enterprise box replicating to a remote twin SQL box with a WSFC cluster and the AlwaysOn features. Our SQL consultants set it up several months ago; they said it was the best (and most cost-effective) HA scenario for our ERP system, and that our developers could also offload the main system by running reports on the remote read-only DBs. That looked great, yeah.
Last week the remote SQL server had a major crash (100% CPU usage, system barely usable), the master SQL server was bogged down too, and the Cluster service couldn't start at all... we had to reboot the secondary server to get a fully working master SQL server... until a new crash (freeze) at the remote site several hours later made the main system crawl again...
So my questions are: how can a passive server/copy take down the master SQL box? How can the Cluster service on the master SQL server be so fragile and unreliable? Is this normal behavior in an AlwaysOn HA scenario? Is it something related to SQL 2012? An immature technology??

Thank you!


#2

First, that is unusual behavior. It should be very difficult for a secondary to affect the primary.

Can we clarify a few things?

  1. Are all the nodes in the Availability Group? You used the word "replicating". Since SQL Server has a separate "replication" feature, I want to make sure we're talking about the same thing.

  2. Did you ever find the root cause of the "crash" on the remote server? What you described is often caused by a poorly written query or by heavy server load.

  3. The cluster service not starting is unusual. Did you find the root cause of that?
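To answer the first question, the AG metadata views will show exactly which servers are replicas and in what mode. A sketch, using the standard HADR catalog views (run on the primary; if it returns no rows, the databases are not in an Availability Group at all):

```sql
-- Lists each availability group and its replicas.
SELECT ag.name AS ag_name,
       ar.replica_server_name,
       ar.availability_mode_desc,           -- SYNCHRONOUS_COMMIT / ASYNCHRONOUS_COMMIT
       ar.secondary_role_allow_connections_desc
FROM sys.availability_groups AS ag
JOIN sys.availability_replicas AS ar
  ON ar.group_id = ag.group_id;
```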

I've used AlwaysOn extensively, dealt with all kinds of weird situations, and it has worked well. Most of my recent experience is with SQL Server 2016, but I can't imagine SQL 2012 is that bad.


#3

Hello Graz,
thank you for your kind response.

For starters, we only had two systems in the group: one primary SQL box and a remote secondary box... a few critical DBs had an asynchronous read-only copy there, and some reporting engines were using one of those DBs for heavy queries.
Devs and consultants are still investigating the cause of those frequent freezes and the high CPU usage, but we were baffled by this weird behavior... if the secondary node goes down, the primary node shouldn't be affected at all...
We've seen plenty of "Unable to access database 'xxxxxx' because its replica role is RESOLVING which does not allow connections" error messages on the primary node, but they're still working on that...
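For the record, the replica role at any given moment can be checked from a DMV. A sketch, using the SQL 2012 HADR state view (RESOLVING means the replica cannot tell whether it is primary or secondary, typically because the WSFC cluster has lost quorum):

```sql
-- Run on either node to see the local replica's current state.
SELECT ars.role_desc,                    -- PRIMARY / SECONDARY / RESOLVING
       ars.operational_state_desc,
       ars.connected_state_desc,
       ars.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states AS ars
WHERE ars.is_local = 1;
```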

I'm glad to know AlwaysOn is a reliable technology in SQL 2016, and we are wondering if this was just a poor configuration...


#4

And we've got plenty of these messages:
"The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk."


#5

If you lose quorum, you are most likely going to lose the primary and possibly the secondary as well.

How is your quorum configured? Is there a third "thing" that can vote on quorum?

If you only have two nodes in the cluster and the secondary goes to 100% CPU, it may not be able to respond to heartbeat requests; then you lose quorum and the primary goes down.
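A quick way to see the current quorum model and each node's vote, sketched in PowerShell (cmdlets from the FailoverClusters module, run on either cluster node):

```powershell
Import-Module FailoverClusters

# Current quorum model (NodeMajority, NodeAndFileShareMajority, ...)
Get-ClusterQuorum

# Each node's vote: NodeWeight 1 = votes, 0 = does not vote
Get-ClusterNode | Format-Table Name, State, NodeWeight
```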


#6

As @graz mentioned, you need to confirm how the cluster is configured. For the cluster to stay healthy, more than 50% of the quorum votes must be online.

If you only have 2 nodes, and each node has 1 vote, then when one of the nodes is unavailable (restarted, patching, etc.) you only have 1 quorum vote out of 2 possible votes. That is not greater than 50%, so the cluster will shut down.

If the secondary node is in another data center, then you will need either a file share accessible to both nodes that you can assign as the file share witness, or a 3rd node in the cluster. Either way you will have 3 quorum votes, and a single server being restarted or unavailable will still leave the cluster healthy.
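Adding the file share witness is a one-liner, sketched here with a placeholder share path (the share must be reachable from both data centers and writable by the cluster name object):

```powershell
# Switch the cluster to Node and File Share Majority
# ("\\witness-server\ClusterWitness" is a placeholder path).
Set-ClusterQuorum -NodeAndFileShareMajority "\\witness-server\ClusterWitness"
```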

It sounds to me like you have multiple issues here:

  1. Cluster configuration issues
  2. Network latency issues
  3. Possible report code issues

There are several ways the secondary can affect the primary, depending on how things are configured and how the availability groups were created.