Hi, I hope someone can help. I will try to summarise as best as I can.
We have four servers in a cluster with an AG set with Database Health Detection. Three servers are on site1 with the fourth on a separate DR site2. We are testing DR by cutting the link between the two sites.
As expected, when the link is cut, the three servers on site1 stay up and the server on site2 loses quorum and the database goes into recovery mode until the link is re-established.
If I perform a manual failover to site2 before we cut the link, the same thing happens and I have to force the quorum manually to bring the server back up on site2.
This isn't a big deal, but I would prefer it to stay up, so I added an Azure witness to the cluster. However, testing has given us unexpected results.
- With an Azure witness, if I shut down the 2 secondary servers on site1 before we cut the link (leaving just one server on each site), then cutting the link initiates an automatic failover to site2. This is great, but I want the other servers to be left up and running.
- If I leave the 3 servers on site1 up and we cut the link, the server on site2 loses Azure (even though it still has internet) and the server loses quorum. That I don't understand at all.
Wouldn't the server in site2 think that it was the last remaining server and fail over to become the primary?
Additionally, I've tried to find out what the exact events are that happen when a failover occurs.
- Does the primary tell the secondary to fail over and, if so, how does that work if the link is cut?
- Or does the secondary realise that there is no handshake and assume it's now the Primary?
I've tried searching for sources of information. Any help would be greatly appreciated.
Thanks, those links are useful, but don't really answer my questions about why the secondary goes into recovery mode even with the Azure witness, or my additional questions. But I do appreciate the resources.
Quorum needs to be more than half the available votes - and since you lose connectivity with Azure you no longer have quorum. The reason you lose access to Azure is most likely due to routing and network - which needs to be allowed from site2.
Hi Jeff,
Thanks for your answer. That would be the case if the quorum was static, but the quorum is dynamic, so the single server and Azure witness can maintain quorum, as seen when I shut down two of the servers on site1 before cutting the link.
Not if you cannot access Azure from Site2 - which appears to be the problem here. Once the link is cut between Site1 and Site2 - Azure is no longer available. You need to have at least 2 quorum votes - and your single server in DR only has 1 vote out of 2 possible - which is not more than 50% so the cluster shuts down.
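For anyone following the vote arithmetic in this exchange, here is a rough sketch in Python. It is a simplified model, not how the cluster service is actually implemented: `has_quorum` is a made-up helper, each server and the witness are assumed to carry one vote, and the only assumption about dynamic quorum is that it drops the votes of servers that were shut down cleanly beforehand.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """Strict majority: more than half of the votes currently counted."""
    return 2 * reachable_votes > total_votes


# Scenario A: all four servers up plus the witness, so 5 votes are counted
# (assumption: one vote per server and one for the witness).
site1_alone = has_quorum(3, 5)            # three site1 servers
site2_without_witness = has_quorum(1, 5)  # site2 server only
site2_with_witness = has_quorum(2, 5)     # site2 server plus witness

# Scenario B: two site1 servers were shut down before the cut.
# Assumption: dynamic quorum removed their votes, leaving 3 counted votes.
site2_with_witness_after_shutdown = has_quorum(2, 3)
site2_without_witness_after_shutdown = has_quorum(1, 3)

# The "1 vote out of 2 possible" case mentioned above.
one_of_two = has_quorum(1, 2)

print(site1_alone, site2_without_witness, site2_with_witness)
print(site2_with_witness_after_shutdown, site2_without_witness_after_shutdown)
print(one_of_two)
```

Under these assumptions, site2 only reaches a majority in scenario B and only when it can still see the witness, which lines up with the two test results described in the first post.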
I understand the principle, but even though FCM showed Azure as offline, the site2 server still had internet access and the Azure portal didn't show any issues, so that part is strange indeed. In theory though, site2 should have failed over and become the primary, right?
Unless you have the AG configured for synchronous mode and automatic failover, it will not fail over automatically. Since that is a DR site, I assume it isn't set in synchronous mode and thus cannot be set to automatic failover.
It is configured for synchronous commit with automatic failover to site2.
And back to the beginning...if Site2 does not have access to Azure then you don't have enough votes to make quorum and the cluster will be shut down because it is not healthy.
And since it isn't healthy it cannot be made the primary.
Let's assume that site2 has access to Azure. If the link between the 2 sites is lost, would the site2 server promote itself because it can no longer see site1?
If site2 becomes a primary, wouldn't I then end up with both sites having a primary server?
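On the two-primaries worry, a small sketch may help. It assumes the strict-majority rule quoted earlier in the thread ("more than half the available votes") applied to a fixed vote count, and it ignores manually forced quorum, which bypasses the check. The function names are made up for illustration.

```python
def has_majority(votes: int, total_votes: int) -> bool:
    """Strict majority: more than half of the counted votes."""
    return 2 * votes > total_votes


def sides_with_majority(total_votes: int) -> list[tuple[int, int, int]]:
    """For every split of the votes into two sides, count the sides with a majority."""
    results = []
    for side_a in range(total_votes + 1):
        side_b = total_votes - side_a
        winners = int(has_majority(side_a, total_votes)) + int(
            has_majority(side_b, total_votes)
        )
        results.append((side_a, side_b, winners))
    return results


# Highest number of sides that can hold a majority at once, for 5 and 4 votes.
max_winners_five = max(w for _, _, w in sides_with_majority(5))
max_winners_four = max(w for _, _, w in sides_with_majority(4))
print(max_winners_five, max_winners_four)
```

In this model no split of the votes ever gives both sides a majority, so only one side of a cut link can keep quorum. Forcing quorum manually, as in the earlier manual-failover test, steps outside that rule, which is where the risk of two primaries would come from.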