Failover works, Failback does not

Camosun · May 4, 2020, 6:32pm

The server that we want to be primary will failover to the secondary server but the databases/AGs will stay with that secondary server (marked as Primary) They do not Failback. I suspect it is because, in Failover Cluster Manager both are marked as Preferred Owners.

Figuring that it would be as easy as unchecking the secondary server so both are Possible Owners but only the primary server was a Preferred Owner would be the solution. Making this change (unchecking the secondary server) seems to revert back to both being selected as Preferred Owners. Sometimes in minutes, sometimes in a hour or more (not sure how long).

am I on the right track, is the problem that the secondary server should NOT be a Preferred Owner?
How do I correct this if the GUI is enforcing the original settings?

This has driven me to my wits end over a few months. Even considered setting up a nightly reboot to force failback.

Thanks in advance.

graz · May 5, 2020, 2:45am

It shouldn't fail back unless there's an issue with the second server. It won't return to the first server on it's own.

Camosun · May 5, 2020, 1:25pm

So there is no way to have a primary server?

We would like Server 1 to be the top dog. Server 2 is the failover point. The idea would be that when Server 1 goes down the databases failover to Server 2. When Server 1 comes back, Server 2 hands the databases back and quietly fades into the background waiting for the day it can be the hero again.

One reason this is a problem is that Server 2 is Read-Only. Scripts that import data will fail when the primary connection is to Server 2 for too long. From everything that I've read it appears possible but the how of it seems to be contentious.

You're saying this feature was implemented, added into the settings but has never worked, can't work, and is just impossible? That seems very odd that failback is offered but can't be done?

graz · May 5, 2020, 4:42pm

Making one server always be primary isn't a feature of Availability Groups that I know of. Sorry Windows clustering has the concept of preferred nodes (I think) but AGs don't. The AG is mainly using the cluster for quorum and to handle the shared name and IP.

I had a client that tried to make that work and it caused them no end of trouble. You are much better to think of it as two equal servers that can each host the databases. You are best served to design applications and such so they don't care which server they run on.

If you really want one to be a primary, the only work around I know of is to write a script that runs every so often on the "wrong" and fail back to the "right" node.

As far as importing data, I think you'll be much happier if you can make those run when either server is the primary. That usually means moving them off the AG nodes or having some way to run it from either node. You can use sys.fn_hadr_is_primary_replica to determine if a node is currently the primary.

jeffw8713 · May 5, 2020, 4:48pm

That option is for an FCI instance - not for availability groups. The cluster manager does not manage failover or failback for availability groups - that is managed within SQL Server.

Part of the problem is how that secondary is configured and the current mode. Once you failover to the secondary the new primary will be switched to asynchronous - because the old primary is no longer available. When the old primary is back up and available SQL Server can switch to synchronous mode if (and only if) the databases in the AG are fully synchronized.

Until all databases in the AG are fully synchronized - a failover cannot be performed automatically. You can force the failover but that might (could) cause data loss...

Failover Cluster Manager knows nothing about the state of the databases and therefore cannot initiate a failover event.

Camosun · May 5, 2020, 6:33pm

Thanks for the clarification guys. This has helped in digging in our heals and telling the department that has the issue they need to rewrite their code.

jeffw8713 · May 5, 2020, 6:50pm

For this type of configuration - I would recommend 3 nodes and 2 secondaries. Two of the nodes would be configured for HA with the 3rd node configured for read-only. At no time would you allow an automatic failover to the 3rd node - it would be setup as asynchronous only and would not include a listener.

Users would access the read-only secondary directly so all imports/exports/reports would always reference that node specifically.

In this configuration - you could even setup the HA nodes using a traditional FCI. That would require reinstalling though so probably not an option at this point. With that setup you would have 2 AG's - the HA AG with automatic failover and a listener and the RO AG with manual failover and asynchronous and no listener.

You could also reconfigure the cluster and add a file/disk quorum - and then set the node weight on the RO node to 0. This would allow the RO node and one of the HA nodes to be down at the same time without impacting the cluster.

There are many more options...depending on your goal.