I have SQL 2019 Enterprise running HA. Both machines have their own disks so they need to have SQL running on startup to have synchronization working.
The issue here is that with Failover Cluster, when I move the role, the secondary STOPS the service and therefor can't failover anymore How do I prevent this?
I think you have mixed up FCI vs AG - and how they work.
For an FCI - you install SQL Server into the cluster with storage allocated to the cluster and shared across nodes. In this configuration, SQL Server will only be running on the node that is currently hosting the instance. When you failover - the storage is 'moved' to the other node and SQL Server is started on that node using the shared storage. This is done in cluster manager and managed completely by the cluster.
For an AG - you have separate instances installed on each node. Those instances are not installed into the cluster and there is no shared storage available to the cluster. Each instance has its own storage and is running as a local instance on each node. In this configuration, you create an availability group in SQL Server and failover is managed within SQL Server only. The cluster manager cannot be used to failover the AG.
So the question is - do you actually have an FCI or do you really have an AG? If an AG - then do not attempt to failover any resources using cluster manager, only failover using SQL.
Then what fails the IP over to secondary?
Are you stating two different systems, used completely separately? i.e. FCI is used when you have shared storage, vs AG is when you don't. The question then comes what allows the secondary to take the primary IP etc?
So I have a AG group, but it needs availability group listener that is the FCI?
The listener determines which instance is the primary instance and redirects the connections to that instance. It is configured inside the AG resource that is created - but is wholly managed by SQL Server and not managed by the cluster.
If you have an AG you don't have an FCI - which is a Failover Cluster Instance of SQL Server.
In an AG - when you issue the command ALTER AVAILABILITY GROUP [group name] FAILOVER on the secondary that you want to become primary, SQL Server validates that the secondary is ready for failover (i.e. synchronized) - then moves the resources to the new primary.
When you issue the above command - you will then see the resource in cluster manager being taken offline and brought online on the new node. That resource also contains the listener IP address - so the listener has been moved along with the AG resource.
If you attempt to move the resource using Failover Cluster Manager - you break the AG because cluster manager doesn't know anything about the synchronization state of the databases within that AG.
So it still needs FCI installed to do the AG?
Ok, so, don't need to mess with FCI, ok that sound good. But yet when I shutdown/restart a server, the IPs fail to the secondary, as they should, but when the primary comes up, SQL does not start, it always is set to manual. ?
No - you are still confusing the terms. An FCI is a 'Failover Cluster Instance' - which means SQL Server installed into the cluster.
For an AG - you need the cluster setup and configured. SQL Server is installed on each node as a local instance on that node. It is NOT installed into the cluster and the cluster doesn't know anything at all about SQL Server. The cluster has NO control over SQL Server or any of the disk resources for SQL Server.
As a standalone instance on each node - SQL Server will be started on each node and set to automatically start the services. SQL Server must be running on each node in order for the AG to be able to communicate to that instance and synchronize the databases included in the AG.
When you go into SQL Server on the primary replica and setup the AG - SQL Server creates a resource in the cluster for that AG. That resource is NOT managed by the cluster manager - it is managed by SQL Server. That resource will contain the AG name and listener IP address(es) - and that is what manages access to the current primary replica.
In an AG - you should never just restart a node unless you are specifically testing the failover process. The general rule is to restart the secondary - when it is back up and SQL Server is running - and all databases are synchronized, then you perform a manual failover (ALTER AVAILABILITY GROUP [group] FAILOVER) on the secondary you want to be the new primary. Once all databases have been synchronized in the AG - than you can restart the old primary (now secondary) instance.
SQL Server must be set to automatically start on each node - cluster manager doesn't know anything about SQL Server so it cannot manage the state of each instance.
I think this is where I setup HA and was wrong. .
You create the FCI Cluster and add the servers to it, but you DON'T define roles, I did this and selected the service role, hence why one would stop when we swapped them in HA. This explains why the services were always stopping during failover.
I ended up removing AG from both servers, then removed the ROLE that should not have been there. Then when I added AG again, it created the role for me. This is what I done wrong, and why it was doing what it was doing. I was unclear the relationship between FCI and AG. This SERVICE ROLE that I installed in FCI was doing exactly what it should do if you have Shared storage, only one SQL server can talk to that database. Makes sense...
So, now, I have a role, its not a service role that was installed by AG, along with the cluster.
Its clear in my head .. Thank you VERY much Jeff.
Just to be clear - you keep referring to the cluster as an FCI. The correct reference would be WSFC (Windows Server Failover Cluster) - and FCI is a specific to a single instance of SQL Server in the WSFC.
See this document: Always On failover cluster instances - SQL Server Always On | Microsoft Learn