Can High Availability Lag affect the Primary Server?

We are still using SQL 2014.

We had some significant lag on one of our Secondary Server involving High Availability. It was bad enough that it stopped trying to catch up, until we restarted the SQL Service. That solved the problem until a week or so later it happened again. This happened several times until we gave up and dropped that Server from the Availability Group.

During this same period, we were having some performance issues on the Primary Server (i.e. same AG groups), during peek hours.

A few people on our team tried to say it was very likely that these 2 issues were connected. i.e. the H.A. Lag on that Secondary Server caused/contributed to the Performance problems on the Primary. For the most part, these 2 events didn't occur at the same time.

Has anyone experienced H.A. lag causing performance issues on the Primary - or think it is possible in some way? I personally don't see how this is possible. Aside from the obvious that the Tran Log files will continue to grow and eventually take over the Drive on Production.

This eventuality is a bit of a concern. That should not happen if you plan for it well and have mitigation plans in place, disk size bloat alerts, backups,maintenance plans etc

If you have a synchronous replica, it absolutely can impact primary performance. Write activity on the primary will hold locks until the synchronous replicas acknowledge that the log records are hardened. This can block any other access to the affected tables, depending on the transaction isolation level. Even without blocking, AG network traffic can saturate the connection if there's enough activity, leaving less bandwidth for other requests. Changing to an asynchronous replica may alleviate some of this stress.

Recommend not using synchronous AG replicas if the 2 instances are not in the same physical location, unless you have a dedicated network circuit between locations. Even tens of milliseconds of latency can become problematic. At my last job we had an asynchronous AG replica between 2 different Azure data centers (bandwidth and latency were not problems), and we still had it fail to replicate to the point where we had to remove it. While it didn't impact performance much that we noticed, it did cause the transaction log to explode in size, which made the sync even further behind.

Sorry - I forgot to mention, this is all Asynchronous.

Good point about Network Traffic. We have Application traffic only going to the Primary. So could the traffic going from the Primary to the Secondary for H.A. still be problematic for the Primary?

Yes. For instance, if you rebuild the clustered index on a large table with many non-clustered indexes, all of them will be rebuilt as fully logged operations, all of which need to be replicated. None of this is actually changing data, by the way, it's just rewriting the structures. The log will likely grow, or at least fill up. If you rebuild the index as an online operation, that will allocate additional data pages, possibly (likely) growing the data file too. All the allocations add log records that need synchronization.

If you're synchronizing over an internet connection, life's going to suck if you have a lot of write activity. If you have the option to use a dedicated circuit, a WAN or LAN, or just some lesser-used subnet in your network, that will help your AG traffic. Any isolation from other network activity always helps.

Also check your cluster settings and network adapter settings. If you're not replicating to the cloud, or within the cloud, check TTL, PTR and adapter settings for things like jumbo frames. Make sure to research any setting BEFORE changing it, you can easily make things worse (e.g. jumbo frames)

2 Likes

Here's some AG videos that might help you:

If your log files needed to be extended during this time - that would definitely impact performance on the primary. Transaction log files need to be zeroed and that takes time - depending on the size of the extent added to the file and how many times it needs to be extended.

Yes, this is within Azure though separate Data Centres.

I'm still trying to understand the example where we rebuild the clustered index on a large table with many non-clustered indexes.

Yes, the chances of Lag will be greater. Yes the Log files will grow. But assuming we have enough drive space, how could this affect performance on the Primary?

For the index rebuild, disk write activity would impact performance, as well as increased network synchronization traffic. All of this has CPU overheard on the primary, whether the AG replicas are lagging or not. There's still some acknowledgement traffic on an async replica, though it's much less than a sync replica. If there's a completely broken connection between primary and secondary you'll lose some performance on the primary.

Don't get me started on cluster quorum on Azure virtual machines. There are a few other aspects of Azure VMs that impact AG performance, like the aforementioned PTR record. Suggest reviewing this and see if your configuration could use some tweaking:

From my purely anecdotal personal experience, IMO there are some gremlins in Azure that impact AG performance, and I was never able to get a reliable answer as to the cause. MS would investigate and find nothing concrete to fix. We did have one incident that was pretty serious, in the end MS found that they had defective network hardware that our VMs were using. Reprovisioning a new VM fixed that problem, and they ultimately removed the bad hardware.

I know this probably doesn't help you much, and it's just my opinion. If you're continuing to see performance impact that you can't explain or remedy, open a support ticket with Microsoft.

Why would the log files grow? Are you shrinking them after an index rebuild? If not - then the log files wouldn't need to grow once they reach the size needed to support the index rebuilds.

It seems you have 2 issues - the first is that you generate a lag large enough that you removed that database/server from the AG. The second is that you identified performance issues occurring at the same time as this lag - and the performance issues are occurring during peak hours.

For the first issue - you need to identify what processes are causing the lag to occur. What large transaction is being run that generates so much data that the AG falls behind. Identify that process and you can then determine the next steps. If there are no large transactions - then something on that server is preventing the redo log from being applied and you need to identify that process.

For the second issue - you need to identify the blocking processes causing performance issues.

Either way, just assuming that the AG is causing the problem isn't enough. It may be the send queue on the primary - but more likely the problems are related to some other process and the lag on the secondary is just another symptom.

3 Likes

I meant that if the lag is bad enough, SQL Server can't truncate during log backups. And the Logs will eventually be forced to grow.

What was happening was, after the check DB (not sure why), SQL Server just stopped trying to update the DBs on the secondary server. The sync lag just continued to grow, until we restarted the service. So I turned off the check DB job on the secondary and we didn't have any problems since. But then there was a group decision to remove that secondary. But we still obviously kept one secondary, which is in the same Data Centre. With that server, we didn't have any significant lag.

I agree we shouldn't assume the AG issue caused the performance problems. And personally I did not think it did.

Based on this thread, the only contributions AG issues could have towards Production performance are potentially significant but not likely the most significant, being: A) zeroing out the logs and B) increased Network activity.

Thankyou.

Yes, we've had Azure issues as well. Like short blackouts where either the Server or the Network just dies.

Agreed. But so far this is something we haven't proven that the AG is causing the Production problems. Yet at the same time we haven't proven it's not.

With regards to the Network activity, I personally don't know of a way to determine something like: x% of our Network Activity is due to AG and it's causing y% additional CPU in Production.

If you can risk a few things, you might try:

  1. Logging the perfmon counters for availability group traffic, and then suspending synchronization. You'd need to do this during problematic activity, and compare them (syncing vs. suspended). You'll at least get some insight as to the total AG bandwidth in use. Including Network Bytes/s will let you get a percentage of total.

  2. Set up some extended events to capture availability group events, and again test during sync and suspension. There's the AG session that is automatically added, but you'll want to set up a new XE session with other events. You'll incur I/O overhead on this, as there could be thousands of events captured, so I'd put this as a last resort and only if you absolutely cannot get the info any other way. You'd be better off filing a case with Microsoft and running that to completion before trying this, and ask them if they'd recommend it.

Have you looked at the 2 videos I posted earlier?

Yes. I have watched both. Very informative.