AOAG on Azure - can't reach listener

I've set up an Always On Availability Group in Azure. Four VMs:

  1. Domain Controller DC
  2. SQL Server 1
  3. SQL Server 2
  4. Windows 10 client machine

The servers are all running Windows Server 2012 R2 Datacenter. SQL is 2012 SP3. Everything is up to date.

I built an AG using the two SQL servers. It works and I can fail over back and forth without issues. There is also a listener (call it AGListener) which I can connect to from SQL Server 1 only. That's the first issue

The second issue is that I cannot connect to the listener from the Windows 10 client machine either (probably the same as the first issue, now I think of it).

From SQL 2 and the Win 10 VM, I can run nslookup AGListen. It works and returns the correct IP. When I try to ping that IP, I get "host unreachable" which is nonsense to me since all four vms are on the same subnet (10.0.0.5, 6, 7, 8) as is the listener (10.0.0.100) and the cluster (10.0.0.254)

Any clue what I'm missing or how to debug this?

Are SQL Server 1 and 2 in different subnets? If so, you'll need to specify MultiSubnetFailover=True in your connection string. You can do that in SSMS using the Options button and then Additional Connection Parameters screen. Just copy/paste it into there.

You can only use MultiSubnetFailover=True if the database driver supports it.

Handy chart: https://msdn.microsoft.com/en-us/library/hh510238(v=sql.120).aspx

SSMS does support it.

What happens when a database driver supports it and the multi-subnet connection parameter is included is that both IP addresses are tried at the same time. Whichever one returns first is the one you'll connect to.

Without the proper database driver and without the connection parameter, the IP addresses are tried one at a time. By the time it tries the second one, the connection probably will have timed out already.

The order of the IP addresses will be 50/50, meaning some times you'll get the "online" IP first but sometimes second. That's why ping will sometimes show the "offline" IP.

Oh I just noticed you said they are ALL in the same subnet.

I would still try the MultiSubnetFailover=True connection parameter as it's recommended to use it for all AG connections as it'll connect faster even if it's not a multi-subnet config.

What error are you getting?

Try flushing DNS on SQL 2 and the Windows 10 machine. Seems like it's not able to resolve the listener to an IP.

Did you create an Azure Load Balancer for your VMs?

no, I didn't

Did that

TITLE: Connect to Server

Cannot connect to greenchairporta.


ADDITIONAL INFORMATION:

A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - The wait operation timed out.) (Microsoft SQL Server, Error: 258)

For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft%20SQL%20Server&EvtSrc=MSSQLServer&EvtID=258&LinkId=20476


The wait operation timed out


BUTTONS:

OK

no change. Note that I've also tried to connect using the listener's IP address. Same result

Take a look here:

https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-windows-classic-ps-sql-int-listener/

Depending on your deployment model (Classic or Resource Manager) you'll have to do different configurations. You'll also have different methods under Classic if you're using Internal or External LBs.

The first link is for a Classic deployment and it very soup-to-nuts, make sure to read it all the way through before doing anything. If you've already set up the VMs you can probably skip everything about it except the Load Balancer.
The KB2854082 part is probably unnecessary but verify it anyway, especially if you still encounter problems.

Steps 9, 10, 11 and 12 are the critical ones. MAKE SURE YOUR POWERSHELL RUNS AS ADMINISTRATOR. (this is a very frustrating requirement). You can probably set up the load balancer through the Azure Portal, but the cluster stuff has to be done in PowerShell.

The 2nd link covers Resource Manager deployments, but there are buttons for Classic as well. If you're using Classic the first link should be enough.

One other thing: check your DNS server for the Listener name entry, and also check its "Update associated pointer (PTR) record" setting is true.

Here's a snippet of PowerShell that should get you set up on the cluster side, after the LB is configured:

$AG = Get-ClusterResource | Where-Object { $_.resourcetype -eq "SQL Server Availability Group" }
$Listener = Get-ClusterResource | Where-Object { $_.Name -like $AG.Name + "*" -and $_.resourcetype -eq "Network Name" }
$IP = Get-ClusterResource | Where-Object { $_.Name -like $AG.Name + "*" -and $_.resourcetype -eq "IP Address" }

Get-ClusterResource $IP.Name | Set-ClusterParameter -Multiple @{"ProbePort"="59999";"OverrideAddressMatch"=1;}
Get-ClusterResource $Listener.Name | Set-ClusterParameter -Name "PublishPTRRecords" -Value 1
Get-ClusterResource $Listener.Name | Set-ClusterParameter -Name "HostRecordTTL" -Value 60

The PublishPTR setting should set the DNS properly (see above). HostRecordTTL typically defaults to 1200 and can lengthen failover times, so adjust accordingly (60 in the example here). Google that setting to find advice on how to configure it.

I'm using RM model here. Hadn't seen that article before. It actually makes things a little too easy! I really wanted the "roll your own" experience. My purpose is not to deploy an AG for business use but to learn to build and troubleshoot one to gain experience.

Ok so I'd like to do this, but getting lost in the DNS Manager. From the Domain Controller, in DNS Manager all I see is one entry: the one for the DC. Is this normal? None of the other machines in my lab are there, nor is the AG listener or the cluster.

Got this error:

Set-ClusterParameter : Parameter 'PublishPTRRecords' does not exist on the cluster object 'GreenChairAG_10.0.0.10
If you are trying to update an existing parameter, please make sure the parameter name is specified correctly. Yo
check for the current parameters by passing the .NET object received from the appropriate Get-Cluster* cmdlet to
Get-ClusterParameter”. If you are trying to update a common property on the cluster object, you should set the
property directly on the .NET object received by the appropriate Get-Cluster* cmdlet. You can check for the curre
common properties by passing the .NET object received from the appropriate Get-Cluster* cmdlet to "| fl *". If yo
trying to create a new unknown parameter, please use -Create with this Set-ClusterParameter cmdlet.
At line:1 char:32

  • ... ce $IP.Name | Set-ClusterParameter -Name "PublishPTRRecords" -Value 1
  •               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    
    • CategoryInfo : InvalidOperation: (:slight_smile: [Set-ClusterParameter], ClusterCmdletException
    • FullyQualifiedErrorId : InvalidOperation,Microsoft.FailoverClusters.PowerShell.SetClusterParameterCommand

Can you expand anything under the DC node? You should see "Forward Lookup Zones", "Reverse Lookup Zones", and some others.

To confess I haven't used Resource Manager and I know there are differences, but still, a missing DNS entry tends to suggest part of the problem. Can you ping/tracert/nslookup the listener from the other machines? If not, can you connect to the listeners IP address?

yup I see my Domain there. and under that, _sites, _msdcs, DomainDnsZones, ForestDnsZones

Nope, that's why I posted.

From the current Primary, yes, from other machines on the same subnet, no

Hmmmm, there's an AD permission to "Create Computer Object" that is required for the user that creates the cluster and the AG, I'm assuming you're a domain admin or you've had that permission assigned? Does the cluster machine show up in AD and DNS?

the cluster has been created, it is up and active. So is the AG. the cluster and Listener are both in AD. Neither is in DNS (but then it doesn't matter, since there is no public access anyway. only from the local subnet)

Last thing I can think of is connecting to the listener using the fully qualified name (listener.domain.com).

Is the listener configured in the Cluster Manager? Does it have the proper dependency on the listener IP address?

I'm not a DNS expert but I'm pretty sure that will need to be in place for it work properly. Is the DNS server listed in the SQL nodes TCP config?

Same sad story:-(

Yup and and yup

Yup

How about the subnet mask on the listener IP? And the other machines as well? That is the last thing I can remember getting snagged by. Make sure to do an ipconfig /flushdns after any of these changes (might want to try one before as well).

I forgot to mention earlier that PublishPTR is visible in the cluster manager GUI for the listener resource. If there's no checkbox for it then I think there's a problem.

After that, I have to punt. I'd suggest contacting Microsoft, or hit up the #sqlhelp tag on Twitter. Point them to this thread.

Good luck, sorry I couldn't be more help.

I'm in the Cluster Manager GUI but I can't see that property anywhere. Where would I find it, specifically?