It pays to pay attention to the details!
I created a Windows Server Failover Cluster (WSFC) on a bunch of Windows 2012 R2 virtual machines running in Azure, and at first it all just worked. I wrote a quick little PowerShell script to install the cluster features, and then used the MMC console to do what I’ve done a hundred times before: create a basic cluster and add my nodes to it.
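The setup boiled down to something like this; the node and cluster names below are placeholders rather than my actual script:

```powershell
# Rough sketch of the setup (names are placeholders, not my real ones)
Import-Module ServerManager

# Install the Failover Clustering feature and its management tools (run on each node)
Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools

# Create the cluster itself (I did this part through the MMC console instead).
# Note: with no -StaticAddress supplied, the cluster IP falls back to DHCP -
# which, as you'll see below, is exactly the trap I walked into.
Import-Module FailoverClusters
New-Cluster -Name "DEMO-CLUSTER" -Node "NODE1", "NODE2"
```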
Except this time I didn’t do one thing…
Since these were just test VMs on Azure, I shut down my cluster nodes and domain controller at the end of the day, and booted everything back up the next day.
Today, I was going to install a clustered SQL Server 2016 instance… I loaded up the ISO, ran through the installer, and ruh-roh, SQL said it couldn’t validate my cluster.
I jumped back into the Failover Cluster management console and tried to connect to my cluster… it wouldn’t connect.
I’d had a similar issue a few days prior, so I loaded up PowerShell, imported the failoverclusters module and tried to start the cluster node with the -fixquorum option… no deal.
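For reference, that attempt looked roughly like this (the node name is a placeholder):

```powershell
# Roughly what I tried: force the node up even though quorum couldn't be established
Import-Module FailoverClusters
Start-ClusterNode -Name "NODE1" -FixQuorum
```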
At this point I was stumped. I jumped onto one of my other cluster nodes and validated the basics: could I talk to the domain controller, the file share witness and each of the other cluster nodes?
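Nothing fancy, just the sort of basic connectivity checks below (hostnames are placeholders):

```powershell
# Basic reachability checks from each node (hostnames are placeholders)
Test-Connection -ComputerName "DC01"  -Count 2   # domain controller
Test-Connection -ComputerName "FSW01" -Count 2   # file share witness
Test-Connection -ComputerName "NODE2" -Count 2   # the other cluster node
```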
Everything seemed to be working fine.
Except on the original cluster node I was logged on to: I could not, for the life of me, ping the other cluster nodes. From the other cluster nodes, I could ping it just fine… what was going on?!
I loaded up the Failover Cluster management console on one of the other nodes, and lo and behold, it connected. Weird. I scratched my head thinking we had set up some weird Network Security Groups or Windows Firewall was acting up, but everything pointed to “you’re going crazy, man”.
It was then that I realised I could ping the original cluster node from my secondary node; how could I ping that IP in one direction, but not get anything back the other way? I had already covered the firewall side of things, so there was something odd going on here.
Something odd, or something silly…
You see, we had set up reserved private IPs in Azure to assign internal IP addresses to each of the cluster nodes, and this was working great. Node 1 was using 10.2.1.10, and node 2 was using 10.2.1.6. On top of that, I had set up the cluster to also use a DHCP-assigned address, blindly assuming Azure would just give it one.
NO NO NO NO NO. This was not the case.
When the cluster was first set up, it used DHCP on the active node and got the address 10.2.1.10, the same address as the cluster node itself. So when the cluster failed over as a result of me shutting down the servers, it tried to renew the lease for 10.2.1.10 and subsequently assigned that address to the new active cluster node, now node 2.
Mystery solved.
Pro tip: don’t use DHCP to assign your cluster IP on Azure. Set a static IP address!!!!
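In Failover Clustering terms, that means switching the cluster’s IP Address resource from DHCP to a fixed address that no node owns. A minimal sketch of that fix is below; the resource name and address are assumptions, so adjust them to your own cluster:

```powershell
# A rough sketch - switch the cluster's IP Address resource from DHCP to a static
# address that no node is using (resource name and address are assumptions)
Import-Module FailoverClusters

$ipResource = Get-ClusterResource -Name "Cluster IP Address"

$ipResource | Set-ClusterParameter -Multiple @{
    "Address"    = "10.2.1.100"      # a free address in the cluster subnet
    "SubnetMask" = "255.255.255.0"
    "EnableDhcp" = 0                 # turn DHCP off for this resource
}

# Bounce the resource so the new address takes effect
Stop-ClusterResource  -Name "Cluster IP Address"
Start-ClusterResource -Name "Cluster IP Address"
```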