A practical guide for VMware HA

I would like to give a brief overview of VMware HA as it stands today in the 4.1 release. Let me first say that VMware High Availability (HA) and VMware vMotion are two separate functions. VMware HA restarts virtual machines; vMotion migrates a running virtual machine from one host to another. One question I often get is "Doesn't the virtual machine just vMotion over to another host server in the event of a host failure?" It does not. In the event of a host failure, vMotion does not get involved. HA restarts the virtual machine on an available host in the cluster, provided there are resources available for your virtual machine to use. There is a lot to explain about this process, but we will get there soon. VMware HA is a big topic and you could write a book about it, but let's look at some of the key features of High Availability according to VMware:

  • Automatic detection of server failures. VMware HA automates the monitoring of physical server availability.  HA detects physical server failures and initiates the new virtual machine restart on a different physical server in the resource pool without human intervention.
  • Automatic detection of operating system failures. VMware HA detects operating system failures within virtual machines by monitoring heartbeat information. If a failure is detected, the affected virtual machine is automatically restarted on the server.
  • Smart failover of virtual machines to servers with best available resources (requires VMware DRS). Automate the optimal placement of virtual machines restarted after server failure.
  • Scalable high availability across multiple physical servers. Supports up to 32 nodes in a cluster for high application availability. VMware HA has the same limits for virtual machines per host, hosts per cluster, and virtual machines per cluster as vSphere.
  • Resource checks. Ensure that capacity is always available in order to restart all virtual machines affected by server failure. HA continuously and intelligently monitors capacity utilization and reserves spare capacity to be able to restart virtual machines.
  • Proactive monitoring and health checks. VMware HA helps VMware vSphere users identify abnormal configuration settings detected within HA clusters. The VMware vSphere client interface reports relevant health status and potential error conditions and suggested remediation steps. The Cluster Operational Status window displays information about the current VMware HA operational status, including the specific status and errors for each host in the VMware HA cluster.
  • Enhanced isolation address response. Ensures reliability in confirming network failure by allowing multiple addresses to be pinged before declaring that a node is isolated in the cluster.

____________________________________________________________________________

Automatic detection of server failures.

This key feature describes the very nature of VMware High Availability. The monitoring takes place between all nodes in a VMware cluster. Yes, the VMware HA environment has a hierarchy just like any organized structure: there are primary nodes and secondary nodes. A cluster has up to 5 primary nodes, and one of those 5 acts as the (active) "master" in control of the other 4 primaries. What happens if the master dies? One of the remaining 4 primaries is elected as the master, and a secondary node is promoted to primary. This maintains the hierarchy of 5 primary nodes.

How do you tell which hosts in your cluster are primary nodes? As of the 4.1 release, you can tell from the GUI which are primary and which are secondary! I am a GUI fan, but I also like the CLI (command line interface). Under the cluster Summary tab, there is a "Cluster Operational Status" pop-up that shows you the role of each host and any HA configuration issues. However, it only displays the role of a host if there is an HA configuration issue with a host in the cluster. If you have no issues, you get a blank gray screen. It would have been nice to get a good clean list of the roles for each host from here.

You may also run a PowerCLI script to pull which hosts are primary and secondary. All hosts and vCenter must be at 4.1. It would look something like this…

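    $clusterName = "your cluster name here"
    $report = @()   # initialize the result list before appending rows to it

    Get-Cluster $clusterName | %{
        # Ask the cluster for its HA (DAS) runtime info, which lists the primary hosts
        $info = $_.ExtensionData.RetrieveDasAdvancedRuntimeInfo()
        $_ | Get-VMHost | %{
            $row = "" | Select Name,Role
            $row.Name = $_.Name
            # PrimaryHosts holds short host names, so compare against the part before the first dot
            $row.Role = &{if($info.DasHostInfo.PrimaryHosts -contains ($_.Name.Split('.')[0])){"primary"}else{"secondary"}}
            $report += $row
        }
    }
    $report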

A big thanks goes out to LucD on the VMware communities for slapping this together for me. You can also dive into the CLI by logging into any host in your cluster.

So what if I only have two nodes in my cluster, or only 5? Then all hosts are considered primaries, with one of them being the master. Once you add your 6th host to the cluster, that node becomes your very first secondary node.

Like I stated in the beginning, there is a monitoring process that takes place between all the hosts in the cluster. This monitoring process is independent of vCenter. Each host knows which hosts are the 5 primaries and which are secondaries. This is thanks to the HA agent installed on each host when you configure a cluster for HA. It sometimes gets confused with the vpxa agent, which is configured when you initially connect a host to vCenter. The vpxa agent (the vmware-vpxa service) talks to the hostd service, which relays information from the ESX kernel. Hostd is built into ESX, but vpxa is installed when you connect a host server to vCenter. I will dive into all the services running on an ESX host in a later post. Remember though, the information that the HA agent retains is stored in RAM.

Now that I have explained which agent communicates all this HA information, let's take a look at where we can view the results of this communication and how we can manipulate it. Note that several of these methods are unsupported, so try them at your own peril. Another warning: creating more complex clusters with advanced settings means more information you need to track, which means more work, which may lead to less sleep.

Why would I want to go through all this hassle to configure complex clusters anyway? Maybe you have a 32 node cluster and you want some way to manage which hosts are primaries. You may also have a blade infrastructure in which you need to ensure primary hosts stay spread across blade chassis. If you keep all of your primary hosts in one blade chassis, that chassis becomes what is known as a "failure domain": if it fails, every primary fails with it.

How can I change which hosts are the primaries and secondaries?

  1. Today in vSphere 4.1, there is an unsupported method in the GUI (thanks to Duncan at Yellow Bricks for the suggestion) that allows you to specify your primary hosts in the cluster. This advanced HA option is das.preferredPrimaries = ESXhost1, ESXhost2, ESXhost3, ESXhost4, ESXhost5. You will probably want to use the FQDN of the hosts for proper resolution. You may also specify the IP address of your host. Again, this is an unsupported option.
  2. The supported way is actually the most labor intensive way, especially if you have a 32 node cluster! Like I mentioned before, the first 5 nodes added to an HA cluster become the primaries. All other nodes in the cluster are secondaries. So, in a 32 node cluster, that gives you 27 possible secondaries. If a primary host dies, which of those 27 hosts is promoted to primary? That election process is random. To find out which host is now a primary, you would need to cat /var/log/vmware/aam/aam_config_util_listnodes.log. So if I want a certain host in my cluster to become a primary, I need to enter and exit maintenance mode on hosts, then check the listnodes log to see which node was promoted. But how can you keep the hosts from your previous attempts from becoming primaries again? You can't. You either have to leave them in maintenance mode until the correct host is elected as a primary, or you can try option 3. Also note that the election process takes place when you reconfigure HA from the cluster level or when a host is removed or disconnected from the cluster.
  3. The limit of 5 primary nodes in the cluster is a soft limit. This doesn't mean you should go bananas and add 32 primary nodes to your cluster. This option falls under the category of "unsupported" and "more complex environments = more overhead = less sleep". Did I mention I like to get my well earned sleep at night, knowing that my clusters are safe and sound in the datacenter? 🙂 Enough of this sleepy talk! To promote a 6th node (this method is unsupported), you can open the shell with "/opt/vmware/aam/bin # ./Cli". From the prompt, you can issue the command "promoteNode host1.burdweiser.com". Of course, you would enter the FQDN of your own host server. To demote one of the primaries, issue the command "demoteNode host2.burdweiser.com".

These options are the only ways to view and manipulate the primary and secondary nodes in the cluster. In a later post I will go over slot sizing, the origins of VMware HA from Legato, and advanced options from the cluster level.
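To make the primary/secondary bookkeeping a little more concrete, here is a minimal Python sketch of the behavior described above: the first 5 hosts become primaries, and when a primary fails a random secondary is promoted. This is only a toy model, not how the HA agent is implemented. The host names, the function names and the 8-node cluster are all made up for illustration.

```python
import random

MAX_PRIMARIES = 5  # the number of primaries described in this article for 4.1


def assign_roles(hosts):
    """First five hosts added become primaries; the rest are secondaries."""
    return {h: ("primary" if i < MAX_PRIMARIES else "secondary")
            for i, h in enumerate(hosts)}


def fail_host(roles, failed, rng=random):
    """Drop a failed host; if it was a primary, promote a random secondary."""
    remaining = {h: r for h, r in roles.items() if h != failed}
    if roles[failed] == "primary":
        secondaries = [h for h, r in remaining.items() if r == "secondary"]
        if secondaries:  # small clusters may have no secondary to promote
            remaining[rng.choice(secondaries)] = "primary"
    return remaining


# Hypothetical 8-node cluster: esx01..esx05 start as primaries.
hosts = [f"esx{n:02d}" for n in range(1, 9)]
roles = assign_roles(hosts)

# esx01 (a primary) fails; one of esx06..esx08 is promoted at random.
after = fail_host(roles, "esx01", random.Random(0))
primaries = sorted(h for h, r in after.items() if r == "primary")
print(len(primaries), "primaries after the failure:", primaries)
```

Because the promotion is random, you cannot predict which secondary ends up as a primary, which is exactly why the supported method above involves so much checking of the listnodes log.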

____________________________________________________________________________

Automatic detection of operating system failures.

I do not recall how long this feature has been around. By default this option is disabled. You can configure it for all virtual machines or just certain virtual machines. Keep in mind that the heartbeat for this HA feature is not sent via the NIC or any other virtual device; it is relayed to the hostd service on the ESX host from VMware Tools in your VM. If hostd does not receive heartbeats from VMware Tools, HA can also check the VM's disk I/O as a secondary measure. The interval for the disk I/O check is configurable. This is essentially HA for the virtual machines. VMs can be restarted on the same host or on another host. For a detailed guide on the options you have with VM and application monitoring, see the VMware Availability Guide.
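The two-step check described above (heartbeats first, disk I/O as a fallback) can be sketched in a few lines of Python. This is my own simplified model of the decision, not VMware's code. The function name, the parameter names and the example values are hypothetical; check the Availability Guide for the real settings and their defaults.

```python
def vm_needs_reset(seconds_since_heartbeat, failure_interval_s, io_seen_in_window):
    """Return True when the VM looks hung.

    A reset is only warranted when VMware Tools heartbeats have been missing
    for longer than the failure interval AND no disk I/O was observed during
    the I/O check window.
    """
    heartbeat_lost = seconds_since_heartbeat > failure_interval_s
    return heartbeat_lost and not io_seen_in_window


# Heartbeats are still arriving: leave the VM alone.
print(vm_needs_reset(5, 30, io_seen_in_window=False))    # False
# Heartbeats are missing but the VM is still doing disk I/O: leave it alone.
print(vm_needs_reset(45, 30, io_seen_in_window=True))    # False
# No heartbeats and no disk I/O: treat the guest as failed.
print(vm_needs_reset(45, 30, io_seen_in_window=False))   # True
```

The point of the second check is to avoid resetting a VM whose Tools service has died while the guest itself is still busy working.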

In release 4.1, you can now monitor applications within your virtual machines! But "you must first obtain the appropriate SDK (or be using an application that supports VMware Application Monitoring) and use it to set up customized heartbeats for the applications you want to monitor". To do this, you must use a tool like Hyperic. I've gotta be honest, when I first saw this feature I thought I would be able to automatically restart services via VMware Tools. That is not the case; you must purchase an application that supports VMware Application Monitoring. It is a nice addition, but it requires another product to use it.

 

____________________________________________________________________________

Smart failover of virtual machines to servers with best available resources

If you have sized your clusters properly and you have available resources to restart virtual machines after a host failure, Distributed Resource Scheduling (DRS) will relocate virtual machines to another host in the cluster if it is deemed necessary to balance the cluster according to your migration threshold settings. HA will restart your virtual machines on other hosts in the cluster according to the admission controls you have set. This HA failover has the potential to create an imbalance of resources used across your cluster (RAM and CPU). DRS gathers metrics over time (every 5 minutes) to gauge any imbalance in the cluster. HA actually has a new feature in release 4.1 that helps curb this resource fragmentation, which we will talk about soon.

Since we are talking about new features in the 4.1 release, one of the newest and greatest is the set of affinity / anti-affinity rules in DRS. You can now create groups of hosts and use them to separate VMs. Before, all you could do was create rules to either keep certain VMs together or keep them apart. This is something that I believe all blade architectures have been waiting on for a long time, especially if you are looking to keep VMs separated by blade chassis.

Confused? Well, if I have a failover application XYZ and I want to make sure I am fully redundant across my blade architecture, I need to make sure its VMs stay separated across different blade chassis. Let's say application XYZ is installed on two VMs and the application itself has a built-in failover feature. If VM1 with application XYZ in chassis 1 fails (or the host fails), then VM2 with application XYZ needs to take over. If VM2 is also sitting in chassis 1 and the chassis goes down, then you just lost the XYZ application (and your company could be losing money by the second!). But if you had VM2 placed in chassis 2, then everything would be safe! Along come the new host groupings in DRS. You can tell DRS to keep VMs separated across these host groups. So in chassis 1 I have 6 hosts, and I create group 1. In chassis 2, I have 6 hosts for group 2. All hosts are a part of the same cluster. You simply create a new VM anti-affinity rule to say "keep these two VMs separated across these groups".

Keep these two things in mind when creating these affinity / anti-affinity rules in DRS.

  1. Let's say VM1 (with application XYZ) is on a host that fails. HA will then restart VM1 on the next available host in the cluster, unless you have specified a failover host in your HA admission controls. This means that VM1 could be restarted on the very same host that VM2 is running on.
  2. DRS will evaluate the cluster after a period of 5 minutes to check for an imbalance in resources, and for rule violations! So, if VM1 happens to be restarted on the same host as VM2, DRS will move that VM back to chassis 1 (or group 1).
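If you want to reason about what DRS is checking for in step 2, here is a small Python sketch of the idea: given where each VM currently runs and which hosts belong to which group, report any group that holds more than one of the VMs that are supposed to stay apart. This is only an illustration of the rule, not DRS itself. The host names, group names and function names are made up.

```python
def group_of(host, host_groups):
    """Return the name of the host group that contains the given host."""
    for name, members in host_groups.items():
        if host in members:
            return name
    return None


def find_violations(placement, separated_vms, host_groups):
    """Return {group: [vms]} for every group holding more than one separated VM."""
    by_group = {}
    for vm in separated_vms:
        group = group_of(placement[vm], host_groups)
        by_group.setdefault(group, []).append(vm)
    return {g: vms for g, vms in by_group.items() if len(vms) > 1}


# Two hypothetical blade chassis with 6 hosts each, all in the same cluster.
host_groups = {
    "group1": {f"chassis1-esx{n}" for n in range(1, 7)},
    "group2": {f"chassis2-esx{n}" for n in range(1, 7)},
}

healthy = {"VM1": "chassis1-esx1", "VM2": "chassis2-esx3"}
# After a host failure, HA happened to restart VM1 next to VM2.
after_ha_restart = {"VM1": "chassis2-esx3", "VM2": "chassis2-esx3"}

print(find_violations(healthy, ["VM1", "VM2"], host_groups))           # {}
print(find_violations(after_ha_restart, ["VM1", "VM2"], host_groups))  # {'group2': ['VM1', 'VM2']}
```

In the second case the rule is violated until DRS moves VM1 back to a host in group 1.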

Pretty magical, huh? Before, there was a little overhead to keep track of where the VMs might have moved to, even though you created anti-affinity rules for them. You could of course run a PowerCLI one-liner to report the location of certain VMs. It would look like this: Get-VM | Select Host, Name | Sort Name, Host. That will just give you a quick list, but you might want something a little cleaner if you are dealing with hundreds or thousands of VMs.

The timing of a VM failover in an HA event has not really changed in release 4.1. Just to review: if your cluster's host isolation response is set to "shut down" for virtual machines (the default in 4.1), the VM will be restarted on another host in the cluster at the 15 second mark.

____________________________________________________________________________

Scalable high availability across multiple physical servers

This simply states the fact that you can have 32 hosts in a VMware cluster. For the full list of maximums within the cluster, please see the VMware Configuration Maximums document. Even if you max out those pretty new host servers, you still have to keep in mind the number of virtual machines you can host in the cluster and scale the resources for each virtual machine appropriately.

Pay close attention to which HA admission controls make sense for your environment. Don't take the lazy road and choose to disable HA admission controls. Most admins do this when rolling out the first clusters in vCenter but forget to go back and scale things appropriately. Disabling the HA admission controls allows you to "Power on VMs that violate availability constraints". Doing this is like overcrowding a train.

Yes, you can overload a host just like an overcrowded train. You can pack a ton of VMs on a host, but things will slow to a crawl. By default, only 32 VMs will power up on a host at one time, unless you have created restart priority levels for your VMs. HA will continue to restart the remaining VMs (if you have enough leftover resources in your cluster). DRS will eventually even things out if needed. Be careful how you overprovision! Just because you see free space in your cluster doesn't mean you should take it all!

____________________________________________________________________________

Resource checks

Before the 4.1 release, a failed-over VM could be granted more resource shares than what was available on the host, causing a real drag on resources until DRS balanced things out. Remember that HA calculates resources based on VMs that are restarted after an HA event.

To help avoid the overcrowded train scenario above, VMware retooled the way it does the HA failover. Now in the 4.1 release, before the failed virtual machine is restarted on another host, HA will actually create a "test" virtual machine identical to your failed VM to test for available resources. When HA determines that resources are available for this test VM, it is deleted and your failed VM is restarted on the host. This process allows for better placement of failed virtual machines and reduces fragmentation of resources.

There are not many details in any documents on this process to create and destroy a test VM. What I've been able to find out so far is that extra storage is not required: this process is just a simulation by HA, and the test VM is not even powered on.
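Since the details of the test VM are thin, the best I can offer is a rough Python sketch of the general idea of checking for capacity before placing a VM. To be clear, this is my own simplification and not how HA actually does it. The Host class, its fields, the host names and the "most free memory wins" tie-break are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    free_cpu_mhz: int
    free_mem_mb: int


def pick_restart_host(hosts, need_cpu_mhz, need_mem_mb):
    """Return a host that can fit the VM, or None if no host has room."""
    candidates = [h for h in hosts
                  if h.free_cpu_mhz >= need_cpu_mhz and h.free_mem_mb >= need_mem_mb]
    if not candidates:
        return None
    # Assumed policy: prefer the host with the most free memory, then CPU.
    return max(candidates, key=lambda h: (h.free_mem_mb, h.free_cpu_mhz))


# Hypothetical surviving hosts after a failure.
survivors = [
    Host("esx02", free_cpu_mhz=1000, free_mem_mb=2048),
    Host("esx03", free_cpu_mhz=6000, free_mem_mb=16384),
    Host("esx04", free_cpu_mhz=4000, free_mem_mb=8192),
]

target = pick_restart_host(survivors, need_cpu_mhz=2000, need_mem_mb=4096)
print(target.name if target else "no host has enough free resources")  # esx03
```

The takeaway is simply that the check happens before the restart, so a VM is not dropped onto a host that cannot actually carry it.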

 

____________________________________________________________________________

Proactive monitoring and health checks

As mentioned before, the "Cluster Operational Status" window provides a clear view of any HA misconfiguration issues in the cluster. The HA agent checks the health of the cluster every 10 seconds (the das.sensorPollingFreq option) and reports any issues with HA to vCenter.

____________________________________________________________________________

Enhanced isolation address response

This feature is not new to release 4.1. It was introduced in vCenter 2.0.2. This function is used by the ESX (or ESXi) host when it is unable to contact other hosts in the cluster. All hosts in a cluster send "heartbeats" to each other every second. If a host stops receiving heartbeats and also gets no response from its isolation address (the default isolation address is the gateway of the Service Console), the host considers itself isolated. With the default das.failuredetectiontime of 15 seconds, the host detects on the 13th second that it is possibly isolated and pings the isolation address; on the 14th second, if it is still isolated, it triggers the isolation response (so das.failuredetectiontime minus 2 and minus 1). This is important because when you increase das.failuredetectiontime to 20 seconds, these steps move to the 18th and 19th second.

You can have up to 10 isolation addresses, but if you add more it is a good idea to increase das.failuredetectiontime to 20 seconds. For every isolation address you add, you will need to add 2 seconds at a minimum. If you use 2 in total, 20 seconds is enough; if you increase it to 4, you should set das.failuredetectiontime to at least 25 seconds.
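The "minus 2 and minus 1" arithmetic is easy to get wrong in your head, so here is a tiny Python helper that spells it out. It only encodes the timings described in this article (ping at failure detection time minus 2, isolation response at minus 1, restart at the failure detection time itself). The function name and dictionary keys are my own.

```python
def isolation_timeline(failure_detection_s=15):
    """Return the second at which each isolation step happens."""
    return {
        "ping_isolation_address": failure_detection_s - 2,
        "trigger_isolation_response": failure_detection_s - 1,
        "restart_on_other_hosts": failure_detection_s,
    }


print(isolation_timeline(15))
# {'ping_isolation_address': 13, 'trigger_isolation_response': 14, 'restart_on_other_hosts': 15}
print(isolation_timeline(20))
# {'ping_isolation_address': 18, 'trigger_isolation_response': 19, 'restart_on_other_hosts': 20}
```

Whatever value you choose, the host gets exactly one second between pinging its isolation address and acting on the result, which is why adding isolation addresses calls for a larger failure detection time.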

What happens in this case? The default isolation response is to shut down the VMs (a graceful shutdown if VMware Tools is installed), after which the virtual machines are restarted on other hosts in the cluster that are still considered alive. This is of course only possible if at least one of the 5 primary hosts in your cluster is not in an isolated state. If the VMs have not finished shutting down after 300 seconds, they are powered off (like pulling the power cord on a physical box). This value is configurable through das.isolationshutdowntimeout.

The isolation address VMware is referring to is a redundant address that can be pinged in the event heartbeats are no longer received over the service console (management network). With the default settings, this address is contacted on the 13th second, just two seconds before VMs are restarted on other hosts in the HA cluster. Now, if you changed your VM isolation response from "shut down" to "leave powered on", the host will retain the VMFS lock on the vmdk files for the VM. There are some considerations for this that will be addressed later, but your storage configuration can produce different results if you are using NFS, FCoE or iSCSI, which can lead to the infamous "Split Brain" scenario. Why do this? Perhaps your host is not really down but it cannot ping your isolation address. If this is the case, your host will still consider itself isolated, but the VMs will continue to run. The other hosts in the cluster will attempt to start the VMs from the isolated host, but the file locks in VMFS will prevent this. If the host is truly down, the locks on the VMs' files in VMFS will be released and the VMs will be restarted on other hosts.

The important thing to take away from this highlighted feature is that HA heartbeats have a redundant feature: an isolation address that can be reached, "just in case". How do you configure an isolation address? You simply create a das.isolationaddressX entry (where X is a number) in the HA advanced properties. It is recommended to have a secondary service console on all hosts in the cluster. Redundancy everywhere is not a bad thing.
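As a rough illustration of what ends up in the HA advanced properties, the Python sketch below builds the option names and values for a list of isolation addresses. Treat it as a sketch under a few assumptions: I number the entries starting at das.isolationaddress1, and I write das.failuredetectiontime in milliseconds, which is my understanding of how the 4.x option is specified (verify this against VMware's documentation before you use it). The IP addresses and the function name are placeholders.

```python
def build_isolation_options(addresses, failure_detection_s):
    """Build HA advanced options for the given isolation addresses."""
    if not 1 <= len(addresses) <= 10:  # the article's limit of 10 addresses
        raise ValueError("expected between 1 and 10 isolation addresses")
    options = {f"das.isolationaddress{i}": addr
               for i, addr in enumerate(addresses, start=1)}
    # Assumption: the option value is given in milliseconds.
    options["das.failuredetectiontime"] = str(failure_detection_s * 1000)
    return options


# Two placeholder addresses with the 20 second detection time suggested above.
opts = build_isolation_options(["192.0.2.1", "192.0.2.2"], 20)
for name, value in opts.items():
    print(f"{name} = {value}")
# das.isolationaddress1 = 192.0.2.1
# das.isolationaddress2 = 192.0.2.2
# das.failuredetectiontime = 20000
```

Pick addresses that every host can reach over its management network; an isolation address that is itself flaky defeats the purpose.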

6 thoughts on “A practical guide for VMware HA”

  1. Hi james,
    This guide is very good. Thank you!

    I have an issue with VMware HA. Could you please help me?
    I would like to know if it's possible for HA to migrate VMs to another host when some vSwitches (production VLAN or iSCSI vSwitch) are down (switch failure or cable unplugged) and the Service Console is not down.
    Thanks,
    Regards.

    • Assuming you have VMs only on the vSwitch in question, as long as your HA network (service console by default) is reachable, HA will restart VMs on other hosts regardless of whether your VMs' vSwitch NICs are down or up. Your VMs' vSwitch connectivity is not a requirement that HA considers when doing a restart, nor is the VLAN ID. As long as the vSwitches have the same label, HA will “restart” those VMs on another host. Remember, HA will not migrate a VM; HA restarts a VM after a host has failed. If you do not have redundant links for your VM vSwitches, consider creating an alarm in vCenter for those connections.

  2. Hi,
     
    Is it possible to set up HA across 3 VMware servers without using any kind of switch in between? For example, connecting all 3 servers together with direct cables, using 2 network ports on each server, to make a ring for sending heartbeats?
     

    • Remember, HA heartbeats go over the management network. I think it would be possible to do this as a secondary HA heartbeat address. But if you are referring to vMotion so that you can isolate traffic and not require a physical switch, I think it would be possible. But you would be looking at utilizing two separate NICs on each host with no redundancy (4 if you wanna be redundant). I see scenarios all the time utilizing two hosts directly connected to isolate the vMotion traffic, so it would be possible.
