When good vCenters go bad

The idea of virtualizing the vCenter server is not new. I believe it was version 4.x that really started to push the virtual vCenter hard (the "eat your own dog food" approach), and 5.x gave us the Linux-based vCenter Server Appliance. Even with the virtual appliance, there are special considerations to keep in mind when running vCenter as a virtual machine. Although resource requirements have changed since version 4.x, best practices around creating and placing the virtual vCenter have not really changed. Getting out of a jam with vCenter typically comes down to understanding your vSwitch configuration. In the past some have relied on vCenter Server Heartbeat, but that reached End of Availability (EoA) on June 2, 2014.
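
For example, when vCenter is down you can still review a host's vSwitch and uplink layout straight from the ESXi shell. A quick sketch against a standard vSwitch (vSwitch names will vary in your environment):

esxcfg-vswitch -l                        # list vSwitches, port groups, and their uplinks
esxcli network vswitch standard list     # the same information in esxcli form
esxcli network vswitch standard policy failover get -v vSwitch0    # active/standby uplinks for a given vSwitch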

Mandvis has a couple of good posts on recovering vCenter during an outage, and on special considerations around using virtual hardware version 10 on your vCenter server.

I would also like to point out a couple of other scenarios to keep in mind when placing your virtual vCenter server on a host, and when recovering it during an outage.

Scenario 1:

You have a blade chassis with different fabrics: Fibre Channel, 10Gb, and 1Gb management. Virtual machines are connected to the 10Gb fabric and host management is connected to the 1Gb fabric. Fibre Channel storage is the primary storage for virtual machines and traverses the fibre fabric. NFS volumes are mounted to house ISO files and templates, and that traffic traverses the 1Gb network. I had a situation where the network admin ran 4 uplinks from each 1Gb fabric and properly split them between upstream switches, which would be a proper design (see diagram below). But instead of bonding the four 1Gb cables from each switch, only 1 cable out of the 8 was actually active to the upstream switch. From the blade perspective, all NICs looked active. So when we lost the network on that upstream switch, we lost management to the entire enclosure hosting the VMware blades.

blade connections
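
From the host side, the link state of each vmnic can be checked from the ESXi shell, but as this incident showed, a blade's internal fabric can report link up even when the upstream path is dead, so it pays to confirm CDP/LLDP information or the upstream switch configuration as well. The commands below are examples:

esxcli network nic list              # link state, speed, and duplex for every vmnic
esxcli network nic get -n vmnic0     # driver and link details for a single uplink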

This also affected the vCenter server, which had an ISO file attached to its CD-ROM drive from the NFS mount over the 1Gb network. This caused the VM to "pause" with a warning message:

Message on vCenter01: Operation on CD-ROM image file
/vmfs/volumes/16b2bd7c-1d7757ef/VMware/VMware-VIMSetup-all-
5.5.0-1991310-20140201-update01.iso has failed. Subsequent
operations on this file are also likely to fail unless the image file
connection is corrected. Try disconnecting the image file, then
reconnecting it to the virtual machine’s CD-ROM drive. Select
Continue to continue forwarding the error to the guest operating
system. Select Disconnect to disconnect the image file.

As you can see, the VM would not resume until action was taken on the CD-ROM from the host console. We could not acknowledge the prompt from vCenter, because the vCenter VM itself was the one in a paused state. That meant knowing which host the vCenter VM lived on, which is why it is still best practice to create a DRS rule keeping the vCenter VM on a known host (sometimes the first host in the cluster is best). Once the message was acknowledged from the host, vCenter came out of its paused state.
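
For reference, answering the question for the vCenter VM has to be done on the host running it. Roughly, from that host's ESXi shell (VM and question IDs will differ in your environment):

vim-cmd vmsvc/getallvms | grep -i vcenter                 # find the vCenter VM's ID
vim-cmd vmsvc/message <vmid>                              # show the pending question and its choices
vim-cmd vmsvc/message <vmid> <question-id> <choice-id>    # answer it, e.g. "Disconnect"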

Scenario 2:

The host freeze: not a PSOD, but the hypervisor going into a hung state. I have only seen this happen once. Even from the DCUI you are unable to restart the management agents, yet the virtual machines continue to run. You are unable to log in to the host console to take action on any of them. It is a "zombie host" state. I'm not sure if the HA host isolation election even kicked in.
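
Normally, when a host has merely dropped out of vCenter, restarting the management agents from the DCUI ("Restart Management Agents") or the ESXi shell is the first thing to try. In this hung state even that was not possible, but for completeness:

/etc/init.d/hostd restart     # restart the host agent
/etc/init.d/vpxa restart      # restart the vCenter agent on the host
services.sh restart           # or restart all management agents at once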

We accepted that the only course of action was to pull the power cord on the host server to force a failover. With that done, HA should kick in and fail over the virtual machines. But even after powering off the host, the virtual machines stayed registered to it. Even a manual "unregister" was not accepted while the host was powered off, and the host would not release the locks on the VMDK files. We had to remove the host from vCenter and then re-register the virtual machines on new hosts in the cluster. So it may have been a combination of the vCenter DB and the host isolation response. This was the first time I have seen HA not work properly, and even VMware support could not pinpoint the issue.
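
For anyone hitting the same thing: once the dead host was removed from vCenter, re-registering a VM on a surviving host can be done from the datastore browser, or from the new host's ESXi shell, roughly like this (datastore and VM names are placeholders):

vim-cmd solo/registervm /vmfs/volumes/<datastore>/<vm-name>/<vm-name>.vmx
vim-cmd vmsvc/getallvms       # confirm the VM's new ID
vim-cmd vmsvc/power.on <vmid>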

So what do you do when you are in a scenario like this, vCenter is on the host that is hung, and the locks on the VMDK files will not release even after the host is powered off? I would imagine you would need to do something nasty to the storage volume to release those connections, or possibly restore vCenter from backups to another host server.
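
Before doing anything drastic to the volume, it is at least worth identifying which host still owns the lock. vmkfstools can dump the lock information for a VMDK, including the MAC address of the owning host (path is a placeholder):

vmkfstools -D /vmfs/volumes/<datastore>/<vm-name>/<vm-name>-flat.vmdk
# the "owner" line in the output ends with the MAC address of the host holding the lock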

Of course there are other recovery scenarios you have to keep in mind with vCenter: the DB becoming full, OS corruption, misconfiguration by other admins (like deleting the wrong SQL tables), no DB backups, or issues with any of the other components (like SSO) installed on your vCenter server.

 
