VMware HA issue - solved after two months!

zebulebu · Dec 23, 2008

I've had a poxy little problem gnawing away at me for a couple of months now. I finally resolved it today, and am recording it here for posterity in case anyone else comes across the same issue.

I have a cluster of three hosts - two of which were put in to start with, the third a few weeks later. Ever since I put the third in and configured HA on it I've been getting, regular as clockwork at the top of each hour, a lovely little red warning icon over host three, with an accompanying ESX event which states that the host has 'an error'. This then clears itself without problems about 30 seconds later. I've not seen any obvious adverse effects: the host is still functional, anything going on (VMotion events etc) continues without interruption, and a persistent ping to the host shows no dropped packets. This post on VMTN describes the problem perfectly, so I at least knew it wasn't just me that had it!

After ferreting around in the options for HA on the cluster I discovered the 'advanced' tab, which allows you to set various parameters relating to how HA determines whether a host is isolated (and, therefore, dead to the cluster). After doing some reading I found that multiple COS (service console) connections can cause problems when VC attempts to determine whether hosts are still alive. Like a good little VMware soldier, all my hosts (natch) have two COS connections for failover (mainly because I got sick of the 'no management network redundancy' nag at the top of the screen whenever I fired up the VI client). So, after much beard scratching, I added the following two entries to my cluster:

das.isolationaddress (set to the IP address of the default gateway, DG)
das.usedefaultisolationaddress (set to false)
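For anyone who wants to keep a record of these settings outside the VI client, here is a minimal Python sketch that writes the two entries out as key/value pairs. The helper name `build_ha_advanced_options` and the dictionary layout are made up for illustration, and `192.0.2.1` is only a placeholder address. It does not talk to VirtualCenter at all; the values still have to be typed into the cluster's advanced options by hand.

```python
import ipaddress


def build_ha_advanced_options(isolation_ip):
    """Return the two HA advanced options from the post as key/value strings.

    `isolation_ip` is the address HA should ping to test for isolation
    (the default gateway in the original post). The function name and the
    dict layout are hypothetical; nothing here contacts VirtualCenter.
    """
    # Raises ValueError if the string is not a valid IP address.
    ipaddress.ip_address(isolation_ip)
    return {
        "das.isolationaddress": isolation_ip,
        "das.usedefaultisolationaddress": "false",
    }


if __name__ == "__main__":
    # 192.0.2.1 is a documentation-only placeholder address (RFC 5737).
    for key, value in build_ha_advanced_options("192.0.2.1").items():
        print(f"{key} = {value}")
```

The only real work the helper does is rejecting a mistyped address before it ends up in the cluster settings.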

Hoorah! 'Error' resolved - I no longer get that horrible red triangle every hour. It seems that there is a bug in VC (I'm running 3.5u2) which, when hosts have multiple COS connections, results in VC convincing itself (albeit temporarily) that one of the hosts is offline. If anyone else is experiencing this problem they might want to look at these two (as far as I know) undocumented hacks. Also, if your default gateway blocks ICMP (as many will) it might be worth putting a different address in there that you know is pingable - say an internal switch management IP or similar.
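Before settling on an isolation address it is worth confirming that it really does answer ping. Below is a rough Python sketch of such a check, assuming the system `ping` command is on the path. The function names `ping_command` and `is_pingable` are hypothetical, and the `runner` parameter exists only so the check can be exercised without a network. Run it from a machine on the same network as the service console, otherwise the result says little about what the host itself can reach.

```python
import platform
import subprocess


def ping_command(address, count=1):
    """Build a ping command line for the current operating system."""
    # Windows ping takes -n for the packet count; Linux and macOS take -c.
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    return ["ping", count_flag, str(count), address]


def is_pingable(address, timeout=5.0, runner=subprocess.run):
    """Return True if the ping command exits with status 0 within `timeout`."""
    try:
        result = runner(
            ping_command(address),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Treat a hung or missing ping binary as "not pingable".
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Placeholder address (RFC 5737); substitute the candidate isolation address.
    candidate = "192.0.2.1"
    print(candidate, "answers ping" if is_pingable(candidate) else "does not answer ping")
```

One caveat: on some platforms `ping` can exit with status 0 even when the reply is an "unreachable" message from a router, so treat a positive result as a hint rather than proof.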

onoski · Dec 23, 2008

Thanks for sharing, Zeb - helpful.
