VMware HA issue - solved after two months!

Discussion in 'Virtual and Cloud Computing' started by zebulebu, Dec 23, 2008.

  1. zebulebu

    zebulebu Terabyte Poster

    I've had a poxy little problem that has been gnawing away at me for a couple of months now - finally resolved it today, and am recording the issue here for posterity in case anyone else comes across the same issue.

    I have a cluster of three hosts - two of which were put in to start with, the third a few weeks later. Ever since I put the third in and configured HA on it I've been getting, regular as clockwork at the top of each hour, a lovely little red warning icon over host three together with an accompanying ESX event which states that the host has 'an error'. This then clears itself without problems about 30 seconds later. I've not seen any obvious adverse effects, the host is still functional and anything going on (VMotion events etc) continues without interruption and a persistent ping to the host shows no dropped packets. This post on VMTN describes the problem perfectly, so I at least knew it wasn't just me that had it!

    After ferreting around in the options for HA on the cluster I discovered the 'advanced' tab - which allows you to set various parameters relating to the isolation check ESX uses to determine whether a host is unreachable (and, therefore, dead to the cluster). After doing some reading I have found that multiple service console (COS) connections can cause problems when VC attempts to determine whether hosts are still alive. Like a good little VMware soldier, all my hosts (natch) have two COS connections for failover (mainly because I got sick of the 'no management network redundancy' nag at the top of the screen whenever I fired up the VI client). So, after much beard scratching, I added the following two entries to my cluster:

    das.isolationaddress (set to the IP address of the default gateway, DG)
    das.usedefaultisolationaddress (set to false)

    Hoorah! 'Error' resolved - I no longer get that horrible red triangle every hour. It seems that there is a bug in VC (I'm running 3.5u2) which, when hosts have multiple COS connections, results in VC convincing itself (albeit temporarily) that one of the hosts is offline. If anyone else is experiencing this problem they might want to look at these two (as far as I know) undocumented hacks. Also, if your default gateway (DG) blocks ICMP (as many will) it might be worth putting a different address in there that you know is pingable - say an internal switch management IP or similar.
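
For anyone wanting to sanity-check a candidate address before typing it into the advanced options, here is a rough Python sketch. It is only an illustration: the function names are made up for this post, the ping is sent from whatever machine runs the script (ideally you'd test from the hosts' own service consoles), and the ping exit code is only a rough signal of reachability. It picks the first candidate that answers ICMP and prints the two entries described above.

```python
import platform
import subprocess
from typing import Callable, Dict, Iterable, List, Optional


def ping_command(address: str, count: int = 2) -> List[str]:
    """Build a ping command line; the count flag differs on Windows."""
    count_flag = "-n" if platform.system().lower() == "windows" else "-c"
    return ["ping", count_flag, str(count), address]


def responds_to_ping(address: str) -> bool:
    """Return True if ping exits with status 0 for this address."""
    result = subprocess.run(
        ping_command(address),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def pick_isolation_address(
    candidates: Iterable[str],
    probe: Callable[[str], bool] = responds_to_ping,
) -> Optional[str]:
    """Return the first candidate the probe reports as reachable, else None."""
    for address in candidates:
        if probe(address):
            return address
    return None


def ha_advanced_options(isolation_address: str) -> Dict[str, str]:
    """The two HA advanced entries from this post, as key/value pairs."""
    return {
        "das.isolationaddress": isolation_address,
        "das.usedefaultisolationaddress": "false",
    }


if __name__ == "__main__":
    # Hypothetical addresses: default gateway first, then a switch management IP.
    chosen = pick_isolation_address(["192.168.1.1", "192.168.1.250"])
    if chosen is None:
        print("No candidate answered ping; pick another address.")
    else:
        for key, value in ha_advanced_options(chosen).items():
            print(f"{key} = {value}")
```

The probe is passed in as a parameter so the selection logic can be tried out without touching the network; the values still have to be entered by hand in the cluster's HA advanced options.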
    Certifications: A few
    WIP: None - f*** 'em
  2. onoski

    onoski Terabyte Poster

    Thanks for sharing Zeb, helpful:)
    Certifications: MCSE: 2003, MCSA: 2003 Messaging, MCP, HNC BIT, ITIL Fdn V3, SDI Fdn, VCP 4 & VCP 5
    WIP: MCTS:70-236, PowerShell
