HP switches SUCK (well, those with earlier firmware anyway) :o)

Discussion in 'Virtual and Cloud Computing' started by zebulebu, Jul 11, 2011.

  1. zebulebu

    I recently upgraded my infrastructure at work from ESX 3.5u4 to vSphere 4.1. The original reason for this was to take advantage of vlan tagging (I only used to have to present 2 vlans to the cluster, which, with 4x nics in each host gave me enough ports to have 2x uplinks per vswitch for redundancy, but since bringing in a third vlan I've run out of nics).

    However, after upgrading a couple of the hosts I started to see problems vmotioning between them. This had never been an issue before - I'd never had a single vmotion issue previously - but now, randomly, VMs would lose network connectivity after migrating. Straight away I thought it was an incompatibility between the two different flavours of host, but an exhaustive testing matrix seemed to indicate a wider problem - I was now even getting problems vmotioning between hosts on the same version. This, and the fact that sometimes after exactly five minutes the VM's networking would suddenly spring back into life, got me thinking about the switch. Sure enough, interrogating the CAM table after a 'failed' vmotion event, I could see that the VM's MAC was still associated with the original physical port - meaning that, for some reason, the gratuitous ARP (strictly speaking a RARP) that is triggered during a vmotion wasn't being acted on by the switch.
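    To show why the five minute figure pointed at the switch, here is a toy model of a CAM table with ageing. It is only a sketch, not HP's implementation: the MacTable class, the port names and the MAC address are made up, and it assumes a 300 second MAC age time (which I believe is the default on these switches) and that nothing refreshes or relearns the entry in the meantime.

```python
class MacTable:
    """Toy CAM table: maps MAC -> (port, last_seen). Hypothetical, for illustration."""

    def __init__(self, age_time=300):
        self.age_time = age_time  # assumed MAC age time in seconds
        self.entries = {}

    def learn(self, mac, port, now):
        # A switch learns (or moves) a MAC from the source address of any frame.
        self.entries[mac] = (port, now)

    def lookup(self, mac, now):
        entry = self.entries.get(mac)
        if entry is None:
            return None  # unknown destination: the frame would be flooded
        port, last_seen = entry
        if now - last_seen >= self.age_time:
            del self.entries[mac]  # idle entry has aged out
            return None
        return port


def simulate(rarp_processed):
    """Return (time, port) lookups for a VM that vmotions at t=0."""
    table = MacTable(age_time=300)
    vm = "00:50:56:aa:bb:cc"  # made-up VM MAC
    table.learn(vm, "A1", now=0)  # VM last seen behind the old host on port A1
    if rarp_processed:
        # The new host's RARP arrives on port A2 and moves the entry.
        table.learn(vm, "A2", now=0)
    return [(t, table.lookup(vm, now=t)) for t in (1, 299, 300)]


if __name__ == "__main__":
    print("RARP ignored:  ", simulate(False))
    print("RARP processed:", simulate(True))
```

    With the RARP ignored, lookups keep returning the old port (so traffic is sent to the wrong host) until the entry ages out at 300 seconds, at which point frames are flooded and reach the VM again. With the RARP processed, the entry moves to the new port straight away.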

    So in the end, I upgraded all the hosts in the cluster to 4.1. VMware have confirmed that there is nothing wrong with the config of the hosts in the cluster, I have checked the network config on each host and switchport, and I have reverted to non-vlan tagging (i.e. each host's vswitches are connected to a dedicated uplink native to the vlan that's being presented). I'm still getting the same problems, which means it's definitely the switch. Luckily, we're getting a new core switch in soon for redundancy anyway, so I'll be able to patch across to that, then upgrade the firmware on the old switch to bring it up to scratch without ridiculous amounts of downtime.

    Interestingly, I run my DR site on a different HP switch (5308XL - the problem switch is an 8212ZL) that has an even older version of firmware but doesn't have this issue (RARPs are processed correctly whether the ports are tagged, untagged, trunked or native). I think the problem with the 8212ZL must lie in the funky L3 crap it does. There's probably a bug in the code that means RARPs don't work correctly - HP have confirmed that the ARP part of the stack was completely rewritten in the next version up on the branch of firmware we're on.
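    For anyone wondering what the switch is actually supposed to be reacting to, this is roughly what a RARP notification frame looks like on the wire. It is a sketch built from the generic RARP layout (EtherType 0x8035, opcode 3) rather than a capture from one of my hosts, so the exact field values ESX uses are an assumption, and build_rarp_frame and the MAC address are made up for illustration.

```python
import struct

ETHERTYPE_RARP = 0x8035   # EtherType for Reverse ARP
RARP_REQUEST = 3          # opcode 3 = "request reverse"


def build_rarp_frame(vm_mac: str) -> bytes:
    """Build a broadcast RARP request sourced from vm_mac (hypothetical helper)."""
    mac = bytes.fromhex(vm_mac.replace(":", ""))
    if len(mac) != 6:
        raise ValueError("expected a 6-byte MAC address")

    # Ethernet header: broadcast destination, VM MAC as source.
    # The source MAC is what the switch uses to update its CAM table.
    eth = b"\xff" * 6 + mac + struct.pack("!H", ETHERTYPE_RARP)

    # RARP header: Ethernet hardware type, IPv4 protocol type, lengths, opcode.
    header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, RARP_REQUEST)

    # Sender and target hardware address are both the VM's MAC;
    # the IP fields are left as 0.0.0.0 (assumed here).
    payload = mac + b"\x00" * 4 + mac + b"\x00" * 4
    return eth + header + payload


if __name__ == "__main__":
    frame = build_rarp_frame("00:50:56:aa:bb:cc")  # made-up VM MAC
    print(len(frame), frame.hex())
```

    The point is that the switch doesn't need to understand the RARP payload at all. It only needs to see a frame with the VM's source MAC arrive on the new port and move the CAM entry accordingly.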

    Trawling through the VMware communities, this appears to be a problem with older Cisco IOS versions as well, so any of you running either an old IOS on Cisco, or old HP firmware, be advised that the networking is probably going to cause you problems in a VMware environment :)
     
