Problem VMware ESXi and drive access

Discussion in 'Virtual and Cloud Computing' started by Josiahb, Aug 3, 2010.

  1. Josiahb

    Josiahb Gigabyte Poster

    1,335
    40
    97
    Alright, before I go off half cocked and get made to look daft.

    The Setup

    We have a VMware ESXi host running 4 Windows Server VMs (1x 2008 server running Terminal Services for ~20 users, 2x 2003 servers running IIS for a .NET web app (live and test) and 1x 2003 running SQL Express 2005 with the DBs for the aforementioned web app and our internal intranet). It's running 2x Quad-Core AMD Opteron 8346 HE CPUs and 16GB RAM with 2x 300GB disks in a mirrored array.

    The Problem

    The web app mentioned above regularly runs into a brick wall and grinds to a halt for the 50-odd people accessing it. Both the web server and the DB server have access to 2 cores each and as much RAM as they can actually use, and I'm seeing no particular spikes in either CPU or memory usage. The company responsible for the app itself aren't seeing any problems with anyone else using it (and I have to believe that, as some of their other clients can bring a fair amount of weight to the argument if things aren't working).

    The only thing I can tally up with the slowdowns at all is some heavier disk access on the Terminal Services VM.

    The Question

    Am I likely seeing the effects of disk access problems as a result of running only 2 disks in a mirrored array? If I am, would you recommend a) rebuilding the array as RAID 5 with some more of the same disks or b) swapping out the HDDs for SSDs?
     
    Certifications: A+, Network+, MCDST, ACA – Mac Integration 10.10
  2. onoski

    onoski Terabyte Poster

    3,120
    51
    154

    It looks like there is definitely an underlying performance issue, but it's not showing up in your CPU/memory usage. I would suggest you check the log files as well as enable HA.

    Cheerio:) and let us know if you get any more info on what the issue might have been.
     
    Certifications: MCSE: 2003, MCSA: 2003 Messaging, MCP, HNC BIT, ITIL Fdn V3, SDI Fdn, VCP 4 & VCP 5
    WIP: MCTS:70-236, PowerShell
  3. zebulebu

    zebulebu Terabyte Poster

    3,748
    330
    187
    Impossible to tell without running some disk tests, but I would never run an ESXi box with local storage for production systems on RAID1. Go for RAID5 with a minimum of six spindles, a decent RAID controller and BBWC (battery-backed write cache). S*** - my test lab at home has five disks in it! What speed are the disks? Did you do any tests with IOMeter or something similar prior to chucking everything on this box? If you can't afford a SAN (and I'm guessing you can't, since you don't!) then put as many disks in as possible - increasing spindle count will increase your IOPS. Again, though, without a good RAID controller, this might not make much difference.
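
    As a rough illustration of the spindle-count point, here is a minimal sketch of the usual back-of-envelope estimate. The per-disk IOPS figure, the 70/30 read/write mix and the write-penalty table are common rules of thumb assumed for the example, not measurements from this box, and the function name is made up for illustration.

```python
# Rule-of-thumb RAID write penalties (back-end I/Os per front-end write).
# These are assumed textbook values, not figures from the thread.
WRITE_PENALTY = {"raid1": 2, "raid5": 4, "raid10": 2}


def effective_iops(disks, iops_per_disk, read_fraction, raid_level):
    """Estimate the front-end IOPS an array can sustain for a read/write mix."""
    raw = disks * iops_per_disk                 # total back-end IOPS of all spindles
    write_fraction = 1.0 - read_fraction
    penalty = WRITE_PENALTY[raid_level]
    # Each read costs 1 back-end I/O, each write costs `penalty` back-end I/Os.
    return raw / (read_fraction + write_fraction * penalty)


if __name__ == "__main__":
    # Assumed: ~175 IOPS per disk and a 70% read workload.
    for disks, level in [(2, "raid1"), (6, "raid5")]:
        estimate = effective_iops(disks, 175, 0.7, level)
        print(f"{disks} disks, {level}: ~{estimate:.0f} front-end IOPS")
```

    Under those assumptions the six-disk RAID5 set comes out roughly twice as fast as the two-disk mirror despite the higher write penalty, and a controller with a write cache can soften that penalty further. Real numbers still need a benchmark.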

    How exactly would that help?
     
    Certifications: A few
    WIP: None - f*** 'em
  4. onoski

    onoski Terabyte Poster

    3,120
    51
    154

    High availability, for redundancy, is what I meant, as it wasn't clear what exactly was causing the problem. However, your take does make a lot of sense, as he might have to take into account the RAID setup, controllers etc.
     
    Certifications: MCSE: 2003, MCSA: 2003 Messaging, MCP, HNC BIT, ITIL Fdn V3, SDI Fdn, VCP 4 & VCP 5
    WIP: MCTS:70-236, PowerShell
  5. zebulebu

    zebulebu Terabyte Poster

    3,748
    330
    187
    That isn't going to be any use when he's only got one host... :eek:
     
    Certifications: A few
    WIP: None - f*** 'em
  6. onoski

    onoski Terabyte Poster

    3,120
    51
    154

    Thanks Zeb, I only just noticed he was talking about ESXi and not ESX servers, the big daddies for hosting VMs that fully utilise the functionality of HA etc.

    Glad we have you on this forum. Much appreciated, and I always read your posts with anticipation:).
     
    Certifications: MCSE: 2003, MCSA: 2003 Messaging, MCP, HNC BIT, ITIL Fdn V3, SDI Fdn, VCP 4 & VCP 5
    WIP: MCTS:70-236, PowerShell
  7. dales

    dales Terabyte Poster

    2,005
    51
    142
    From josiahb's post I assumed he has one standalone ESXi box not connected to VC or anything. HA can exist with ESXi, as HA is a VC component. HA would not help in this instance, as HA only restarts failed VMs (or as policy defines); you're probably thinking of DRS.

    From the above I would guess the same as zeb: it sounds like there's a bit of disk thrashing going on, if the other usual suspects are showing normal. If there's no budget for external storage you might want to create a separate mirror to host the TS server by itself.
    Another thought: how many uplinks do you have in the ESXi box, and are they all running at 1Gb? You don't mention other VMs experiencing network issues when this happens, but I thought I'd mention it.
     
    Certifications: vExpert 2014+2015+2016,VCP-DT,CCE-V, CCE-AD, CCP-AD, CCEE, CCAA XenApp, CCA Netscaler, XenApp 6.5, XenDesktop 5 & Xenserver 6,VCP3+5,VTSP,MCSA MCDST MCP A+ ITIL F
    WIP: Nothing
  8. zebulebu

    zebulebu Terabyte Poster

    3,748
    330
    187
    You're confusing me now. HA works perfectly well on ESXi as well as ESX. It's the fact that he's only got one host that makes it useless in his situation - not the 'flavour' of ESX he's using.
     
    Certifications: A few
    WIP: None - f*** 'em
  9. zebulebu

    zebulebu Terabyte Poster

    3,748
    330
    187
    Good point re: the networking - I'm guessing your iSCSI traffic is on a separate VLAN, right? If it wasn't, you'd probably have experienced other issues before now.
     
    Certifications: A few
    WIP: None - f*** 'em
  10. onoski

    onoski Terabyte Poster

    3,120
    51
    154

    Thank you. Well, I'm still an amateur in VMware and yes, in retrospect I did mean DRS (Distributed Resource Scheduler).
     
    Certifications: MCSE: 2003, MCSA: 2003 Messaging, MCP, HNC BIT, ITIL Fdn V3, SDI Fdn, VCP 4 & VCP 5
    WIP: MCTS:70-236, PowerShell
  11. craigie

    craigie Terabyte Poster

    3,020
    174
    155
    Only just read this post, but would have to agree with Zeb, I would look at your disk subsystem.

    Check the average disk queue lengths and the I/Os as well.

    What speed are the drives? 15K SAS? Also, what RAID controller do you have? I'm assuming it's hardware based. Do you have the ability to add more drives?

    If you discover that it is the disk subsystem then I would look at defragmenting the SQL database and the local hard drives as well. As a well-known supermarket says, 'every little helps'.
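
    For reading the queue length figures mentioned above, a small sketch based on Little's Law (mean outstanding I/Os = I/O rate x mean time per I/O) may help relate the counters to each other. The function names and the sample readings are hypothetical, not values taken from this server.

```python
def avg_queue_length(iops, avg_latency_s):
    """Little's Law: mean outstanding I/Os = arrival rate x mean time per I/O."""
    return iops * avg_latency_s


def queue_per_spindle(iops, avg_latency_s, spindles):
    """Spread the average queue over the number of physical disks in the array."""
    return avg_queue_length(iops, avg_latency_s) / spindles


if __name__ == "__main__":
    # Hypothetical sample: 300 transfers/sec at 40 ms each on a 2-disk mirror.
    total = avg_queue_length(300, 0.040)
    per_disk = queue_per_spindle(300, 0.040, 2)
    print(f"average queue length ~{total:.1f}, per spindle ~{per_disk:.1f}")
```

    A commonly quoted rule of thumb treats a sustained queue of more than about 2 per spindle as a sign the disks are struggling, though that threshold is a guideline rather than a hard limit.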
     
    Certifications: CCA | CCENT | CCNA | CCNA:S | HP APC | HP ASE | ITILv3 | MCP | MCDST | MCITP: EA | MCTS:Vista | MCTS:Exch '07 | MCSA 2003 | MCSA:M 2003 | MCSA 2008 | MCSE | VCP5-DT | VCP4-DCV | VCP5-DCV | VCAP5-DCA | VCAP5-DCD | VMTSP | VTSP 4 | VTSP 5
  12. onoski

    onoski Terabyte Poster

    3,120
    51
    154


    Sorry, I meant to refer to DRS all along, which, with just one host running and assuming VC is out of the equation, would limit the functionality I referenced initially. However, yep, I get your point and am still going along with the idea that this might be a storage issue.

    Well at least until the OP replies with more feedback, thanks again for your valid input:)
     
    Last edited: Aug 3, 2010
    Certifications: MCSE: 2003, MCSA: 2003 Messaging, MCP, HNC BIT, ITIL Fdn V3, SDI Fdn, VCP 4 & VCP 5
    WIP: MCTS:70-236, PowerShell
  13. dales

    dales Terabyte Poster

    2,005
    51
    142


    Sorry zeb, but I don't think iSCSI is mentioned either; sounds like a complete one-box job to me.
     
    Certifications: vExpert 2014+2015+2016,VCP-DT,CCE-V, CCE-AD, CCP-AD, CCEE, CCAA XenApp, CCA Netscaler, XenApp 6.5, XenDesktop 5 & Xenserver 6,VCP3+5,VTSP,MCSA MCDST MCP A+ ITIL F
    WIP: Nothing
  14. zebulebu

    zebulebu Terabyte Poster

    3,748
    330
    187
    Ha! There's me calling out Onoski for not 'getting' the fact the dude only has one server... then making the elementary mistake of not realising he doesn't have a SAN either :biggrin

    I guess we just can't get our heads round putting a production TS VM on a local box with no redundancy, and no shared storage!
     
    Certifications: A few
    WIP: None - f*** 'em
  15. SimonD
    Honorary Member

    SimonD Terabyte Poster

    3,681
    440
    199
    You definitely need to start running some disk benchmarks. There are a number of tools out there, but IOmeter is one of the most frequently used.

    One thing that hasn't been made clear in all of this is the disk subsystem. Please tell me that you are at least running on SAS drives rather than 7200 rpm SATAs? If you're not then you really need to start looking at either using external storage (Openfiler, Drobo Elite or FreeNAS solutions) or perhaps adding some additional internal storage using an additional controller and tiering your storage as best you can. Unfortunately, if you're using the same disk subsystem across the board (for ESXi as well as all of the VMs) it's not a surprise that you're starting to have issues.
     
    Certifications: CNA | CNE | CCNA | MCP | MCP+I | MCSE NT4 | MCSA 2003 | Security+ | MCSA:S 2003 | MCSE:S 2003 | MCTS:SCCM 2007 | MCTS:Win 7 | MCITP:EDA7 | MCITP:SA | MCITP:EA | MCTS:Hyper-V | VCP 4 | ITIL v3 Foundation | VCP 5 DCV | VCP 5 Cloud | VCP6 NV | VCP6 DCV | VCAP 5.5 DCA
  16. Josiahb

    Josiahb Gigabyte Poster

    1,335
    40
    97
    They are 15k SAS drives so we're at least getting that right!

    RAID controller is hardware based (can't tell you which manufacturer or anything unfortunately) and there is plenty of room for more drives, another 6 bays in fact. I'll investigate IOMeter and see if I can get some more conclusive stats from the box.

    Oh, and it's got 3 Gigabit NICs, so network bandwidth shouldn't be an issue.

    I've always been a bit dubious of our setup, but at the time I didn't have the knowledge necessary to push it and change anything.
     
    Certifications: A+, Network+, MCDST, ACA – Mac Integration 10.10
  17. Josiahb

    Josiahb Gigabyte Poster

    1,335
    40
    97
    It's not quite no redundancy... we've got another identical server sat ready should the main one fail....

    Anyone who points out that we could have spent the big wodge of cash we have on these two huge lumps of tin far more wisely is preaching to the choir....
     
    Certifications: A+, Network+, MCDST, ACA – Mac Integration 10.10
  18. zebulebu

    zebulebu Terabyte Poster

    3,748
    330
    187
    Getting the disk subsystem right is the most crucial part of chucking any non-SAN based ESX/ESXi box together. Your problem now is going to be rebuilding from scratch with your production VMs on the server and no second box for redundancy.

    At a minimum, for a production box running its own storage you need a mirrored pair for ESXi itself (overkill considering the tiny footprint of ESXi, but you need the redundancy since you only have one host and no shared storage or vMotion). You also need at least a RAID5 array for the VMFS, and need to size the speed and number of disks in line with your max IOPS (you should have collected these in the planning stages of your implementation).
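
    Sizing the number of disks against a max IOPS figure can be sketched as below. This is only a back-of-envelope calculation: the target IOPS, read/write mix, per-disk IOPS and the RAID5 write penalty of 4 are assumed example values, and `disks_needed` is a hypothetical helper name.

```python
import math


def disks_needed(target_iops, read_fraction, write_penalty, iops_per_disk):
    """Spindles required to serve a front-end IOPS target for a given write penalty."""
    write_fraction = 1.0 - read_fraction
    # Reads cost one back-end I/O each; writes cost `write_penalty` each.
    backend_iops = target_iops * (read_fraction + write_fraction * write_penalty)
    # Round up: you cannot buy a fraction of a disk.
    return math.ceil(backend_iops / iops_per_disk)


if __name__ == "__main__":
    # Assumed: 600 peak IOPS, 70% reads, RAID5 (penalty 4), ~175 IOPS per disk.
    print(disks_needed(600, 0.7, 4, 175), "disks")
```

    The result ignores RAID-level minimums (RAID5 needs at least three disks), hot spares and controller cache, so treat it as a starting point for the planning exercise rather than a final answer.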

    By eerie coincidence, just this last week I finally got our VMFS at work off a 7.2k SATA shelf and onto a 15k SAS one. Budgets being what they are, when it was put in originally three years ago the company didn't have the capex to get a faster SAN in. Now it's been done, the performance increase has been awesome - and with Exchange it's been nothing short of miraculous.
     
    Certifications: A few
    WIP: None - f*** 'em
  19. Josiahb

    Josiahb Gigabyte Poster

    1,335
    40
    97
    Thanks for the feedback guys, it's all really helping me put this together. We've got a meeting with the outsourcing company tomorrow and I'm putting together some tough questions to get to the bottom of this.
     
    Certifications: A+, Network+, MCDST, ACA – Mac Integration 10.10
  20. Josiahb

    Josiahb Gigabyte Poster

    1,335
    40
    97
    Well.... a kind of victory has been achieved...

    We're going to throw a couple of SSDs in and shift the web app and databases to them while leaving TS on the original 15k drives. Not quite the solution I was aiming for but they do at least appear to have run the numbers on this one so I'll have to take their word for it.
     
    Certifications: A+, Network+, MCDST, ACA – Mac Integration 10.10
