Resolved SAN help

Discussion in 'Software' started by Theprof, Jan 25, 2011.

  1. Theprof

    Theprof Petabyte Poster

    4,607
    83
    211
    For the last two days we've been having really bad SAN issues. Here's the scoop:

    We have two paths from the Exchange server to the SAN. Last week one of the paths failed, but performance was still good. A colleague and I troubleshot the issue and figured it out. It ended up being an SFP module; we replaced it, rebooted the fiber switch, and all was good.

    However, since Monday the SAN has been having serious issues, and because the Exchange DB, SQL DB, and file server files are all on the SAN, all the users are affected.

    Last night we rebooted both fiber switches and the SAN controllers, to no avail. We've used perfmon to look at Avg. Disk Write Queue Length, and all the drives are reporting high maximums of 90-200. Normally it's nowhere near that high; the highest maximums we've seen are maybe 50.
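    To compare a perfmon capture against that normal baseline of roughly 50, a minimal sketch, assuming the counters were logged or exported to CSV in the layout typeperf/relog produce (timestamp in column 0, one counter path per remaining column, blank cells for missing samples). The server name, disk instances, and values in SAMPLE are made up, and the threshold of 50 simply mirrors the baseline mentioned above:

```python
import csv
import io

COUNTER = "Avg. Disk Write Queue Length"


def summarize_write_queues(csv_text, threshold=50.0):
    """Summarize every write-queue counter in a perfmon-style CSV export.

    Assumes column 0 is the timestamp, every other column is a counter
    path, and a missing sample is a blank or whitespace-only string.
    Returns {counter_path: (average, maximum, maximum > threshold)}.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Keep only the write-queue columns; skip column 0 (the timestamp).
    columns = {i: name for i, name in enumerate(header) if i > 0 and COUNTER in name}
    samples = {i: [] for i in columns}
    for row in reader:
        for i in columns:
            cell = row[i].strip() if i < len(row) else ""
            if cell:  # skip missing samples rather than counting them as 0
                samples[i].append(float(cell))
    summary = {}
    for i, name in columns.items():
        values = samples[i]
        if values:
            peak = max(values)
            summary[name] = (sum(values) / len(values), peak, peak > threshold)
    return summary


# Hypothetical export: server name, disk instances, and values are invented.
SAMPLE = r'''"(PDH-CSV 4.0)","\\EXCH01\PhysicalDisk(1 E:)\Avg. Disk Write Queue Length","\\EXCH01\PhysicalDisk(2 F:)\Avg. Disk Write Queue Length"
"01/24/2011 09:00:00.000","12.5","3.0"
"01/24/2011 09:00:15.000","180.0","4.0"
"01/24/2011 09:00:30.000"," ","5.0"
'''

if __name__ == "__main__":
    for name, (avg, peak, flagged) in summarize_write_queues(SAMPLE).items():
        print(f"{name}: avg={avg:.1f} max={peak:.1f}{'  <-- above baseline' if flagged else ''}")
```

    Looking at the average alongside the maximum helps tell a brief spike from a queue that stays deep all day.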

    The other issue is that the Exchange queue is backing up really high, around 700, which is obviously due to the SAN issues.

    We have an HP EVA400 SAN and we ended up calling tech support and sending them a bunch of log files.

    Any ideas as to what this could be? I am sure this could be many things but perhaps someone had similar experiences?
     
    Last edited: Jan 25, 2011
    Certifications: A+ | CCA | CCAA | Network+ | MCDST | MCSA | MCP (270, 271, 272, 290, 291) | MCTS (70-662, 70-663) | MCITP:EMA | VCA-DCV/Cloud/WM | VTSP | VCP5-DT | VCP5-DCV
    WIP: VCAP5-DCA/DCD | EMCCA
  2. LukeP

    LukeP Gigabyte Poster

    1,194
    41
    90
    Can you access any kind of diagnostics on the SAN directly and cross check against perfmon?

    Compare Average Disk Busy % against Avg. Disk Write Queue Length.
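    One rough way to read that comparison, sketched under stated assumptions: pair each perfmon write-queue sample from the server with the disk busy % the array itself reports for the same time slot. A deep host queue while the array's disks are busy points at the array; a deep host queue while the disks look idle suggests the I/O is stuck somewhere in between. The thresholds (50 and 80%) and the sample numbers are hypothetical, and this is a heuristic, not a diagnosis:

```python
from collections import Counter


def classify_sample(host_write_queue, array_busy_pct, queue_limit=50.0, busy_limit=80.0):
    """Classify one time slot by comparing the host view with the array view.

    host_write_queue: Avg. Disk Write Queue Length seen by the server (perfmon).
    array_busy_pct:   disk busy % from the array's own diagnostics.
    Both limits are hypothetical defaults; tune them to your own baseline.
    """
    if host_write_queue <= queue_limit:
        return "normal"
    if array_busy_pct >= busy_limit:
        # Deep host queue and busy disks: the array itself looks like the bottleneck.
        return "array saturated"
    # Deep host queue but idle-looking disks: I/O is waiting before it reaches them.
    return "queueing between host and disks (path, HBA, switch, controller)"


def classify_series(host_queues, array_busy):
    """Count classifications over paired samples taken at the same times."""
    return Counter(classify_sample(q, b) for q, b in zip(host_queues, array_busy))


if __name__ == "__main__":
    # Made-up paired samples: host queue lengths and array busy percentages.
    print(classify_series([12, 180, 150, 40], [35, 95, 20, 90]))
```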

    You mention you have multiple paths to the SAN configured. Is the multipath software working OK? Can you run some tests after hours with one path disabled, so there's only one route to the SAN?

    I do feel you're having network issues and not issues with the SAN itself. VLANs? Jumbo MTU? QoS?

    Good luck and let us know what it was.
     
    WIP: Uhmm... not sure
  3. Theprof

    Theprof Petabyte Poster

    We've tried with only one path and the issues are the same. Actually, we've even isolated Exchange to its own controller and moved the SQL and file server stuff to another controller.

    I don't think it's a network issue, because network speed seems normal, with no high latencies. We looked at the switches and routers, and even rebooted one of the routers, and found no issues there. Exchange works a lot better after hours. It definitely seems like a SAN issue, but I am not 100 percent certain.

    Thanks for the help Luke, much appreciated.
     
  4. LukeP

    LukeP Gigabyte Poster

    So basically what you're saying is that out of nowhere the SAN has become really sluggish for no obvious reason?

    Is it possible that one of the services (SQL, Exchange, whatever else uses it) is eating SAN I/O? Can you switch off, move, or fail over any of them to narrow it down? Or do you think it might be a hardware problem? Any hardware problem should be visible through the SAN management software/web interface.

    You should be able to cross check diagnostics from SAN against the perfmon disk busy ratio. Have you tried that?

    How many RAID groups do you have? Does this problem happen on different RAID groups or just one?

    And last silly one :biggrin:
    Array not rebuilding, no? :tongue
     
  5. Theprof

    Theprof Petabyte Poster

    That's pretty much what happened, out of nowhere. One thing to note is that my colleague added 6 extra drives to the SAN early last week. There were no issues for the rest of the week, and then all of a sudden, on Monday morning, the issues started. Our next step will be to pull the drives out one by one and see if the issues go away. The thing is, it can be time-consuming, because once we pull a drive we'll need to wait until the remaining disks level out.
     
  6. Theprof

    Theprof Petabyte Poster

    In the end it was a bad drive!!! ... The funny thing is, the SAN interface that displays drive configs/information showed everything as good. We were able to see the bad drive by pulling the logs.
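    Why a drive that the UI shows as good can still hurt everything: on an EVA, each virtual disk is spread across all the members of its disk group, so one degraded-but-not-failed drive can slow down every LUN on that group. A minimal sketch of spotting such a drive, assuming you can get per-disk latency samples out of the array; the disk labels, the numbers, and the 3x factor are all hypothetical. It compares each disk to the group median rather than the mean, so the slow disk cannot inflate its own baseline:

```python
from statistics import median


def find_slow_disks(latency_ms, factor=3.0):
    """Return the disks whose mean latency exceeds factor x the group median.

    latency_ms maps a (hypothetical) disk label to its latency samples in
    milliseconds. Disks with no samples are ignored.
    """
    means = {disk: sum(v) / len(v) for disk, v in latency_ms.items() if v}
    if not means:
        return []
    baseline = median(means.values())
    return sorted(disk for disk, m in means.items() if m > factor * baseline)


if __name__ == "__main__":
    # Made-up figures: three healthy disks and one that is slow but not failed.
    group = {
        "disk01": [8, 9, 7],
        "disk02": [9, 8, 10],
        "disk03": [7, 8, 9],
        "disk07": [95, 120, 140],
    }
    print(find_slow_disks(group))  # ['disk07']
```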
     
  7. zebulebu

    zebulebu Terabyte Poster

    3,748
    330
    187
    Sounds like a pretty crappy SAN. Who is the vendor?
     
    Certifications: A few
    WIP: None - f*** 'em
  8. Theprof

    Theprof Petabyte Poster

  9. Sparky
    Highly Decorated Member Award 500 Likes Award

    Sparky Zettabyte Poster Moderator

    10,718
    543
    364
    Always start with the basic troubleshooting, "Has anything changed on the network?" - "Yes"

    Only kidding mate, glad you got it sorted! :biggrin
     
    Certifications: MSc MCSE MCSA:M MCSA:S MCITP:EA MCTS(x5) MS-900 AZ-900 Security+ Network+ A+
    WIP: Microsoft Certs
  10. Theprof

    Theprof Petabyte Poster

    LOL, well, it does seem obvious. However, the SAN software never reported an error about a bad drive. We did consider it, and even looked through the logs quickly for bad drive issues, but could not find anything. The only way we figured out it was a drive was by doing a huge export from EVA Command View and sending the binary logs to the HP SAN guys; they have readers that can translate the logs and tell us exactly which drive is causing the issue.

    What do you think of that?

    On the positive side, I learned a lot about SANs, although not enough to consider myself a SAN admin!
     
  11. Sparky

    Sparky Zettabyte Poster Moderator

    All in all, a good fix. Always handy to have that level of support when you need it.

    Either that or try and read a binary log! :biggrin
     
  12. Theprof

    Theprof Petabyte Poster

    Might as well use the support, or else why pay them :twisted: In all honesty though, I am not very comfortable with SANs in general. I am nowhere near an expert, but at least there's always someone who can help. As the saying goes, if you don't know the answer, know where to get the answer. Ask me to troubleshoot Exchange, AD, DNS, DHCP, Terminal Services, etc. and I have no issues, but SAN is out of my league at the moment. I will be learning it down the line when the time is right.
     
    Last edited: Jan 25, 2011
  13. zebulebu

    zebulebu Terabyte Poster

    Yep. Crap. :biggrin

    Use this incident to get your management to invest in a real SAN.
     
  14. Sparky

    Sparky Zettabyte Poster Moderator

    There is always a point when you are not too comfortable working with a new technology or product. At least you have someone to ask, I just have Google! :biggrin
     
  15. Theprof

    Theprof Petabyte Poster

    Good point, but you always wish it was you who solved the issue!! :twisted: Gotta give credit where it's due, though!
     
  16. Theprof

    Theprof Petabyte Poster

    Funny you should say that, my boss really likes HP SANs, plus they're not very expensive!!

    I have to admit that in the four years we've had that SAN, this is the first time something major like this has happened, aside from little things like battery replacements, etc...

    Our Exchange was so screwed up that we had 3,000 emails in the local delivery queue waiting to be released, with 4-hour email delays!!! What a day. I don't think I've ever felt stress in IT like I have the last two days...
     
