Replacing a RAID array on a PowerEdge 2800 server

Discussion in 'Hardware' started by ade1982, Jan 8, 2013.

  1. ade1982

    ade1982 Megabyte Poster

    566
    12
    52
    Hi guys,

    I have a Dell PowerEdge 2800 server (still our main server) with two RAID arrays (RAID 1 for the OS and programs, RAID 5 for data).

    One of the drives on the RAID-1 array is blinking orange, which according to the manual tells me it is about to fail. The RAID card is a PERC 4e/Di. The drives are Ultra320 SCSI 15k 36GB jobbies. I am reasonably sure the equipment is hot-swap.

    What I want to know (and I am pretty nervous about this) is: can I literally just yank the drive out, replace it with the identical spare drive I have ordered, and leave it to merrily rebuild the array from the other working drive? Will it wipe the data from both drives (meaning a reinstall of Server 2003), or just copy the existing data over as the array rebuilds?

    How can I watch it rebuild? From the Dell OpenManage software? Do I need to do anything from within there to get it to go?

    Is the system usable while the rebuild takes place, or is it out of action until it finishes?

    Any pitfalls to look out for?

    As I say, I am pretty nervous about doing it, as I've not done it before and I have no one here to consult, but it needs to be done. It's also why I am still awake at gone 12, and sweating profusely!
     
    Last edited: Jan 8, 2013
  2. SimonD
    Honorary Member

    SimonD Terabyte Poster

    3,681
    440
    199
    If I were you I would arrange to do this out of hours: power down the server, remove the failed drive (it's IMPORTANT that you remove the correct drive, otherwise you can bork your server OS), replace it with the new drive and power the server back on. It 'should' then start rebuilding the array.

    You should be able to monitor the array through the OMSA software.
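
    If you would rather poll the status from a script than click through the GUI, here is a rough Python sketch. It assumes the OMSA command-line tools are installed and that `omreport storage vdisk controller=0` prints `Key : Value` lines. The controller ID of 0 and the sample text are assumptions for illustration, not output captured from this server, and the names `parse_report`, `needs_attention` and `read_vdisks` are made up for the example.

```python
import subprocess

# Hypothetical sample in the "Key : Value" layout omreport uses;
# not captured from the server in this thread.
SAMPLE = """\
Controller PERC 4e/Di (Embedded)
ID                  : 0
Status              : Non-Critical
Name                : Virtual Disk 0
State               : Degraded
Layout              : RAID-1

ID                  : 1
Status              : Ok
Name                : Virtual Disk 1
State               : Ready
Layout              : RAID-5
"""


def parse_report(text):
    """Return one dict per disk; each "ID" line starts a new record."""
    records = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # headings and blank lines carry no "Key : Value" pair
        key, value = key.strip(), value.strip()
        if key == "ID":
            records.append({})
        if records:
            records[-1][key] = value
    return records


def needs_attention(records, healthy=("Ready", "Online")):
    """Return the records whose State is not one of the healthy values."""
    return [r for r in records if r.get("State") not in healthy]


def read_vdisks(controller=0):
    """Run omreport for one controller (ID 0 is an assumption) and parse it."""
    result = subprocess.run(
        ["omreport", "storage", "vdisk", f"controller={controller}"],
        capture_output=True, text=True, check=True,
    )
    return parse_report(result.stdout)


if __name__ == "__main__":
    # Uses the canned sample; swap in read_vdisks() on a box with OMSA installed.
    for disk in needs_attention(parse_report(SAMPLE)):
        print(disk["Name"], "is", disk["State"])
```

    Anything it lists is worth a closer look in OMSA itself. A disk that is mid-rebuild would also be listed, since its state is neither "Ready" nor "Online".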

    The system should be 100% usable during the rebuild process and there should be no impact to you or the users during this time (it would be different if you were rebuilding the RAID 5 data array but as this is the OS then you should be good).

    One thing to consider: now that one drive has failed, be prepared for the second drive to fail within the next 6 months or so. Of course, you may find that everything works fine until you replace the server, but Mr Murphy always likes to throw a spanner in the works.
     
    Certifications: CNA | CNE | CCNA | MCP | MCP+I | MCSE NT4 | MCSA 2003 | Security+ | MCSA:S 2003 | MCSE:S 2003 | MCTS:SCCM 2007 | MCTS:Win 7 | MCITP:EDA7 | MCITP:SA | MCITP:EA | MCTS:Hyper-V | VCP 4 | ITIL v3 Foundation | VCP 5 DCV | VCP 5 Cloud | VCP6 NV | VCP6 DCV | VCAP 5.5 DCA
    dales likes this.
  3. ade1982

    ade1982 Megabyte Poster

    566
    12
    52
    Thanks Simon

    I will try to schedule it for Friday evening. The drive that has been shown as degraded is the one that is flashing orange, as I have just done the "blink" on it.

    Is it necessary to power down the server, as surely that's the point of hot plugging? If it's safer, I would rather do it, but I also value my Friday nights (as it's the only time I get with my girlfriend)!
     
  4. dales

    dales Terabyte Poster

    2,005
    51
    142
    Agreed. Whilst you can hot remove/add a drive, on occasion (very, very rarely) an issue may cause the other drive to stop working as well. TBH I normally yank the drives out with the system up and working, and if you wanted you could unplug the original failing drive and plug it back in again to see if it clears the error. Because they are mirrored, it's fairly safe to assume, as Simon says, that both drives have roughly the same life expectancy, so get the faulty one changed out quickly and make sure your backups are working and recoverable too.
     
    Certifications: vExpert 2014+2015+2016,VCP-DT,CCE-V, CCE-AD, CCP-AD, CCEE, CCAA XenApp, CCA Netscaler, XenApp 6.5, XenDesktop 5 & Xenserver 6,VCP3+5,VTSP,MCSA MCDST MCP A+ ITIL F
    WIP: Nothing
  5. SimonD
    Honorary Member

    SimonD Terabyte Poster

    3,681
    440
    199
    Yes, that's the idea of hot swapping, but you know Mr Murphy. I would say it depends on how comfortable you feel doing it. Theoretically there's nothing stopping you replacing the drive now, but (and this is why I suggested doing it out of hours) what happens if something does go wrong? Instead of no one being impacted because the work is carried out during a quiet period, you're now having to resolve the issue while everyone in the office is impacted.

    It's times like this that you're thankful for peer review and CABs suggesting that work be carried out during times of least impact.

    What I would say is that the whole process of rebooting the server shouldn't take too long; what could take a long time is the troubleshooting if something goes wrong. If I were in your shoes I would power down, swap the disks, power up, monitor OMSA to ensure that the array is rebuilding, and then come in early Monday morning to check that it has completed successfully (assuming it's finished in that time). All told, you may well find you have to work an extra 15 minutes or so on the Friday.
     
    ade1982 likes this.
  6. ade1982

    ade1982 Megabyte Poster

    566
    12
    52
    That's great. I think I will do that, and I am very thankful for your advice.

    As I am temporarily covering reception, the hard drive has literally just arrived.
     
  7. Sparky
    Highly Decorated Member Award 500 Likes Award

    Sparky Zettabyte Poster Moderator

    10,718
    543
    364
    As said, if you pull the drive out and plug it back in, the RAID may sync OK; this does happen on occasion.

    I take it the drive you have ordered is new and not refurbed?
     
    Certifications: MSc MCSE MCSA:M MCSA:S MCITP:EA MCTS(x5) MS-900 AZ-900 Security+ Network+ A+
    WIP: Microsoft Certs
  8. ade1982

    ade1982 Megabyte Poster

    566
    12
    52
    Yep, we had it replaced under our maintenance contract with Smartpac (check them out, they are dirt cheap and have given great support both times I have needed them). I dare say the new part cost more than we pay them. It arrived direct from Fujitsu.
     
  9. SimonD
    Honorary Member

    SimonD Terabyte Poster

    3,681
    440
    199
    Fujitsu??
     
  10. ade1982

    ade1982 Megabyte Poster

    566
    12
    52
    That's what it says!

    Actually, looking at it again it was not direct from Fujitsu, but from a reseller they must use for parts. It's definitely a Fujitsu drive though.
     
  11. ade1982

    ade1982 Megabyte Poster

    566
    12
    52
    Well, it didn't go well.

    After work, I powered the server off, replaced the hard drive and powered it back on. On boot it said something about a mismatch between NVRAM and the disk, and to either press Ctrl+M or press any key ... I pressed a key and it said "diskette not found, reboot" ... so I am presuming it was trying to boot off the blank disk.

    Had a potter around in the Ctrl+M utility thing, but not really doing much more than looking, cos I basically didn't want to feck it up.

    Anyway, I decided to put the failing drive back in and booted off that. While it was up, I pulled the dodgy drive and put the new one in its place, then opened OMSA to find that the RAID-5 array was doing a "background consistency check". The disk I replaced was showing as degraded.

    As I was waiting for the RAID-5 array to finish, I downloaded a new version of OMSA, and once the check had finished I installed the new version. Anyway, long and short of it ... at no point did I see the RAID-1 array rebuilding, but after installing this new version both disks in the RAID-1 array are online (and the disk I replaced is not showing as degraded) and the server is up and running.

    I've not tried rebooting yet ... So what should I have done from Ctrl+M, and am I being a bit premature in thinking the problem is gone? Or is it actually solved?

    Also: I performed a consistency check on the RAID-1 array and it turned out fine (or at least there were no reported errors).
     
    Last edited: Jan 12, 2013
  12. Sparky
    Highly Decorated Member Award 500 Likes Award

    Sparky Zettabyte Poster Moderator

    10,718
    543
    364
    No reason to power the server off to change the drive :)

    What is the status of the RAID 1 and RAID 5 arrays in Server admin?
     
  13. ade1982

    ade1982 Megabyte Poster

    566
    12
    52
    As I say, I never once saw the array rebuilding. The disk I replaced was still coming up as "degraded", but after I installed a newer version of OMSA it shows everything ticked green and online. A consistency check on the RAID-1 array didn't pull up anything. The RAID-5 array has always shown as online.

    Reading around a bit, it seems what reinstalling the software did was almost like a global rescan of the drives, but I don't really trust myself anymore!
     
  14. dales

    dales Terabyte Poster

    2,005
    51
    142
    The array will show as degraded whilst it is rebuilding; that's quite normal. Normally there is also a blink pattern on the drives to tell you what it's doing. Even if it's degraded, a steady blink, for example, might show that it's rebuilding.
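
    To picture why the array stays "degraded" until the very end, and why the existing data is not wiped, here is a toy Python model of a mirror rebuild. It is only an illustration of the idea, not how the PERC firmware actually works; the block contents and the `rebuild` function are made up for the example.

```python
def rebuild(survivor, replacement):
    """Copy each block from the surviving mirror member onto the replacement.

    Yields (state, percent_done) after every block. The survivor is only
    read, never written, which is why the existing data is preserved.
    """
    total = len(survivor)
    for index, block in enumerate(survivor):
        replacement[index] = block
        done = index + 1
        state = "Online" if done == total else "Degraded (rebuilding)"
        yield state, 100 * done // total


survivor = ["boot", "os", "apps", "pagefile"]   # made-up block contents
replacement = [None] * len(survivor)            # blank new disk
progress = list(rebuild(survivor, replacement))

for state, percent in progress:
    print(f"{percent:3d}%  {state}")
```

    The copy only ever goes from the good disk to the new one, and the state only flips to online once the last block has landed.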
     
  15. Sparky
    Highly Decorated Member Award 500 Likes Award

    Sparky Zettabyte Poster Moderator

    10,718
    543
    364
    Yup, check the drives again when you get the chance and see what the LEDs are doing. Hope they are all green :)
     
  16. ade1982

    ade1982 Megabyte Poster

    566
    12
    52
    Presumably the biggest test will be when I reboot the server to see if it comes back up. I believe the hard drive light was green when I left work on Friday. I can log in through the VPN to see statuses now, but obviously not blink patterns. I am now at home, and the guy who has to open up is nowhere to be seen today should I wish to do it.

    I don't really want to reboot the server until I am sure that I won't get a "diskette seek failure, reboot the computer" error when I boot it. If that does happen, I presume it is a case of picking the other hard drive in that RAID-1 array to boot off? Or could I even unseat the replaced hard drive and boot off the other one without it throwing a wobbly, just in case the server doesn't come back up?

    Appreciating all the help.
     
  17. Cunningfox

    Cunningfox Byte Poster

    219
    6
    27
    It'll be fine (famous last words). I've replaced a ton of these disks in Dell 2850s; I'm actually amazed you managed to get a 36GB one tbh :P.

    I'd always do it hot, ensuring a full backup is taken just in case. I've had to use it though. It'll probably go green (online), then lights out (degraded), then blinking green while rebuilding the set. Leave it overnight and next morning golden.

    The error on boot was almost certainly the server picking up the new disk. I think the older Dell boxes got a little fussy sometimes on boot, although the 2950s were sweet. Continuing would have just resulted in the rebuild anyway.

    You probably should be aware that some Dell servers had an issue where the disks (and batteries) would always go to predicted failure at a certain age regardless of the disk (or battery) state. They fixed it with firmware updates I believe. It's normally worth doing a check with Dell support (if covered) to save stress.
     
    Certifications: CCNP, CCNA, MCP
    WIP: ??
  18. Sparky
    Highly Decorated Member Award 500 Likes Award

    Sparky Zettabyte Poster Moderator

    10,718
    543
    364
    If there is a RAID issue when booting up you will probably get a message like "2 Virtual Disks detected - 1 degraded" and then the server will boot up.
     
