Wednesday, September 12, 2007
Hard Drive Failure: Various Types
Recently I've been around a lot of failing hard drives and a lot of different modes of failure. Here's the story of four different hard drives that died. (Note: File System Failure isn't really hard drive failure but it's an interesting story anyway.)
File System Failure
A customer shipped back a Dell Precision workstation that had been running Windows XP SP2. This machine would sit on the factory floor and run our custom image analysis application, 24x7. It had started blue-screening on boot up.
We powered it up and it blue screened. The error was STOP 0x00000024 NTFS_FILE_SYSTEM. This meant that the ntfs.sys driver that Windows uses to access the NTFS file system had crashed. We decided to rule out physical bad sectors on the disk first.
We booted into the Dell Utility Partition. Kudos to Dell here. This is a very well-made GUI utility with full mouse support and an extensive array of tests for all the hardware on the machine. We ran all the low-level disk tests and they indicated that, at least physically, the disk was in great shape.
Then we popped in a Windows XP CD and booted up to the Recovery Console. Unfortunately, running chkdsk from the Recovery Console caused the same crash again! Ouch! Then we installed the drive as a slave on a different machine. Now that machine crashed on boot up too!
What was happening was this: the file system had become corrupted, and corrupted in such a way that it caused ntfs.sys, the file system driver, to crash every time it tried to access the hard drive.
Finally, we booted using a freshly-burned copy of Ubuntu 7.04. The NTFS drivers in Ubuntu worked like a charm and we were able to recover all the necessary data from the drive. Then we repartitioned and reformatted it and it worked fine from then on.
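If you end up in the same spot, the copy step can be scripted. Here's a minimal Python sketch (not what we actually ran), assuming the damaged volume is already mounted read-only under the live CD and that there's a second drive to copy to. The salvage_tree name is made up for illustration. It copies whatever is readable and records what isn't, instead of stopping at the first error.

```python
import os
import shutil
from pathlib import Path


def salvage_tree(src_root, dst_root):
    """Copy every readable file under src_root into dst_root.

    Returns (copied, failed): two lists of source paths as strings.
    Hypothetical helper for illustration, not a real recovery tool.
    """
    src_root = Path(src_root)
    dst_root = Path(dst_root)
    copied, failed = [], []

    def note_walk_error(err):
        # os.walk calls this when a directory cannot be listed.
        failed.append(str(err.filename))

    for dirpath, _dirnames, filenames in os.walk(src_root, onerror=note_walk_error):
        # Recreate the same directory layout on the destination.
        target_dir = dst_root / Path(dirpath).relative_to(src_root)
        target_dir.mkdir(parents=True, exist_ok=True)
        for name in filenames:
            src = Path(dirpath) / name
            try:
                shutil.copy2(src, target_dir / name)
                copied.append(str(src))
            except OSError:
                # Unreadable file: record it and keep going.
                failed.append(str(src))
    return copied, failed
```

You'd call it with something like salvage_tree("/mnt/damaged", "/media/usb/rescue"), where both paths are placeholders for wherever the two drives happen to be mounted, and then look through the failed list to see what didn't make it.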
Recoverable Disk Failure
This happened to a laptop at work. Like most disk failures, this too began with an ominous crash. On rebooting, the system was extremely slow and the drive started making noises. It was the dreaded click of death!
Unfortunately, this Toshiba laptop didn't come with any recovery utilities, and they wouldn't have helped anyway since they would have had to run from the same physical drive that was making the noises. Luckily, we have a copy of SpinRite at work. I booted into SpinRite and started a Level 2 recovery. For the longest time, the drive kept making the clicking sounds and the progress indicator moved very slowly indeed. And then just like that, after 4% the sounds disappeared and the progress indicator moved at a much faster speed. SpinRite finished its check and didn't report any problems. I booted back into Windows and backed up all my data. The hard drive didn't give me any more trouble but I decided to replace it with a newer Seagate drive anyway.
Unrecoverable Disk Failure
This happened to my own Compaq laptop. The system locked up while browsing. Thinking it was a typical Firefox hang, I forced a reboot. Sadly, my system never did reboot. I brought the laptop in to work and tried SpinRite again. No dice! The drive wasn't even recognized by the BIOS. This drive happened to be a Death Star and it clicked like one by this point. (Well, it was a TravelStar actually.)
After trying for a couple of hours, I decided to try some non-traditional remedies. I tried whacking the hard drive. Didn't work. I tried freezing the hard drive. Didn't work. Finally, I gave up and started over.
Unrecoverable Disk Failure with a twist
This happened today on a Dell Precision workstation that we use on our lab floor for prototyping. I came in to work in the morning and noticed that the machine had bluescreened with a failure in ftdisk.sys. I tried to reboot and the machine stalled, saying it couldn't find a hard drive. And it started making clicking sounds. (Yes, this was a DeskStar too.)
Learning from my past experiences, I decided to jump straight to SpinRite. I rebooted, pushed the CD tray button, popped in the CD and tried to press F12 to get to the boot menu. No dice, the machine returned with a keyboard failure. I tried rebooting a few times but the problem continued. Then I turned it off, removed all extraneous connections to the machine (cables going to custom frame-grabber boards etc.), changed the keyboard and mouse and rebooted again. Keyboard failure. This was a very special machine, wasn't it?
I then opened up the machine and removed the master drive which was failing. I put it in a different machine (an HP) as a master drive, popped in the SpinRite CD and tried again. Again, keyboard failure. There was something wrong with the circuitry on this drive that was causing the motherboard to return with a keyboard failure! At this point, I gave the bad news to my boss and basically set the drive aside and put all the open machines back together.
Then after lunch, on a whim, I decided to try again. But this time instead of pressing any buttons after rebooting, I just waited it out. And went to get a coffee. When I returned, the hard drive had stopped making noises and SpinRite had started. I ran a Level 2 scan and it finished with no problems. It's like the problem just hadn't happened at all. I told my boss the good news and backed up all the data onto a different hard drive. We're not taking any more chances with it. It gets replaced with a brand new drive tomorrow.
Lessons Learned
- Hard drives fail. You might just get lucky and get all your data back. Or you might not get anything back.
- Back up regularly. External USB drives are pretty cheap right now. And there's some pretty good free backup software available. Also, use hard drive manufacturer tools regularly to run self tests (e.g., SeaTools by Seagate, Drive Fitness Test by Hitachi and Data Lifeguard Tools by Western Digital). If your laptop BIOS and hard drive support it, then turn on SMART monitoring.
- Keep a copy of Ubuntu or some other live CDs and tools handy. You never know when Windows will decide to repeatedly reboot.
- If you are working with a lot of machines and a lot of drives, consider investing in SpinRite. For $89 it's a really good deal.
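On the "back up regularly" point: even without dedicated backup software, a small script run on a schedule goes a long way. Here's a rough Python sketch, assuming the backup target is just another mounted drive. The function names are invented for this example. It copies files that are new or changed, then re-reads each copy and compares SHA-256 hashes, so a bad copy gets noticed before you actually need it.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(src_root, dst_root):
    """Copy new or changed files from src_root to dst_root, then verify.

    Returns the relative paths whose copy does not match the source.
    Hypothetical helper for illustration only.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    mismatched = []
    for src in sorted(src_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(src_root)
        dst = dst_root / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        # Copy only when the backup is missing or differs from the source.
        if not dst.exists() or sha256_of(dst) != sha256_of(src):
            shutil.copy2(src, dst)
        # Re-read the copy and confirm it matches the source.
        if sha256_of(dst) != sha256_of(src):
            mismatched.append(str(rel))
    return mismatched
```

An empty list back means every file on the backup drive matched its source at the time of the run. This is deliberately simple: it hashes everything on every run and never deletes anything from the backup, which is slow but hard to get wrong.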
Labels: backup, failure, procrastination