How should I burn in hard drives?


Solution 1

How important is it to burn in a hard drive before you start using it?

If you have a good backup and good high-availability systems, then not very much, since restoring from a failure should be pretty easy.

How do you implement a burn-in process? What software do you use to burn in drives? How much stress is too much for a burn-in process?

I typically run badblocks against a drive or new system when I get it, and again whenever I resurrect a computer from the spares pile. A command like this (badblocks -c 2048 -sw /dev/sde) writes to every block four times, each pass with a different pattern (0xaa, 0x55, 0xff, 0x00). This test does not exercise lots of random reads/writes, but it does prove that every block can be written to and read back.
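
If you want a log of anything badblocks finds, a slightly expanded version of that command might look like the sketch below (assuming /dev/sde is the disk under test and holds nothing you care about, since -w is destructive):

    # Destructive write/read test: four patterns written to and verified on every block.
    badblocks -wsv -c 2048 -o /tmp/sde-badblocks.log /dev/sde
    # An empty log file means no bad blocks were reported; anything else is grounds for an RMA.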

You could also run bonnie++ or iometer, which are benchmarking tools; they will stress your drives a bit. Drives shouldn't fail even if you try to max them out, so you might as well see what they can do. I do not do this myself, but getting an I/O benchmark of your storage system at install/setup time can be very useful later when you are looking at performance issues.
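
As a rough illustration, a baseline benchmark run could be as simple as the following (the mount point and the unprivileged user are placeholders; bonnie++ refuses to run as root without -u):

    # Benchmark the filesystem mounted at /mnt/newdisk and keep the output for later comparison.
    bonnie++ -d /mnt/newdisk -u nobody | tee /root/newdisk-baseline.txt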

How long do you burn in a hard drive?

A single run of badblocks is enough in my opinion, but I have a very strong backup system and my HA needs are not that high; I can afford some downtime to restore service on most of the systems I support. If you are so worried that you think a multi-pass setup is required, then you probably should have RAID, good backups, and a good HA setup anyway.

If I am in a rush, I may skip a burn-in; my backups and RAID should be fine.

Solution 2

IMNSHO, you shouldn't be relying on a burn-in process to weed out bad drives and "protect" your data. Developing and implementing the procedure will take time that could be better used elsewhere, and even if a drive passes burn-in, it may still fail months later.

You should be using RAID and backups to protect your data. Once that is in place, let it worry about the drives. Good RAID controllers and storage subsystems will have 'scrubbing' processes that go over the data every so often and ensure everything is good.
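
On Linux software RAID, for example, a manual scrub can be kicked off roughly like this (assuming the array is /dev/md0; many distributions already schedule this monthly):

    # Ask md to read and verify every sector of the array.
    echo check > /sys/block/md0/md/sync_action
    # Watch progress and, once finished, check the mismatch counter.
    cat /proc/mdstat
    cat /sys/block/md0/md/mismatch_cnt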

Once all that is taken care of, there's no need to do your own disk scrubbing or burn-in, though as others have mentioned it doesn't hurt to do a system load test to ensure that everything is working as you expect. I wouldn't worry about individual disks at all.


As has been mentioned in the comments, it doesn't make a lot of sense to use hard drives for your particular use case. Shipping them around is far more likely to cause data errors that were not there when you did the burn-in.

Tape media is designed to be shipped around. You can get 250 MB/s (or up to 650 MB/s compressed) with a single IBM TS1140 drive, which should be faster than your hard drive. It is bigger as well: a single cartridge can hold up to 4 TB (uncompressed).

If you don't want to use tape, use SSDs. They can be treated far rougher than HDDs and satisfy all the requirements you've given so far.


After all that, here are my answers to your questions:

  • How important is it to burn in a hard drive before you start using it?
    Not at all.
  • How do you implement a burn-in process?
    • How long do you burn in a hard drive?
      One or two runs.
    • What software do you use to burn in drives?
      A simple run of, say, shred and badblocks will do. Check the SMART data afterwards (a rough sketch follows this list).
  • How much stress is too much for a burn-in process?
    No stress is too much. You should be able to throw anything at a disk without it blowing up.
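
To illustrate the software point above, a minimal burn-in pass might look like the following sketch (assuming /dev/sdX is a new, empty drive and smartmontools is installed):

    # Overwrite the whole drive once with random data.
    shred -v -n 1 /dev/sdX
    # Full write/read pattern test of every block.
    badblocks -wsv /dev/sdX
    # Kick off a long SMART self-test (runs in the drive's background),
    # then review the attributes and self-test log once it finishes.
    smartctl -t long /dev/sdX
    smartctl -a /dev/sdX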

Solution 3

Given your clarification, it doesn't sound like any burn-in process would be of much use to you. Drives fail primarily because of mechanical factors, usually heat and vibration, not because of any sort of hidden time bomb. A "burn-in" process tests the installation environment as much as anything else; once you move the thing, you're back to where you started.

But here are a few pointers that might help you:

Laptop drives are usually designed to withstand more jostling and vibration than desktop drives. My friends who work in data-recovery shops always ship data to clients on laptop drives for that reason. I've never tested this myself, but it seems to be "common knowledge" in select industries.

Flash drives (e.g. USB thumb drives) are about the most shock-resistant medium you'll find. You are even less likely to lose data in transit if you use flash media.

If you ship a Winchester drive, do a surface scan before putting it into use. Or better yet, just don't put it into use. Instead, you may want to designate certain drives as "shipping" drives, which see all the abuse but which you don't rely on for data integrity (i.e. copy data onto the drive for shipping, copy it off after shipping, verify checksums on both sides, that kind of thing).
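
For the checksum step, something along these lines works (the mount points are placeholders; ship the manifest separately from the drive):

    # Before shipping: record a checksum for every file on the drive.
    cd /mnt/shipdrive && find . -type f -exec sha256sum {} + > /tmp/manifest.sha256
    # After shipping: copy the data off, then verify it against the manifest.
    cd /mnt/received && sha256sum -c /path/to/manifest.sha256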

Solution 4

I disagree with all the answers that basically say "Don't bother with burn-in, have good backups".

While you should always have backups, I spent 9 hours yesterday (on top of my usual 10-hour shift) restoring from backups because the system was running with drives that hadn't been burned in.

There were 6 drives in a RAIDZ2 config (ZFS equivalent to RAID-6) and we had 3 drives die over the course of 18 hours on a box that had been running for approximately 45 days.

The best solution I've found is to purchase drives from one particular manufacturer (don't mix-and-match), then run their provided tool for exercising the drives.

In our case we buy Western Digital and use their DOS-based drive diagnostics from a bootable ISO. We fire it up, run the option to write random garbage to the entire disk, then run the short SMART test followed by the long SMART test. That's usually enough to weed out all the bad sectors, read/write reallocations, etc...

I'm still trying to find a decent way to 'batch' it so I can run it against 8 drives at a time. Might just use 'dd if=/dev/urandom of=/dev/whatever' in Linux or 'badblocks'.
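
As a rough sketch of that kind of batching (assuming the targets are /dev/sdb through /dev/sdi and none of them hold data you want to keep):

    # Run a destructive badblocks pass on each drive in parallel, one log per drive.
    for d in /dev/sd{b..i}; do
        badblocks -wsv -o "/tmp/$(basename "$d").log" "$d" &
    done
    wait
    # Any non-empty log file points at a drive with bad sectors.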

EDIT: I found a nicer way to 'batch' it. I finally got around to setting up a PXE boot server on our network to address a particular need, and noticed that the Ultimate Boot CD can be PXE booted. We now have a handful of junk machines sitting around that can be PXE booted to run drive diagnostics.

Solution 5

Your process is wrong: you should be using RAID arrays. Where I work we have built ruggedized RAID arrays that are designed to be transported around. It's not rocket science.

Shock-mounting the drives in oversize enclosures with big rubber vibration isolators will improve reliability hugely. (Seagate Constellation ES drives, as an example, are rated for 300 G shock but only 2 G vibration, non-operating, so the shipping case needs to isolate the drive from vibration. http://www.novibes.com/Products&productID=62 or http://www.novibes.com/Products&productId=49 [part #50178])


However, if you really want to burn-in test hard drives, here it goes.

I've worked on systems like hard drives, and burn-in did find some problems, but...

For accelerated lifecycle testing of PCBs to bring out faults, nothing beats some hot/cold cycles. (Operating hot/cold cycles work even better, but they are harder for you to do, especially with banks of HDDs.)

Get yourself an environmental chamber big enough for the number of drives you acquire at a time. (These are pretty expensive; it'd be cheaper to ship RAID arrays around.) You can't skimp on the test chamber: you will need humidity control and programmable ramps.

Program in two repeating temperature ramps, down to the minimum storage temperature and up to the maximum storage temperature, and make the ramps steep enough to upset the application engineer from your hard drive manufacturer. Three cold/hot cycles in 12 hours should see the drives failing pretty quickly. Run the drives for at least 12 hours like this; if any work afterwards, I'll be surprised.

I didn't think this up: at one place I worked, a production engineer did this to get more products shipped with the same test equipment. There was a huge surge in faults during testing, but the dead-on-arrival rate dropped to practically zero.


Comments

  • Phil
    Phil almost 2 years

    Google did a very thorough study on hard drive failures which found that a significant portion of hard drives fail within the first 3 months of heavy usage.

    My coworkers and I are thinking we could implement a burn-in process for all our new hard drives that could potentially save us some heartache from losing time on new, untested drives. But before we implement a burn-in process, we would like to get some insight from others who are more experienced:

    • How important is it to burn in a hard drive before you start using it?
    • How do you implement a burn-in process?
      • How long do you burn in a hard drive?
      • What software do you use to burn in drives?
    • How much stress is too much for a burn-in process?

    EDIT: Due to the nature of the business, RAIDs are impossible to use most of the time. We have to rely on single drives that get mailed across the nation quite frequently. We back up drives as soon as we can, but we still encounter failure here and there before we get an opportunity to back up data.

    UPDATE

    My company has implemented a burn-in process for a while now, and it has proven to be extremely useful. We immediately burn in all new drives that we get in stock, allowing us to find many errors before the warranty expires and before installing them into new computer systems. It has also proven useful to verify that a drive has gone bad. When one of our computers starts encountering errors and a hard drive is the main suspect, we'll rerun the burn-in process on that drive and look at any errors to make sure the drive actually was the problem before starting the RMA process or throwing it in the trash.

    Our burn-in process is simple. We have a designated Ubuntu system with lots of SATA ports, and we run badblocks in read/write mode with 4 passes on each drive. To simplify things, we wrote a script that prints a "DATA WILL BE DELETED FROM ALL YOUR DRIVES" warning and then runs badblocks on every drive except the system drive.
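
    The script itself is not shown here, but a minimal sketch of the idea might look like this (the system drive is assumed to be /dev/sda; everything else matching /dev/sd[b-z] gets wiped):

        #!/bin/bash
        # Hypothetical sketch only -- not the actual script described above.
        # badblocks -w writes and verifies four patterns (0xaa, 0x55, 0xff, 0x00), i.e. four passes.
        echo "DATA WILL BE DELETED FROM ALL YOUR DRIVES (system drive /dev/sda is skipped)"
        read -p "Type YES to continue: " answer
        [ "$answer" = "YES" ] || exit 1
        for d in /dev/sd[b-z]; do
            [ -b "$d" ] || continue   # skip names that are not real block devices
            badblocks -wsv -o "/root/burnin-$(basename "$d").log" "$d" &
        done
        wait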

    • Admin
      Admin almost 13 years
      Curiously, why do you mail the drives across the country? It sounds like you're more likely to run into drive failure as the result of shock failure and general mishandling by the postal service than by reasons a burn-in would detect.
    • Admin
      Admin almost 13 years
      Do these drives end up in servers or desktops?
    • Admin
      Admin almost 13 years
      Why not a pair of very high-speed Internet connections and a data duplication tool (e.g. rsync)? If you need security, SSH will do. And why not a set of 128 GB quad-layer Blu-ray discs or SSDs? I'm just thinking that you may be using a non-optimal process in the first place if you really need to mail hard disks frequently; I can't think of a valid use case where mailing hard disks is the optimal solution. And Google's study would be misleading in your use case, since it did not involve cases where drives are mailed around frequently.
    • Admin
      Admin almost 13 years
      @Lie Ryan: Mailing storage hardware is still the fastest way to transfer data. The internet is very slow, and if you need to pass through institutional networks and firewalls, it gets even worse.
    • Admin
      Admin almost 13 years
      @Jonas: Using a pair of 22 Mbit/s business connections (2 × $100/month), you can transfer a total of 7 TB/month. Sending the same amount of data through UPS would cost 7 × $60 = $420 (one-day delivery, $60 for a 0.5 kg package, 1 TB hard disks). And you have practically zero latency between data generation and data received, and zero worry about damaged hard disks.
    • Admin
      Admin almost 13 years
      @Lie Ryan: Exactly. Even with a really fast internet connection, it takes you a month to deliver 7 TB (and that's if you have the luxury of having control over your and the recipient's throughput). With UPS, the 7 TB are there the next day.
    • Admin
      Admin almost 13 years
      @Jonas: with a fast Internet connection, your data starts arriving instantly, and if you can use rsync, it can often deflate that 7 TB transfer to a few gigabytes. The only time you cannot use rsync is if you're generating 7 TB of new and unique data daily, and you will probably only be capable of generating that much new and unique data if you work at something like the Large Hadron Collider (although the people at the LHC have a freaking fast internet connection as well; scientists around the world are just too impatient to wait for a next-day delivery).
    • Admin
      Admin almost 13 years
      @Jonas: moreover, if you really were generating 7 TB of data daily, then you could use the $360/month 100 MB/s connection (bandwidth ~7 TB/day), and the price difference wouldn't really matter, since the computer and other equipment that can record and process all that data would probably cost you much more than your Internet connection fee.
    • Admin
      Admin almost 13 years
      That's nice and everything, except when you have mobile teams out on the road collecting massive amounts of data. Meaning you have to rely on hotel or mobile internet connections, and you are never in one spot for more than 8 hours.
    • Admin
      Admin over 10 years
      I'd like to add that 3Ware does recommend doing a 48-hour burn-in (see the top of the PDF). Also, the OP could mail both component drives of a software RAID around; those are easily reassembled into an array in another computer.
  • Rob Moir
    Rob Moir almost 13 years
    My only regret about this post is that I can only upvote it once. @Phil, you're re-inventing the wheel there. The way to ensure you don't lose data to random drive (or other, related hardware) failures are backups and RAID arrays.
  • jsmith
    jsmith almost 13 years
    This answer has the highest votes, yet has failed to answer the question the most. The OP has stated that RAIDs are impossible. If "single drives are getting mailed across the nation", then a RAID cannot be built. A backup process is in place, but it seems that the OP wants anything and everything that can help save time in case a drive goes bad. (Note: I work at the same company as the OP, so I know the situation; the data gets copied to a RAID once it is delivered.) If you HAD to burn in drives, hypothetically, how would you?
  • jsmith
    jsmith almost 13 years
    What if the drive cannot be monitored because it is not in one constant place? :)
  • pfo
    pfo almost 13 years
    I can only confirm Google's statement; I have never had a 100+ disk array that did not have at least one or two drives die during the first months. A proper RAID design with global hot spares for every type of disk used in the array deals with that. Good enterprise arrays have call-home support so that dead drives get out of your array ASAP.
  • jsmith
    jsmith almost 13 years
    @MikeyB I agree, I'm just trying to provide that information so that you can provide a better answer. Tapes cannot write data as fast as needed, so they are not an option. The drives need to be at least 10k rpm to have the correct functionality. I agree in a perfect world he could send two drives, one to do a nightly backup. And in a perfect world, this decision would be in the OP's hands. It's just not what is needed. "How should I burn in hard drives?" is the question. :)
  • Steve Townsend
    Steve Townsend almost 13 years
    Tapes can not write data as fast as hard drives? 250MBps (uncompressed) isn't good enough for you? I would say that a burn-in process isn't actually going to help protect against failures with your use case.
  • Zoredache
    Zoredache almost 13 years
    I agree with the point @MikeyB raises. If a drive is being shipped around then any testing you did prior to shipping it is probably void. If you need to test post shipping I suspect I would generate an md5sums/sha256sums file for all the files on the drive and send that separate from the drives. When you receive the drive with the files verify the hashes.
  • ddm-j
    ddm-j almost 13 years
    Also, use RAID6, not RAID5. Losing all redundancy because of one failing drive and subsequently putting all the remaining drives under the extra stress of a rebuild is a bad thing.
  • voretaq7
    voretaq7 almost 13 years
    @jsmith - then you have it send you notifications rather than polling it from a monitoring host -- there are almost no situations where you genuinely can't monitor something, there are just ones that require a little creative thinking :)
  • Karoly Horvath
    Karoly Horvath almost 13 years
    "Once you move the thing, you're back to where you started." - not true. this is how an MTTF graph for a disk looks like: cs.cmu.edu/~bianca/fast/img13.png cs.ucla.edu/classes/spring09/cs111/scribe/16/… testing in the early period can filter out a lot of problematic drives, manufacturers actually do these stress tests to make more reliable product but of course it's not economical to do the test till the graph flattens out.
  • Lie Ryan
    Lie Ryan almost 13 years
    @yi_H: I think tylerl is implying that the cause of early failure is probably the shipping process itself (which isn't implausible; I'd like to see a graph plotting drive failure vs. mailing frequency), so if you ship a hard drive across the country, the MTTF resets; if you mail a drive every three months, you're always in the early-failure zone.
  • datu-puti
    datu-puti almost 8 years
    What do you know? An answer that answers the question and doesn't preach at the OP. +1
  • alvinjorrel
    alvinjorrel almost 7 years
    Excellent suggestion @MichaelKjörling