What exactly is a URE?

Solution 1

A URE is an Unrecoverable Read Error. Something has happened that has caused the reading of a sector to fail that the drive cannot fix. The drive electronics are sophisticated; they will only pass the data up if they have been able to read it correctly from the disk. The drive electronics will try multiple times to read a bad sector before declaring it damaged.

What causes the read error - I'm not an expert here (arm waving ensues) but drive aging can cause manufacturing tolerances to become relevant. Magnetic domains can become weakened. Cosmic rays can cause damage etc. Essentially it is a random failure.

How does this affect RAID 5?

A RAID 5 consists of block level striping with distributed parity. The parity blocks are calculated by XORing the bits from the data blocks together. For two bits, the XOR function basically says: if the bits are the same the result is 0, otherwise it is 1. When calculating parity across more than two blocks, you take the first 2 bits and XOR them, then XOR the result with the next bit, and so on, e.g.

1010   data      or    1010 data
1100   data            1100 data
0110   parity          0011 data
                       0101 parity

The nature of the XOR function is such that if any disk dies and is replaced, the data that should be on it can be reconstructed from the remaining disks.

1010  data       or    1010 data
      damaged               damaged
0110  parity           0011 data
                       0101 parity

As you can see the damaged data can be reconstructed by XORing the remaining data and parity.
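The parity calculation and the reconstruction above can be sketched in a few lines of Python. This is only an illustration using the 4-bit values from the examples; the helper name `parity` is made up here, and a real RAID implementation works on whole blocks rather than single integers.

```python
from functools import reduce
from operator import xor


def parity(blocks):
    """XOR all blocks together; works for any number of blocks."""
    return reduce(xor, blocks)


# Three-disk example: two data blocks plus one parity block.
data = [0b1010, 0b1100]
p = parity(data)
print("parity :", format(p, "04b"))

# The disk holding the second data block dies: rebuild it from the survivors.
rebuilt = parity([data[0], p])
print("rebuilt:", format(rebuilt, "04b"))

# Four-disk example: three data blocks plus one parity block.
data4 = [0b1010, 0b1100, 0b0011]
p4 = parity(data4)
rebuilt4 = parity([data4[0], data4[2], p4])
print("parity :", format(p4, "04b"))
print("rebuilt:", format(rebuilt4, "04b"))
```

Because XOR is its own inverse, the same function is used both to compute the parity and to rebuild a missing block.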

How does a URE affect this?

A URE is only significant during a RAID 5 rebuild.

When you reconstruct a RAID 5 there is a large amount of reading to be done. Every data block on the surviving disks needs to be read in order to reconstruct the data on the new disk. If a URE occurs then the data for the relevant block cannot be recovered, so your data is inconsistent. For sufficiently large disks in a sufficiently large R5, the number of bits that must be read to reconstruct the replaced disk exceeds the rated URE interval of, for example, one error per 10^14 bits read.
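To put rough numbers on this, here is a minimal sketch. It assumes that every bit read fails independently with probability 1 in 10^14, which is a simplification and not something a spec sheet promises; the function name and the array sizes are made up for illustration.

```python
import math

TB = 10**12  # drive vendors use decimal terabytes


def p_at_least_one_ure(bytes_read, bits_per_ure=1e14):
    """Chance of at least one URE while reading `bytes_read` bytes.

    Assumption (not from any spec sheet): every bit read fails
    independently with probability 1 / bits_per_ure.
    """
    bits = bytes_read * 8
    # 1 - (1 - p)**bits, written with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits * math.log1p(-1.0 / bits_per_ure))


# RAID 5 rebuild: every surviving disk has to be read in full.
for disks, size_tb in [(3, 4), (3, 8), (6, 8)]:
    to_read = (disks - 1) * size_tb * TB
    chance = p_at_least_one_ure(to_read)
    print(f"{disks} x {size_tb} TB: read {to_read // TB} TB, P(URE) ~ {chance:.0%}")
```

Under this simple model, reading exactly 10^14 bits (12.5 TB) gives about a 63% chance of at least one URE rather than certainty, and the chance keeps climbing as the amount of data to read grows.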

Solution 2

So what exactly is an URE, I mean concretely?

Hard disks do not simply store the data that you ask them to. Because of the ever-decreasing magnetic domain sizes, and the fact that hard disks store data in an analog rather than binary fashion (the hard disk firmware gets an analog signal from the platter, which is translated into a binary signal, and this translation is part of the manufacturer's secret sauce), there is virtually always some degree of error in a read, which must be compensated for.

To ensure that data can be read back, the hard disk also stores forward error correction data along with the data you asked it to store.

Under normal operations, the FEC data is sufficient to correct the errors in the signal that is read back from the platter. The firmware can then reconstruct the original data, and all is well. This is a recoverable read error which is exposed in SMART as the read error rate attribute (SMART attribute 0x01) and/or Hardware ECC Recovered (SMART attribute 0xc3).

If for some reason the signal degrades below a certain point, the FEC data is no longer sufficient to reconstruct the original data. At that point, the theory goes, the firmware will still be able to detect that the data could not be read back reliably, but it can't do anything about it. If multiple such reads fail, the disk has to somehow inform the rest of the computer that the read couldn't be performed successfully. It does so by signalling an unrecoverable read error. This also increases the Reported Uncorrectable Errors (SMART attribute 0xbb) counter.

An unrecoverable read error, or URE, is simply a report that for whatever reason, the payload data plus the FEC data was insufficient to reconstruct the originally stored data.

Keep in mind that URE rates are statistical. You won't encounter any hard disk where you can read exactly 10^14 (or 10^15) - 1 bits successfully and then the next bit fails. Rather, it's a statement by the manufacturer that on average, if you read (say) 10^14 bits, then at some point during that process you will encounter one unreadable sector.

Also, following on the last few words above, keep in mind that URE rates are given in terms of sectors per bits read. Because of how data is stored on the platters, the disk cannot tell which part of a sector is bad, so if a sector fails the FEC check, then the entire sector is considered to be bad.

Solution 3

the sector dies : as well totally unrecoverable, but here I do not understand why the 4TB disk is rated at 10^14 for the URE and the 8TB is as well rated at 10^14 for the URE, that would mean the sectors on the 8TB (most likely newer tech) are half as reliable as the ones on the 4TB, that does not make sense.

The specification is usually "on average 1 error is detected while reading n bits", so the drive size does not matter. It matters if you calculate your risk that an error will happen on your drive and workload, but the manufacturer only states that it takes n bits read to find an error (on average, not guaranteed).

Example: with a 10^14 rating, you would have to read a 1TB drive about 12 times over to find an error, while an 8TB drive might experience one on its second full read. The number of bits read is the same both times, so the quality of the magnetic media is roughly the same.
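The arithmetic behind that example can be sketched as follows. The 10^14 rating and the decimal-terabyte capacities are assumptions for illustration, and the function name is made up.

```python
URE_RATE_BITS = 10**14  # assumed rating: on average one error per 1e14 bits read


def full_reads_per_error(capacity_tb, ure_rate_bits=URE_RATE_BITS):
    """Average number of complete end-to-end reads per expected error."""
    capacity_bits = capacity_tb * 10**12 * 8  # decimal TB -> bits
    return ure_rate_bits / capacity_bits


for tb in (1, 4, 8):
    reads = full_reads_per_error(tb)
    print(f"{tb} TB drive: about {reads:.2f} full reads per expected error")
```

The number of bits per expected error is the same for every drive size; only the number of full passes needed to reach it changes.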

What you pay for in increased price are other factors: the ability to cram 8TB into the physical space of 1TB, greatly reduced energy consumption, fewer head crashes while moving the drive, etc.

Author by

Memes

Updated on September 18, 2022

Comments

  • Memes
    Memes almost 2 years

    I have been looking into RAID5 vs RAID6 lately and I keep seeing that RAID5 is not secure enough anymore because of the URE ratings and the increasing size of the drives. Basically, most of the content I found says that in RAID5, in case you have a disk failure, if the rest of your array is 12TB, then you have an almost 100% chance to meet a URE and to lose your data.

    The 12TB figure comes from the fact that disks are rated at 10^14 bits read to reach one URE.

    Well, there is something I do not get here. A read is done by the head moving over the sector, so what can make the read fail is either the head dying or the sector dying. It can also be that the read does not work for some other reason (I don't know, like a vibration making the head jump...). So, let me address all 3 situations:

    • the reading does not work: that is not unrecoverable, right? It can be tried again.
    • the head dies: this would for sure be unrecoverable, but that also means the full platter (or at least the side) would be unreadable. It would be more alarming, no?
    • the sector dies: this is also totally unrecoverable, but here I do not understand why the 4TB disk is rated at 10^14 for the URE and the 8TB is also rated at 10^14 for the URE; that would mean the sectors on the 8TB (most likely newer tech) are half as reliable as the ones on the 4TB, that does not make sense.

    As you see, of the 3 failure points I identify, none makes sense. So what exactly is a URE, I mean concretely?

    Is there somebody who can explain that to me?

    Edit 1

    After the first wave of answers, it seems the reason is the sector failing. The good thing is that the firmware, RAID controller and OS + filesystem have procedures in place to detect that early and reallocate sectors.

    Well, I now know what a URE is (actually, the name is quite self-explanatory :) ).

    I am still puzzled by the underlying causes and mostly by the stable rating they give.

    Some attributed the failing sector to external sources (cosmic rays). I am then surprised that the URE rate is based on the read count and not on the age: cosmic rays should indeed have more impact on an older disk simply because it has been exposed longer. I think this is more of a fantasy, though I might be wrong.

    Now comes the other reason, which relates to the wear of the disk. Some pointed out that higher densities give weaker magnetic domains; that totally makes sense and I would follow the explanation. But as it is nicely explained here, the different sizes of newer disks are obtained mostly by putting more or fewer of the same platter (and thus the same density) in the HDD chassis. The sectors are the same and all should have the very same reliability, so bigger disks should have a higher rating than smaller disks, since each sector is read less often. This is not the case. Why? That would, though, explain why the newer disks with newer tech get no better rating than the old ones: the gain from better tech is offset by the loss due to higher density.

    • Sirex
      Sirex over 7 years
      "URE and to lose your data" afaik (and i may be wrong), a URE means only that some data is lost, not all of it - and you can try the rebuild again after hitting the URE. That said, raid 10 or zfs is kinda where it's at these days.
    • MadHatter
      MadHatter over 7 years
      "sectors [on newer discs] are half as reliable as [on the old], that does not make sense" I'm not sure I agree. As the magnetic zones become ever smaller (which higher data densities in the same-size package implies), it's very reasonable that they become ever more susceptible to accidental erasure (local gamma-ray emissions, cosmic ray event, and so on). This increasing susceptibility of modern drives is why none of us would deploy un-RAIDed drives in anything that matters, and one reason why most of us have given up on RAID-5.
    • answer42
      answer42 over 7 years
      "that would mean the sectors on the 8TB (most likely newer tech) are half as reliable as the ones on the 4TB, that does not make sense." where are you getting that? The spec is saying that they're the same.
    • Memes
      Memes over 7 years
      @Sirex from all the calculations I saw, RAID 10 would not be that much safer actually. let's say I have on one side 3x4TB RAID5, 1 disk fails, for the rebuild, I have to read 8TB (or two third of the 12TB figure), I have no failsafe. now let's look at 4x4TB RAID10 (same 8TB total size), in case I have one disk failing, I need to rebuild whichever RAID1 sub-array has failed and the reading would be 4TB, so basically we could say twice as safe, because I also have no failsafe if I meet URE. Actually RAID6 would be the best here (safety wise). no?
    • Memes
      Memes over 7 years
      @MadHatter as far as I read, many newer 2TB disks are actually same platters as the 4TB but just half the number of platters. so, if that is true, then the newer 2TB should be as "bad" as the newer 4TB and we actually should see a URE rating going down for a given size of disks. no?
    • Memes
      Memes over 7 years
      @MadHatter this confirms my comment above rml527.blogspot.hk/2010/10/…
    • Memes
      Memes over 7 years
      @MSalters thanks, yes, it seems to be the key element indeed.
    • Memes
      Memes over 7 years
      @hobbs if you have twice as many sectors, you have half as many reads per sector. So per my assumption that wear is the reason, the 8TB sectors are less reliable. That is where I get that.
    • answer42
      answer42 over 7 years
      @Memes no, the numbers cancel out. Twice as many sectors is also twice as many opportunities for failure, so the same read error rate equals the same reliability on a per-byte basis. Which is why it's used in the first place.
  • MadHatter
    MadHatter over 7 years
    A single 8TB disc has over 6*10^13 bits on, so with merely three such discs in a RAID-5, a URE is more likely than not during a reconstruct. Oh, and +1 from me.
  • user
    user over 7 years
    URE rates are quoted in full sectors per bits read (or its inverse). So if the disk uses 4,096-byte sectors, that single URE botches the whole sector.
  • user9517
    user9517 over 7 years
    Yes, that's why I said Something has happened that has caused the reading of a sector to fail that the drive cannot fix
  • Memes
    Memes over 7 years
    OK, so it seems to point towards the sector failing. I totally get the statistics part, no worry. I also see here that the reliability of the sector decreases as the density goes higher, but that still does not make sense. Newer disks usually have the same platter density no matter the physical size; the 4TB will just have fewer platters than the 6TB. Basically the sectors are the same, so why is the 8TB not able to achieve a statistically higher value? There are twice as many sectors, so each is read half as much (statistically). They should then fail less, no?
  • Memes
    Memes over 7 years
    sorry, it still does not make sense to me. if the reason is the cosmic rays, then, it does not matter how many reads you do but mostly how old are the drives, because older drives have more chances to have been exposed, no? concerning the weakened magnetic domains, here again, I question the cause, if it is external, then having twice as many makes the chance to catch the trigger higher, bigger drives should have lower ratings. if it is the wear of the domains by its readings, then bigger drives should have higher rating as they are statistically read less.
  • Memes
    Memes over 7 years
    ok, now I see that if it is a mix, that might very well balance out :)
  • David Balažic
    David Balažic over 6 years
    The claim (written in the question and in some answers and comments, also in other questions, in fact all over the internet) that after reading 12TB a read error is almost certain is false. Don't believe it? Don't. Know it. By reading 12 (or more) TB from any of your disks and observing that no error happened. Please do it and stop this myth. Thank you.
  • Ian Kemp
    Ian Kemp about 5 years
    @DavidBalažic Since 2016 most consumer disks have been uprated to a URE of 10^15, which means you get 125TB of reads before a URE occurs - more than sufficient for most users. But if you have disks that are only rated to 10^14, then it's almost certain that you will encounter a URE after reading 12.5TB. You might get lucky and have no URE, but when it comes to data integrity, one doesn't rely on luck.
  • David Balažic
    David Balažic about 5 years
    @IanKemp No it isn't. I tried it. You obviously didn't. (also, the better rating just moves the myth a bit, no real change)
  • Ian Kemp
    Ian Kemp about 5 years
    @DavidBalažic Evidently, your sample size of one invalidates the entirety of probability theory! I suggest you submit a paper to the Nobel Committee.
  • David Balažic
    David Balažic about 5 years
    @IanKemp If someone claims that all numbers are divisible by 7 and I find ONE that is not, then yes, a single find can invalidate an entire theory. BTW, still not a single person has confirmed the myth in practice (by experiment), did they? Why should they, when belief is more than knowledge...
  • David Balažic
    David Balažic about 5 years
    @IanKemp after one week still no URE, huh?
  • Murmel
    Murmel over 4 years
    @DavidBalažic If someone would claim P("URE in 10^15 reads") = 1, you could debunk this with one experiment showing you actually did not see one URE while performing 10^15 reads - true. But as a URE is only expected to happen at most after 10^x (assuming which probability distribution?), you can't prove anything even with a sample size of one (tbh even bigger sample sizes would not prove a lot more).
  • David Balažic
    David Balažic over 4 years
    @Murmel still waiting for a one confirmed case of this myth ...