Trying to remove/diagnose single Current_Pending_Sector in S.M.A.R.T. data

Solution 1

A sector is marked pending when a read fails. The pending sector will be marked reallocated if a subsequent write fails. If the write succeeds, it is removed from current pending sectors and assumed to be ok. (The exact behavior could differ slightly and I'll go into that later, but this is a close enough approximation for now.)

When you run badblocks -w, each pattern is first written, then read. It's possible that the write to the flaky sector succeeds but the subsequent read fails, which again adds it to the pending sector list. I would try writing zeroes to the entire disk with dd if=/dev/zero of=/dev/sda, checking the SMART status, then reading the entire disk with dd if=/dev/sda of=/dev/null and checking the SMART status again.
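
To compare the SMART counters after the write pass and again after the read pass, it can help to parse the attribute table instead of eyeballing it. The following is a minimal Python sketch, assuming the column layout shown in the smartctl output quoted in the question. The function name parse_smart_attributes and the SAMPLE text are illustrative only and are not part of smartmontools.

```python
def parse_smart_attributes(text):
    """Return {attribute_name: first integer of RAW_VALUE} for smartctl-style attribute rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and carry a hex FLAG in the third column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            if fields[9].isdigit():  # skip raw values that are not plain integers
                attrs[fields[1]] = int(fields[9])
    return attrs


# Rows copied from the question; in practice the text would come from smartctl's attribute listing.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
"""

for name, raw in parse_smart_attributes(SAMPLE).items():
    print(name, raw)
```

Running the parser once after the dd write and once after the dd read, then comparing the two dictionaries, shows at a glance whether attribute 5 or 197 moved.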

Update:

Based on your earlier results with badblocks -w, I would have expected the pending sector to be cleared after writing the entire disk. But since that didn't happen, it's safe to say this disk is not behaving as expected.

Let's review the description of Current Pending Sector Count:

Count of "unstable" sectors (waiting to be remapped, because of unrecoverable read errors). If an unstable sector is subsequently read successfully, the sector is remapped and this value is decreased. Read errors on a sector will not remap the sector immediately (since the correct value cannot be read and so the value to remap is not known, and also it might become readable later); instead, the drive firmware remembers that the sector needs to be remapped, and will remap it the next time it's written.[29] However some drives will not immediately remap such sectors when written; instead the drive will first attempt to write to the problem sector and if the write operation is successful then the sector will be marked good (in this case, the "Reallocation Event Count" (0xC4) will not be increased). This is a serious shortcoming, for if such a drive contains marginal sectors that consistently fail only after some time has passed following a successful write operation, then the drive will never remap these problem sectors.

Now let's review the important points:

...the drive firmware remembers that the sector needs to be remapped, and will remap it the next time it's written.[29] However some drives will not immediately remap such sectors when written; instead the drive will first attempt to write to the problem sector and if the write operation is successful then the sector will be marked good.

In other words, the pending sector should have either been remapped immediately, or the drive should have attempted to write to the sector and one of two things should have happened:

  1. The write failed, in which case the pending sector should have been remapped.
  2. The write succeeded, in which case the pending sector should have been cleared ("marked good").
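
The two outcomes above can be written down as a toy model. This is only a sketch of the behavior the Wikipedia text describes, not of any real firmware; the function name and the (pending, reallocated) tuple are assumptions made for illustration.

```python
def write_to_pending_sector(pending, reallocated, write_succeeds):
    """Toy model: one write lands on a sector that is currently counted as pending."""
    if pending == 0:
        return pending, reallocated          # nothing to resolve
    if write_succeeds:
        return pending - 1, reallocated      # outcome 2: sector is "marked good"
    return pending - 1, reallocated + 1      # outcome 1: sector is remapped to a spare


# Starting point reported in the question: Pending 1, Reallocated 0.
print(write_to_pending_sector(1, 0, write_succeeds=True))
print(write_to_pending_sector(1, 0, write_succeeds=False))
```

In this model the pending count drops to 0 on either branch, which is why a count that stays at 1 after a full-disk write is unexpected.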

I hinted at this earlier, but Wikipedia's description of Current Pending Sector suggests that the current pending sector count should always be zero after a full disk write. Since that is not the case here, we can conclude that either (a) Wikipedia is wrong (or at least incorrect for your drive), or (b) the drive's firmware cannot properly handle this error state (which I would consider a firmware bug).

If an unstable sector is subsequently read successfully, the sector is remapped and this value is decreased.

Since the current pending sector count is still unchanged after reading the entire drive, we can assert that either (a) the sector could not be successfully read or (b) the sector was successfully read and marked good, but there was an error reading a different sector. But since the reallocated sector count is still 0 after the read, we can exclude (b) as a possibility and can conclude that the pending sector was still unreadable.

At this point, it would be helpful to know if the drive has logged any new SMART errors. My next suggestion was going to be to check whether Seagate has a firmware update for your drive, but it looks like they don't.

Although I would recommend against continuing to use this drive, it sounds like you might be willing to accept the risks involved (namely, that it could continue to act erratically and/or could further degrade or fail catastrophically). In that case, you can try to install Linux, boot from a rescue CD, then (with the filesystems unmounted) use e2fsck -l filename to manually mark the appropriate block as bad. (Just make sure you maintain good backups!)

e2fsck -l filename

Add the block numbers listed in the file specified by filename to the list of bad blocks. The format of this file is the same as the one generated by the badblocks(8) program. Note that the block numbers are based on the blocksize of the filesystem. Hence, badblocks(8) must be given the blocksize of the filesystem in order to obtain correct results. As a result, it is much simpler and safer to use the -c option to e2fsck, since it will assure that the correct parameters are passed to the badblocks program.

(Note that e2fsck -c is preferred to e2fsck -l filename, and you might even want to try it, but based on your results thus far, I highly doubt e2fsck -c will find any bad blocks.)

Of course, you'll have to do some arithmetic to convert the LBA of the faulty sector (as provided by SMART) into a filesystem block number. The Bad Blocks HowTo provides a handy formula:

  b = (int)((L-S)*512/B)
where:
b = File System block number
B = File system block size in bytes
L = LBA of bad sector
S = Starting sector of partition as shown by fdisk -lu
and (int) denotes the integer part.

The HowTo also contains a complete example using this formula. After the OS is installed, you can confirm whether a file is occupying the flaky sector using debugfs (see the HowTo for detailed instructions).
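
As a quick check of the formula, here is a small Python sketch. The LBA comes from the SMART error log in the question; the partition start (2048) and the filesystem block size (4096) are hypothetical values chosen for illustration, so substitute the figures reported by fdisk -lu and your filesystem.

```python
def lba_to_fs_block(lba, partition_start, fs_block_size, sector_size=512):
    """b = (int)((L - S) * 512 / B), using integer division for the (int) part."""
    if lba < partition_start:
        raise ValueError("LBA lies before the start of the partition")
    return (lba - partition_start) * sector_size // fs_block_size


# L = 167095 (from SMART); S = 2048 and B = 4096 are assumed, not taken from the question.
print(lba_to_fs_block(167095, partition_start=2048, fs_block_size=4096))  # 20630
```

The resulting number is what would go into the file passed to e2fsck -l, provided S and B match the real partition and filesystem.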

Another option: partition around the suspected bad block

When you install your OS, you could also try to partition around the error. If I did my arithmetic right, the error is at around 81.589 MB, so you can either make /boot a little small and start your next partition after sector 167095, or skip the first 82 MB or so completely.
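
The arithmetic behind that figure can be reproduced with a short Python sketch. It assumes 512-byte sectors (as reported by smartctl in the question) and counts in units of 1,048,576 bytes; the 2048-sector (1 MiB) alignment used for the suggested partition start is an assumption, not something the drive requires.

```python
SECTOR_SIZE = 512  # bytes, as reported by smartctl in the question


def lba_offset_mib(lba):
    """Byte offset of a sector from the start of the disk, in units of 1,048,576 bytes."""
    return lba * SECTOR_SIZE / 2**20


def first_aligned_sector_after(lba, align_sectors=2048):
    """First sector past `lba` that sits on an alignment boundary (1 MiB by default)."""
    return (lba // align_sectors + 1) * align_sectors


print(round(lba_offset_mib(167095), 3))      # 81.589
print(first_aligned_sector_after(167095))    # 167936
```

Starting the next partition at sector 167936 corresponds to skipping the first 82 units of 1,048,576 bytes, which leaves sector 167095 outside every partition.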

ABRT 235018779

Unfortunately, as for the ABRT error at sector 235018779, we can only speculate, but the ATA8-ACS spec gives us some clues.

From Working Draft AT Attachment 8 - ATA/ATAPI Command Set (ATA8-ACS):

6.2.1 Abort (ABRT)

Error bit 2. Abort shall be set to one if the command is not supported. Abort may be set to one if the device is not able to complete the action requested by the command. Abort shall also be set to one if an address outside of the range of user-accessible addresses is requested if IDNF is not set to one.

Looking at the commands leading up to the ABRT (several READ SECTOR(S) followed by recalibration and reinitialization)...

Abort shall be set to one if the command is not supported. - This seems unlikely.

Abort may be set to one if the device is not able to complete the action requested by the command. - Maybe the P-list of reallocated sectors shifts the user-accessible addresses far enough that a user-accessible address translated to sector 235018779, and the read operation was not able to complete (for what reason, we don't know...but there wasn't a CRC error, so I don't think we can conclude that sector 235018779 is bad).

Abort shall also be set to one if an address outside of the range of user-accessible addresses is requested if IDNF is not set to one. - To me this seems most likely, and I would probably interpret it as the result of a software bug (either your OS or some program you were running). In that case, it is not a sign of impending doom for the hard drive.
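
To check the out-of-range reading against the error log, the registers can be decoded by hand. The Python sketch below assumes the conventional 28-bit ATA register layout (low nibble of the Device/Head register as address bits 24-27, and bit 6 of that register as the LBA-mode flag); the function names are illustrative. It reproduces both numbers smartctl printed.

```python
def decode_lba28(sn, cl, ch, dh):
    """Combine the task-file registers into a 28-bit LBA (low nibble of DH is bits 24-27)."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def uses_lba_addressing(dh):
    """Bit 6 of the Device/Head register selects LBA (set) or CHS (clear) addressing."""
    return bool(dh & 0x40)


# Register values copied from the two SMART error log entries in the question.
ABRT_REGS = {"sn": 0x1B, "cl": 0x1A, "ch": 0x02, "dh": 0xAE}
UNC_REGS = {"sn": 0xB7, "cl": 0x8C, "ch": 0x02, "dh": 0xE0}
TOTAL_SECTORS = 234441658  # sector count quoted in the question

for label, regs in (("ABRT", ABRT_REGS), ("UNC", UNC_REGS)):
    lba = decode_lba28(**regs)
    print(label, lba, "in range:", lba < TOTAL_SECTORS,
          "LBA bit set:", uses_lba_addressing(regs["dh"]))
```

The ABRT entry decodes to an address past the last user-accessible sector, and its LBA bit is clear, so the request may have been a CHS-style address that was simply printed as an LBA. That is an inference from the register dump rather than something the drive reports, but it fits the software-bug reading better than a failing sector.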

Just in case you're not tired of running diagnostics yet...

You could try smartctl -t long /dev/sda again to see if it produces any more errors in the SMART log, or you could leave this one as an unsolved X-file ;) and check the SMART log periodically to see whether it happens again. In any case, if you continue to use the drive without getting it to either reallocate or clear the pending sector, you're already taking a risk.

Use a checksumming filesystem

For a little more safety, you may want to consider using a checksumming filesystem such as ZFS or btrfs to help protect against low-level data corruption. And don't forget to perform frequent backups if you have anything that cannot be easily reproduced.

Solution 2

The article Bad Sector Remapping gives the algorithm used.

There are two lists of defects on the hard disk:

  • P-list entries are defects found during manufacture, also known as Primary Defects. They sequentially follow the normal sectors. A bad sector points to its replacement using a shift number (first is +1, then +2, etc.).
  • G-list entries are defects that develop in normal use of the drive, known as Grown Defects. There are no constraints on their allocation and they do not need to sequentially follow the P-list defects. A bad sector points to its replacement using a simple sector number.

Therefore the fact that your bad sector is 577121 sectors beyond the normal last sector does not mean that you have 577121 bad sectors, unless it is a P-list defect. A G-list defect can be placed anywhere, so it's entirely possible that the firmware allocated it at the end of the spare sector space.
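
For scale, the distance between the reported address and the end of the user-accessible range can be worked out in a few lines of Python. Both sector numbers are taken from the question; the 512-byte sector size is the one smartctl reports, and the variable names are illustrative.

```python
TOTAL_SECTORS = 234441658   # user-accessible sectors, as quoted in the question
ABRT_LBA = 235018779        # address from the ABRT entry in the SMART error log
SECTOR_SIZE = 512           # bytes

sectors_beyond_end = ABRT_LBA - TOTAL_SECTORS
size_mib = sectors_beyond_end * SECTOR_SIZE / 2**20
share_of_disk = sectors_beyond_end / TOTAL_SECTORS * 100

print(sectors_beyond_end)          # 577121
print(round(size_mib, 1))          # 281.8
print(round(share_of_disk, 2))     # 0.25
```

So the address sits roughly 282 MB, or about a quarter of a percent of the disk, past the last normal sector.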

From Wikipedia's Known ATA S.M.A.R.T. attributes:

Reallocated Sectors Count

Count of reallocated sectors. When the hard drive finds a read/write/verification error, it marks that sector as "reallocated" and transfers data to a special reserved area (spare area). This process is also known as remapping, and reallocated sectors are called "remaps". The raw value normally represents a count of the bad sectors that have been found and remapped.

Current Pending Sector Count

Count of "unstable" sectors (waiting to be remapped, because of unrecoverable read errors). If an unstable sector is subsequently read successfully, the sector is remapped and this value is decreased. Read errors on a sector will not remap the sector immediately (since the correct value cannot be read and so the value to remap is not known, and also it might become readable later); instead, the drive firmware remembers that the sector needs to be remapped, and will remap it the next time it's written.

So in fact, pending sectors are much worse than remapped ones, since the error is severe enough to prevent the drive from reading the original contents in order to remap them. In effect, the contents of that sector are probably lost forever.

The document MHDD Very low level Hard Disk diagnostic tool explains the error codes as:

UNC : data is uncorrectable
ABRT : command was aborted

So sector 167095 is uncorrectable and reading/writing to 235018779 was aborted.

As writing to both sectors did not change the status from pending to remapped, it seems to me that the replacement sector is also bad. My theory is that sector 167095 was remapped to sector 235018779, but that unfortunately the latter is also bad, and that the firmware does not know how to re-remap bad spare sectors. The result is an uncorrectable bad sector.

Author: Ivan Kovacevic

Updated on September 18, 2022

Comments

  • Ivan Kovacevic
    Ivan Kovacevic over 1 year

    I'm in the process of doing a fresh Linux install, and before doing so I thought it would be a good time to verify the HDD's health, since I can safely overwrite any data on it if needed.

    First I tried checking with smartmontools. My Seagate HDD reports one current pending sector and one offline uncorrectable sector (presumably the same one). The reallocated sector count is zero.

    5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
    ...
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
    

    However, the SMART self-tests (short, long, offline, conveyance) find no errors.

    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Completed without error       00%      6631         -
    # 2  Conveyance offline  Completed without error       00%      6630         -
    # 3  Extended offline    Completed without error       00%      6622         -
    # 4  Short offline       Completed without error       00%      6600         -
    # 5  Extended offline    Completed without error       00%      6632         -
    

    I've also tried running badblocks -wsv (a full read-write test with 4 pattern passes) on the drive, and no bad blocks were found. I then followed the guide found here (to the extent possible, since I had deleted my filesystem by running badblocks): http://smartmontools.sourceforge.net/badblockhowto.html

    There it says that if I overwrite the sector with all zeros, the disk should reallocate the pending sector. The last write pattern badblocks uses is all zeros, so that should have done it. However, nothing has changed: I still have a pending sector count of 1.
    I then tried figuring out which sector is the problematic one, and the SMART output contains an error log:

    Error 2 occurred at disk power-on lifetime: 5344 hours (222 days + 16 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      84 51 7c 1b 1a 02 ae  Error: ABRT at LBA = 0x0e021a1b = 235018779
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      20 20 7f 18 1a 02 ae 00      00:09:05.228  READ SECTOR(S)
      20 20 01 17 1a 02 ae 00      00:09:05.228  READ SECTOR(S)
      20 20 01 01 00 00 a0 00      00:08:59.830  READ SECTOR(S)
      91 20 3f 01 00 00 af 00      00:08:59.826  INITIALIZE DEVICE PARAMETERS [OBS-6]
      10 20 01 01 00 00 a8 00      00:08:59.678  RECALIBRATE [OBS-4]
    
    Error 1 occurred at disk power-on lifetime: 5009 hours (208 days + 17 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 b7 8c 02 e0  Error: UNC at LBA = 0x00028cb7 = 167095
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      25 20 1e 9e 8c 02 e0 00      00:02:20.691  READ DMA EXT
      25 20 1e 80 8c 02 e0 00      00:02:20.691  READ DMA EXT
      25 20 1e 62 8c 02 e0 00      00:02:20.690  READ DMA EXT
      25 20 1e 44 8c 02 e0 00      00:02:20.690  READ DMA EXT
      25 20 1e 26 8c 02 e0 00      00:02:20.690  READ DMA EXT
    

    So apparently the drive had two errors.

    84 51 7c 1b 1a 02 ae  Error: ABRT at LBA = 0x0e021a1b = 235018779
    

    and

    40 51 00 b7 8c 02 e0  Error: UNC at LBA = 0x00028cb7 = 167095
    

    So I assumed that these are the sector numbers, 167095 and 235018779, and I tried writing zeros to them with dd:

    dd if=/dev/zero of=/dev/sda bs=512 count=1 seek=167095
    

    That one went OK. However, when I tried the other sector:

    dd if=/dev/zero of=/dev/sda bs=512 count=1 seek=235018779
    

    I get dd: '/dev/sda': cannot seek: Invalid argument. I then spotted that my HDD only has 234441658 sectors, so this address is out of range. But then why did SMART report an error at that address?!

    Can anyone help me figure that out, and also advise me how to do this correctly if I'm doing it wrong? I suspect that I may be wrong in using a block size of 512 with dd. That is the sector size reported by SMART. Thinking that those LBA addresses might be bytes rather than blocks, I tried setting bs=1 and writing only one byte to each of those addresses on the HDD. The dd write did work… However, the pending sector count still did not change after that. I also called sync and smartctl -t offline /dev/sda to try 'forcing' the drive to reallocate the sector. Nothing...

    Here is my full smartctl --all /dev/sda output:

    smartctl 5.43 2012-06-30 r3573 [i686-linux-2.6.32-358.el6.i686] (local build)
    Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
    
    === START OF INFORMATION SECTION ===
    Model Family:     Seagate Barracuda 7200.9
    Device Model:     ST3120811AS
    Serial Number:    6PT1N4VZ
    Firmware Version: 3.AAE
    User Capacity:    120,034,123,776 bytes [120 GB]
    Sector Size:      512 bytes logical/physical
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   7
    ATA Standard is:  Exact ATA specification draft version not indicated
    Local Time is:    Mon Nov 18 12:03:00 2013 UTC
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    General SMART Values:
    Offline data collection status:  (0x82) Offline data collection activity
                        was completed without error.
                        Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                        without error or no self-test has ever 
                        been run.
    Total time to complete Offline 
    data collection:        (  430) seconds.
    Offline data collection
    capabilities:            (0x5b) SMART execute Offline immediate.
                        Auto Offline data collection on/off support.
                        Suspend Offline collection upon new
                        command.
                        Offline surface scan supported.
                        Self-test supported.
                        No Conveyance Self-test supported.
                        Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                        power-saving mode.
                        Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                        General Purpose Logging supported.
    Short self-test routine 
    recommended polling time:    (   1) minutes.
    Extended self-test routine
    recommended polling time:    (  51) minutes.
    
    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   084   077   006    Pre-fail  Always       -       185600113
      3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   098   098   020    Old_age   Always       -       2185
      5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   073   055   030    Pre-fail  Always       -       25890559714
      9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       6632
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   098   098   020    Old_age   Always       -       2229
    187 Reported_Uncorrect      0x0032   099   099   000    Old_age   Always       -       1
    189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
    190 Airflow_Temperature_Cel 0x0022   071   056   045    Old_age   Always       -       29 (Min/Max 25/29)
    194 Temperature_Celsius     0x0022   029   044   000    Old_age   Always       -       29 (0 13 0 0 0)
    195 Hardware_ECC_Recovered  0x001a   052   046   000    Old_age   Always       -       194244099
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
    202 Data_Address_Mark_Errs  0x0032   066   219   000    Old_age   Always       -       34
    
    SMART Error Log Version: 1
    ATA Error Count: 2
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.
    
    Error 2 occurred at disk power-on lifetime: 5344 hours (222 days + 16 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      84 51 7c 1b 1a 02 ae  Error: ABRT at LBA = 0x0e021a1b = 235018779
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      20 20 7f 18 1a 02 ae 00      00:09:05.228  READ SECTOR(S)
      20 20 01 17 1a 02 ae 00      00:09:05.228  READ SECTOR(S)
      20 20 01 01 00 00 a0 00      00:08:59.830  READ SECTOR(S)
      91 20 3f 01 00 00 af 00      00:08:59.826  INITIALIZE DEVICE PARAMETERS [OBS-6]
      10 20 01 01 00 00 a8 00      00:08:59.678  RECALIBRATE [OBS-4]
    
    Error 1 occurred at disk power-on lifetime: 5009 hours (208 days + 17 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 b7 8c 02 e0  Error: UNC at LBA = 0x00028cb7 = 167095
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      25 20 1e 9e 8c 02 e0 00      00:02:20.691  READ DMA EXT
      25 20 1e 80 8c 02 e0 00      00:02:20.691  READ DMA EXT
      25 20 1e 62 8c 02 e0 00      00:02:20.690  READ DMA EXT
      25 20 1e 44 8c 02 e0 00      00:02:20.690  READ DMA EXT
      25 20 1e 26 8c 02 e0 00      00:02:20.690  READ DMA EXT
    
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Completed without error       00%      6631         -
    # 2  Conveyance offline  Completed without error       00%      6630         -
    # 3  Extended offline    Completed without error       00%      6622         -
    # 4  Short offline       Completed without error       00%      6600         -
    # 5  Extended offline    Completed without error       00%      6632         -
    
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
    

    UPDATE:

    As suggested in the answer from rob I tried overwriting the entire HDD with zeroes. Checked SMART values and then started reading the whole HDD. Again checked SMART values. The result is: SMART values regarding the pending/reallocated sector count do not change, in both cases, immediately after write and then after read. Reallocated 0. Pending 1.

    • gronostaj
      gronostaj over 10 years
      I guess your drive has 234441658 sectors, but backup sectors remapped in place of bad sectors don't count into this number.
    • Ivan Kovacevic
      Ivan Kovacevic over 10 years
      Hmm, so that error on sector 235018779 would mean an error in backup sectors… Is that possible?
    • gronostaj
      gronostaj over 10 years
      Well, backup sectors can be corrupt too. Otherwise we would make "immortal" hard drives from backup sectors only.
    • Ivan Kovacevic
      Ivan Kovacevic over 10 years
      :) … Well, my reasoning was that backup sectors are not in use (and therefore safe). I presumed that the HDD surface can only get corrupted if the disk heads make an improper movement, because of a power failure or something.
    • Ivan Kovacevic
      Ivan Kovacevic over 10 years
      Assuming that sector 235018779 is a backup sector, that means I should have at least 235018779 - 234441658 = 577121 backup sectors. That is almost 282 MB in backup sectors. Seems like a lot (too much) to me. Or is it? Just thinking out loud: maybe it's not a backup sector but a glitch in SMART diagnostics?
    • gronostaj
      gronostaj over 10 years
      It's just my wild guess, but I think it makes sense that backup sectors are indexed after normal sectors, that would match their physical location. 577121 sectors is just 0.2% of the total capacity, it's not that much if you think about it this way. Maybe that "normal" bad sector has already been remapped and the backup sector that replaced it failed too. Chances of that happening are low, but you know - Murphy's law...
    • rob
      rob over 10 years
      Keep in mind that badblocks -w writes, then reads each pattern. If the write succeeds, the pending sector is cleared until a subsequent read fails.
    • week
      week over 10 years
      I don't think the ABRT error points to a bad sector; more likely a bad address.
    • rob
      rob over 10 years
      Were any new SMART errors logged after writing and reading the entire disk?
    • Ivan Kovacevic
      Ivan Kovacevic over 10 years
      Nope, none! Seems fine! Yet somehow I hate that feeling of uncertainty.
  • Ivan Kovacevic
    Ivan Kovacevic over 10 years
    Good idea, I will try that right now.
  • week
    week over 10 years
    What about trying this just with that bad sector 167095? :)
  • Ivan Kovacevic
    Ivan Kovacevic over 10 years
    Naah, that's too boring :D. I'll try with the suspicious sector first, definitely smart advice. If that doesn't do anything, I will let it run on the whole drive just in case…
  • rob
    rob over 10 years
    @week that should do the trick but it seems he's having trouble zeroing in on the bad sector so that's why I suggested just doing the whole drive.
  • week
    week over 10 years
    Well, IF ABRT meant a bad sector, then it's of no concern to the user. Why bother with space that is not allocated!? But again, I don't think it means a bad sector.
  • rob
    rob over 10 years
    @IvanKovacevic If you still have trouble clearing the pending sector by writing to a single sector or block, you might want to try writing to a range instead--for example, dd if=/dev/zero of=/dev/sda bs=512 count=100 seek=167050. Then if that doesn't clear the pending sector, try a larger range or the entire drive.
  • Ivan Kovacevic
    Ivan Kovacevic over 10 years
    Nice article, I definitely learned something new! However, this still does not explain why the bad sector in the SMART logs is reported in the spare sector area rather than in the normal usable space, or why the pending sector counter is still 1 and the reallocated sector counter is still 0. If everything worked as it should, these two counters should have swapped their values.
  • Ivan Kovacevic
    Ivan Kovacevic over 10 years
    OK, I'll do that. Thanks guys for the advice. I will report back in a minute or two...
  • Ivan Kovacevic
    Ivan Kovacevic over 10 years
    I tried a couple of writes like dd if=/dev/zero of=/dev/sda seek=167095 bs=512 count=1, then I subtracted 512 from 167095 and tried from that lower sector with a greater count, like dd if=/dev/zero of=/dev/sda seek=166583 bs=512 count=100. I did that a few times, then checked the SMART values: nothing changed. After that I tried doing reads like dd if=/dev/sda of=/dev/null skip=167095 bs=512 count=1, and then again with a bigger count from the lower sector, a couple of times. Checked SMART again. All the same! Nothing changed. Now I can try the full disk.
  • Ivan Kovacevic
    Ivan Kovacevic over 10 years
    OK, that subtraction was stupid… because if I specify bs=512 then I should have subtracted only 1, not 512… or I should have set count to at least 513. I'll try that now. However, the write with a single count should have worked all right…
  • harrymc
    harrymc over 10 years
    See my edit above.
  • Ivan Kovacevic
    Ivan Kovacevic over 10 years
    Thanks! Great info! Now I have a question: since 167095 was not remapped, is it advisable to use this HDD? Did the HDD just mark that sector as bad, so that it will avoid using it in the future? Basically I need to decide: can I proceed and install Linux, should I throw away this HDD, buy a new one and install Linux on that, or can I do something (execute a command) to mark that sector as bad manually and then install Linux (my favorite option)?
  • harrymc
    harrymc over 10 years
    A large disk with only two bad sectors does not merit being junked. As badblocks succeeded, hopefully it marked that sector as bad. I would try to install Linux to it, but do a full format if your distribution can do that during the installation. But if this is for an important production system, I would change the disk, just in case.
  • rob
    rob over 10 years
    If there is still a pending sector after writing to the entire drive, then the bad sector remapping is not working correctly and you should replace the drive (or, if you're a gambling man, continue using it knowing that it may behave erratically).
  • Ivan Kovacevic
    Ivan Kovacevic over 10 years
    Thank you for the updated info in your answer! Question: since 167095 is a relatively low sector, basically at the beginning of the drive, it is certain that this sector will contain files after I install Linux. Can e2fsck handle that? Does it transparently move the data from that sector to some other area and then mark the sector as bad?
  • Ivan Kovacevic
    Ivan Kovacevic over 10 years
    Also, I have to note that in your answer you focused on sector 167095, but there is still the question of what is going on with sector 235018779. Maybe the drive figured that everything is OK with 167095 and just decreased the pending counter without reallocating. Then 14 days later (since that PC is on 24/7, power_on_hours matches real time) an error happened in the spare sector area, namely at 235018779. Since the drive was not programmed to handle errors in spare sectors, the pending count has been stuck at 1. Just a theory of mine; I don't know, of course, what really happened...
  • rob
    rob over 10 years
    It's hard to say conclusively what caused the ABRT, but I've added some more to my answer.
  • Ivan Kovacevic
    Ivan Kovacevic over 10 years
    Thanks, I've just read your additions. I think you summed it up pretty well! I have since run smartctl -t long /dev/sda not once but a few times, and no new errors appeared.