Diagnosing a Faulty Disk

Recently, I had a lot of difficulty with Windows on my home laptop. It would boot, but the OS very quickly became unresponsive. I had no such issues booting into Ubuntu (I’m typing this from there). I tried booting Windows into Safe Mode and disabling auxiliary services and devices, to no avail.

Nonetheless, I could use Ubuntu to perform some analysis. I noticed rather unpleasant-looking entries in the Linux kernel’s message buffer (as shown by dmesg).

[   20.588310] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[   20.588311] ata6.00: irq_stat 0x40000001
[   20.588312] ata6.00: failed command: READ DMA
[   20.588315] ata6.00: cmd c8/00:08:40:28:1c/00:00:00:00:00/e0 tag 22 dma 4096 in
[   20.588315]          res 51/40:08:40:28:1c/00:00:00:00:00/e0 Emask 0x9 (media error)
[   20.588316] ata6.00: status: { DRDY ERR }
[   20.588316] ata6.00: error: { UNC }
[   20.619619] ieee80211 phy0: Selected rate control algorithm 'iwl-mvm-rs'
[   20.647929] ata6.00: configured for UDMA/133
[   20.647935] sd 5:0:0:0: [sdb]  
[   20.647936] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   20.647937] sd 5:0:0:0: [sdb]  
[   20.647938] Sense Key : Medium Error [current] [descriptor]
[   20.647939] Descriptor sense data with sense descriptors (in hex):
[   20.647942]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
[   20.647943]         00 1c 28 40 
[   20.647944] sd 5:0:0:0: [sdb]  
[   20.647945] Add. Sense: Unrecovered read error - auto reallocate failed
[   20.647946] sd 5:0:0:0: [sdb] CDB: 
[   20.647948] Read(10): 28 00 00 1c 28 40 00 00 08 00
[   20.647949] end_request: I/O error, dev sdb, sector 1845312
[   20.647955] ata6: EH complete

Hmm, UNC stands for uncorrectable, meaning the drive could not read the data back even with error correction; not great. These errors were easily reproduced. Since I could still view most of the file system when mounting the SSD in Linux, I immediately backed up the important data to an external hard drive, though this was probably unnecessary (the really important data is already backed up in the cloud).
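As a sanity check on the log above, the failing sector can be recovered from the SCSI command bytes themselves. The sketch below assumes the standard READ(10) layout (opcode 0x28, a 4-byte big-endian LBA in bytes 2 to 5, a 2-byte transfer length in bytes 7 to 8); `decode_read10` is a hypothetical helper name, not part of any tool used here.

```python
def decode_read10(cdb_hex: str) -> tuple[int, int]:
    """Return (lba, block_count) from a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is ignored
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) command descriptor block")
    lba = int.from_bytes(cdb[2:6], "big")     # logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # transfer length in blocks
    return lba, blocks


# CDB copied from the "Read(10):" line in the kernel log.
lba, blocks = decode_read10("28 00 00 1c 28 40 00 00 08 00")
print(lba, blocks)  # 1845312 8
```

The decoded LBA matches the `sector 1845312` reported in the `end_request` line, and 8 blocks of 512 bytes (an assumed logical sector size) account for the 4096-byte DMA transfer shown in the `cmd` line.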

My first instinct was then to have the drive test itself using its SMART (Self-Monitoring, Analysis and Reporting Technology) tools. I ran an extended self-test, which yielded suspiciously positive results.

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      4874         -
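For reference, a test like this can be started and read back with smartctl from smartmontools, roughly as sketched below. The device path `/dev/sdb` is taken from the kernel log; the function names and the regular expression are illustrative assumptions, and the column layout of the self-test log may differ between smartctl versions.

```python
import re
import subprocess

DEVICE = "/dev/sdb"  # assumption: the SSD, as named in the kernel log

# Assumed row layout of the self-test log, modelled on the row shown above.
SELFTEST_ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<description>.+?)\s{2,}(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+(?P<lifetime_hours>\d+)\s+(?P<lba>\S+)\s*$"
)


def start_extended_test(device: str = DEVICE) -> None:
    """Ask the drive to begin an extended (long) self-test; usually needs root."""
    subprocess.run(["smartctl", "-t", "long", device], check=True)


def read_selftest_log(device: str = DEVICE) -> str:
    """Return the text of the drive's self-test log."""
    # smartctl's exit status is a bitmask, so a nonzero code is not treated as fatal.
    result = subprocess.run(
        ["smartctl", "-l", "selftest", device],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def parse_selftest_row(line: str):
    """Parse one row of the self-test log; return None if it does not match."""
    match = SELFTEST_ROW.match(line.strip())
    if match is None:
        return None
    lba = match["lba"]
    return {
        "num": int(match["num"]),
        "description": match["description"],
        "status": match["status"],
        "remaining_percent": int(match["remaining"]),
        "lifetime_hours": int(match["lifetime_hours"]),
        "lba_of_first_error": None if lba == "-" else int(lba),
    }
```

An extended test runs in the background on the drive itself, so the log is only worth reading once the estimated duration that smartctl prints has elapsed.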

I decided to probe further. smartctl’s output also includes a table of attributes that serve as indicators of a drive’s health:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0003   100   100   000    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0002   100   100   000    Old_age   Always       -       2664
 12 Power_Cycle_Count       0x0003   100   100   000    Pre-fail  Always       -       2577
170 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       2
171 Unknown_Attribute       0x0003   100   100   000    Pre-fail  Always       -       0
172 Unknown_Attribute       0x0003   100   100   000    Pre-fail  Always       -       0
173 Unknown_Attribute       0x0003   100   100   000    Pre-fail  Always       -       102815
174 Unknown_Attribute       0x0003   100   100   000    Pre-fail  Always       -       59
175 Program_Fail_Count_Chip 0x0003   000   000   000    Pre-fail  Always       -       0
176 Erase_Fail_Count_Chip   0x0003   100   100   000    Pre-fail  Always       -       0
177 Wear_Leveling_Count     0x0003   100   100   000    Pre-fail  Always       -       102815
178 Used_Rsvd_Blk_Cnt_Chip  0x0003   100   100   000    Pre-fail  Always       -       1
179 Used_Rsvd_Blk_Cnt_Tot   0x0003   000   000   000    Pre-fail  Always       -       2
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   100   100   000    Pre-fail  Always       -       784
181 Program_Fail_Cnt_Total  0x0003   100   100   000    Pre-fail  Always       -       0
182 Erase_Fail_Count_Total  0x0003   100   100   000    Pre-fail  Always       -       0
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   001   001   000    Pre-fail  Always       -       26018192
187 Reported_Uncorrect      0x0003   100   100   000    Pre-fail  Always       -       2
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       19
192 Power-Off_Retract_Count 0x0003   100   100   000    Pre-fail  Always       -       59
195 Hardware_ECC_Recovered  0x0003   100   100   000    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0003   100   100   000    Pre-fail  Always       -       2
198 Offline_Uncorrectable   0x0003   100   100   000    Pre-fail  Always       -       1
199 UDMA_CRC_Error_Count    0x0003   100   100   000    Pre-fail  Always       -       0
232 Available_Reservd_Space 0x0003   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0003   100   100   000    Pre-fail  Always       -       6425
241 Total_LBAs_Written      0x0003   100   100   000    Pre-fail  Always       -       214612
242 Total_LBAs_Read         0x0003   100   100   000    Pre-fail  Always       -       214201

This is a rather large and nasty table, and Plextor doesn’t seem to have implemented meaningful thresholds (nearly all of them are zero). Nonetheless, raw counts were available. The first alarm bell was attribute 184, End-to-End Error (which sounds terrible); that had apparently happened over 26 million times. Some sources suggest this attribute is critical while others do not. I don’t have historical data on how this figure progressed, so it’s difficult to draw conclusions about how this happened or, if the sources calling it critical are correct, how the disk limped to this point in the first place.

There were other negative indicators as well: attributes 187 (Reported_Uncorrect), 188 (Command_Timeout) and 198 (Offline_Uncorrectable), which have all been associated with disk failures, were above zero. Several other ATA errors appeared in the smartctl output as well.
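Scanning a table like this by eye is error-prone, so here is a minimal sketch of the same check in code. It assumes the ten-column layout shown above and an integer in the RAW_VALUE column; the watch list is simply the set of attribute IDs discussed in this post, not an authoritative list of failure predictors, and the helper names are hypothetical.

```python
# Attribute IDs singled out in this post; an assumption, not a standard list.
WATCHED_IDS = {5, 184, 187, 188, 197, 198}


def parse_attributes(table: str) -> dict[int, tuple[str, int]]:
    """Map attribute ID -> (name, raw value) from smartctl's attribute table."""
    attrs = {}
    for line in table.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least ten columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            raw = int(fields[9])  # first token of RAW_VALUE
        except ValueError:
            continue  # skip raw values that are not plain integers
        attrs[int(fields[0])] = (fields[1], raw)
    return attrs


def nonzero_watched(attrs: dict[int, tuple[str, int]]) -> dict[int, int]:
    """Return the watched attributes whose raw count is above zero."""
    return {i: raw for i, (_, raw) in attrs.items() if i in WATCHED_IDS and raw > 0}


# A few rows copied from the SSD's table above.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0003   100   100   000    Pre-fail  Always       -       0
184 End-to-End_Error        0x0033   001   001   000    Pre-fail  Always       -       26018192
187 Reported_Uncorrect      0x0003   100   100   000    Pre-fail  Always       -       2
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       19
198 Offline_Uncorrectable   0x0003   100   100   000    Pre-fail  Always       -       1
"""

flagged = nonzero_watched(parse_attributes(SAMPLE))
print(flagged)  # {184: 26018192, 187: 2, 188: 19, 198: 1}
```

Raw values are vendor-specific, so a nonzero count is a prompt to look closer rather than proof of failure.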

For comparison, I ran the diagnostic tools on my HDD:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   062    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   040    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   118   118   033    Pre-fail  Always       -       2
  4 Start_Stop_Count        0x0012   098   098   000    Old_age   Always       -       4642
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   040    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   084   084   000    Old_age   Always       -       7425
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       2574
191 G-Sense_Error_Rate      0x000a   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       32
193 Load_Cycle_Count        0x0012   072   072   000    Old_age   Always       -       283355
194 Temperature_Celsius     0x0002   146   146   000    Old_age   Always       -       41 (Min/Max 10/58)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       1
223 Load_Retry_Count        0x000a   100   100   000    Old_age   Always       -       0

Much better. I looked up 199 just in case, but it seemed fine (just one occurrence, and in any case the super important data has been backed up). Notice that 5 (reallocated sectors), 197 and 198 are all zero.

The current situation is fine (the slower startup times are a little unpleasant, though they would probably have been worse with Windows). I might look into replacing the disk and/or getting a new machine when my budget allows for it (this laptop is nearly 3 years old, so an upgrade would be nice). That said, my usage patterns don’t suggest a higher-end machine is necessary at all, so I might stick with it (Sims 4 isn’t that demanding; the most demanding thing I’ve played is probably Fallout 4, and while I could use a GPU upgrade, that’s clearly a want, not a need). I haven’t quite decided yet.