Recently, I had a lot of difficulty booting into Windows on my home laptop. More specifically, I could boot, but the OS would very quickly become unresponsive. However, I had no such issues booting into Ubuntu (which is where I’m typing this from). I tried booting into Safe Mode and disabling auxiliary services and devices, but to no avail.
Nonetheless, I could use Ubuntu to perform some analysis. I noticed rather unpleasant-looking logs in the Linux message buffers (a.k.a. dmesg).
[ 20.588310] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 20.588311] ata6.00: irq_stat 0x40000001
[ 20.588312] ata6.00: failed command: READ DMA
[ 20.588315] ata6.00: cmd c8/00:08:40:28:1c/00:00:00:00:00/e0 tag 22 dma 4096 in
[ 20.588315] res 51/40:08:40:28:1c/00:00:00:00:00/e0 Emask 0x9 (media error)
[ 20.588316] ata6.00: status: { DRDY ERR }
[ 20.588316] ata6.00: error: { UNC }
[ 20.619619] ieee80211 phy0: Selected rate control algorithm 'iwl-mvm-rs'
[ 20.647929] ata6.00: configured for UDMA/133
[ 20.647935] sd 5:0:0:0: [sdb]
[ 20.647936] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 20.647937] sd 5:0:0:0: [sdb]
[ 20.647938] Sense Key : Medium Error [current] [descriptor]
[ 20.647939] Descriptor sense data with sense descriptors (in hex):
[ 20.647942] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[ 20.647943] 00 1c 28 40
[ 20.647944] sd 5:0:0:0: [sdb]
[ 20.647945] Add. Sense: Unrecovered read error - auto reallocate failed
[ 20.647946] sd 5:0:0:0: [sdb] CDB:
[ 20.647948] Read(10): 28 00 00 1c 28 40 00 00 08 00
[ 20.647949] end_request: I/O error, dev sdb, sector 1845312
[ 20.647955] ata6: EH complete
Hmm. UNC stands for “uncorrectable”, which is not great. These errors were easy to reproduce. Most of the file system still seemed readable when I mounted the SSD in Linux, so I immediately backed up the important data to an external hard drive, though this was probably unnecessary (the really important data is already backed up in the cloud).
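The failing sector in the log can be cross-checked from the log itself: the Read(10) command bytes carry the logical block address that the kernel then reports as the bad sector. Below is a minimal Python sketch of that check, assuming the standard READ(10) layout (opcode 0x28, a 4-byte big-endian address in bytes 2 to 5, a 2-byte block count in bytes 7 to 8). The helper names read10_lba and failed_sectors are hypothetical, and the two log lines are copied from the output above.

```python
import re

# Two lines copied verbatim from the dmesg output above.
DMESG = """\
[ 20.647948] Read(10): 28 00 00 1c 28 40 00 00 08 00
[ 20.647949] end_request: I/O error, dev sdb, sector 1845312
"""


def read10_lba(cdb_hex):
    """Return (lba, block_count) decoded from a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between byte pairs is ignored
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) command block")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, count


def failed_sectors(log):
    """Return (device, sector) pairs for every I/O error line in the log."""
    pattern = r"I/O error, dev (\w+), sector (\d+)"
    return [(dev, int(sector)) for dev, sector in re.findall(pattern, log)]


cdb_text = re.search(r"Read\(10\): ((?:[0-9a-f]{2} ?)+)", DMESG).group(1)
lba, count = read10_lba(cdb_text)
print(lba, count, failed_sectors(DMESG))
```

The decoded address agrees with the sector in the end_request line, and the block count of 8 is consistent with the 4096-byte transfer in the ata6.00 cmd line if the drive uses 512-byte logical sectors (an assumption, since the log does not state the sector size).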
My first instinct was then to test the drive using its SMART (Self-Monitoring, Analysis and Reporting Technology) tools. I started with an extended “self-test”, which yielded suspiciously positive results.
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      4874         -
I decided to probe further. The tool’s output includes a table of various attributes which can be metrics of a drive’s health:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0003   100   100   000    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0002   100   100   000    Old_age   Always       -       2664
 12 Power_Cycle_Count       0x0003   100   100   000    Pre-fail  Always       -       2577
170 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       2
171 Unknown_Attribute       0x0003   100   100   000    Pre-fail  Always       -       0
172 Unknown_Attribute       0x0003   100   100   000    Pre-fail  Always       -       0
173 Unknown_Attribute       0x0003   100   100   000    Pre-fail  Always       -       102815
174 Unknown_Attribute       0x0003   100   100   000    Pre-fail  Always       -       59
175 Program_Fail_Count_Chip 0x0003   000   000   000    Pre-fail  Always       -       0
176 Erase_Fail_Count_Chip   0x0003   100   100   000    Pre-fail  Always       -       0
177 Wear_Leveling_Count     0x0003   100   100   000    Pre-fail  Always       -       102815
178 Used_Rsvd_Blk_Cnt_Chip  0x0003   100   100   000    Pre-fail  Always       -       1
179 Used_Rsvd_Blk_Cnt_Tot   0x0003   000   000   000    Pre-fail  Always       -       2
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   100   100   000    Pre-fail  Always       -       784
181 Program_Fail_Cnt_Total  0x0003   100   100   000    Pre-fail  Always       -       0
182 Erase_Fail_Count_Total  0x0003   100   100   000    Pre-fail  Always       -       0
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   001   001   000    Pre-fail  Always       -       26018192
187 Reported_Uncorrect      0x0003   100   100   000    Pre-fail  Always       -       2
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       19
192 Power-Off_Retract_Count 0x0003   100   100   000    Pre-fail  Always       -       59
195 Hardware_ECC_Recovered  0x0003   100   100   000    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0003   100   100   000    Pre-fail  Always       -       2
198 Offline_Uncorrectable   0x0003   100   100   000    Pre-fail  Always       -       1
199 UDMA_CRC_Error_Count    0x0003   100   100   000    Pre-fail  Always       -       0
232 Available_Reservd_Space 0x0003   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0003   100   100   000    Pre-fail  Always       -       6425
241 Total_LBAs_Written      0x0003   100   100   000    Pre-fail  Always       -       214612
242 Total_LBAs_Read         0x0003   100   100   000    Pre-fail  Always       -       214201
This is a rather large and nasty table, and it didn’t seem that Plextor had implemented meaningful thresholds (almost all of them are zero). Nonetheless, the raw counts were available. The first alarm bell was attribute 184, End-to-End Error (which sounds terrible); that had apparently happened over 26 million times. Some sources suggest this attribute is critical while others do not. I don’t have historical data on how this figure progressed, so it is difficult to draw conclusions about how it got this high or, if the sources calling it critical are correct, how the disk limped to this point in the first place.
There were other negative indicators as well: attributes 187, 188 and 198, which have been associated with disk failures, were all above zero (2, 19 and 1 respectively). Several other ATA errors appeared in the smartctl output too.
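Scanning a table like this by eye is error-prone, so the check can be automated. The following is a minimal Python sketch, not a definitive health check: it assumes the attribute rows are available as text (for example, from smartctl -A on the device), and the watch list is simply the set of attribute IDs discussed in this post, not an authoritative list of failure predictors. The names parse_attributes and nonzero_watched are hypothetical; the sample rows are copied from the SSD table above.

```python
# A subset of rows copied from the SSD attribute table above.
SMART_ROWS = """\
  5 Reallocated_Sector_Ct   0x0003   100   100   000    Pre-fail  Always       -       0
184 End-to-End_Error        0x0033   001   001   000    Pre-fail  Always       -       26018192
187 Reported_Uncorrect      0x0003   100   100   000    Pre-fail  Always       -       2
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       19
198 Offline_Uncorrectable   0x0003   100   100   000    Pre-fail  Always       -       1
199 UDMA_CRC_Error_Count    0x0003   100   100   000    Pre-fail  Always       -       0
"""

# Assumption: the IDs this post treats as worrying when their raw value is non-zero.
WATCHED_IDS = {5, 184, 187, 188, 197, 198}


def parse_attributes(text):
    """Map attribute ID -> (name, raw value) for each well-formed table row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row has at least 10 columns and starts with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            raw = int(fields[9])  # first token of RAW_VALUE, e.g. "41" in "41 (Min/Max 10/58)"
        except ValueError:
            continue  # skip raw values that are not plain integers
        attrs[int(fields[0])] = (fields[1], raw)
    return attrs


def nonzero_watched(attrs):
    """Return the watched attributes whose raw value is above zero, ordered by ID."""
    return {i: attrs[i] for i in sorted(WATCHED_IDS) if i in attrs and attrs[i][1] > 0}


flagged = nonzero_watched(parse_attributes(SMART_ROWS))
for attr_id, (name, raw) in flagged.items():
    print(attr_id, name, raw)
```

On the sample rows this flags 184, 187, 188 and 198 and leaves 5 alone. Raw values are vendor-specific, so a non-zero count is a prompt to look closer, not proof of failure.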
For comparison, I ran the diagnostic tools on my HDD:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   062    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   040    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   118   118   033    Pre-fail  Always       -       2
  4 Start_Stop_Count        0x0012   098   098   000    Old_age   Always       -       4642
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   040    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   084   084   000    Old_age   Always       -       7425
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       2574
191 G-Sense_Error_Rate      0x000a   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       32
193 Load_Cycle_Count        0x0012   072   072   000    Old_age   Always       -       283355
194 Temperature_Celsius     0x0002   146   146   000    Old_age   Always       -       41 (Min/Max 10/58)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       1
223 Load_Retry_Count        0x000a   100   100   000    Old_age   Always       -       0
Much better. I looked up 199 just in case, but it seemed fine (just one occurrence; and in any case the super important data has been backed up). Notice 197, 198 and 5 (bad sector reallocations) are all zero.
The current situation is fine (the slower startup times are a little unpleasant, though would probably have been worse with Windows). I might investigate replacing the disk and/or getting a new machine (this laptop is reaching 3 years, so an upgrade would be nice) when my budget allows for it. That said, my usage patterns don’t seem to suggest a higher-end machine is necessary at all, so I might stick with it (Sims 4 isn’t that demanding; the most demanding thing I played would probably be Fallout 4 and while I could use a GPU upgrade, that’s clearly a want, not a need). I haven’t quite decided yet.