Diagnosing a Faulty Disk

Recently, I had a lot of difficulty booting into Windows on my home laptop. More specifically, I could boot, but the OS would very quickly become unresponsive. However, I had no such issues booting into Ubuntu (which I’m typing this from). I tried booting into Safe Mode and disabling auxiliary services and devices, but to no avail.

Still, I could use Ubuntu to perform some analysis. I noticed some rather unpleasant-looking entries in the kernel message buffer (viewed with dmesg).

Hmm, UNC stands for uncorrectable; not great. These errors were easy to reproduce. Most of the file system still seemed readable when I mounted the SSD in Linux, so I immediately backed up the important data to an external hard drive, though this was probably unnecessary (the really important data is already backed up in the cloud).
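For anyone wanting to pick such entries out of a long kernel log, here is a minimal sketch. It assumes the log has already been saved as text (for example by redirecting dmesg to a file) and that the flag appears as the standalone token UNC; the function name `find_unc_lines` is my own invention for illustration.

```python
import re
from typing import List

# Matches "UNC" only as a whole word, so words that merely contain it are ignored.
UNC_PATTERN = re.compile(r"\bUNC\b")


def find_unc_lines(log_text: str) -> List[str]:
    """Return the lines of a kernel log that mention the UNC flag."""
    return [line for line in log_text.splitlines() if UNC_PATTERN.search(line)]
```

Calling `find_unc_lines` on the contents of the saved file returns just the matching lines, which makes it easier to see whether the errors recur when the drive is read again.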

My first instinct was then to test the drive using its SMART (Self-Monitoring, Analysis and Reporting Technology) tools. I ran an extended self-test, which yielded suspiciously positive results.
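As a rough sketch of how an extended self-test can be driven from a script, assuming smartctl (from smartmontools) is installed and run with root privileges: the helper names `selftest_commands` and `run` are hypothetical, and the device path passed in is a placeholder to replace with the actual drive.

```python
import subprocess
from typing import List


def selftest_commands(device: str) -> List[List[str]]:
    """Build the smartctl invocations for an extended self-test on `device`."""
    return [
        ["smartctl", "-t", "long", device],      # start the extended (long) self-test
        ["smartctl", "-l", "selftest", device],  # print the self-test log afterwards
    ]


def run(command: List[str]) -> str:
    """Run one command and return its standard output as text."""
    # smartctl uses its exit status as a bitmask, so a non-zero code is not
    # treated as a failure here.
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.stdout
```

The test itself runs on the drive in the background and can take a while, so the second command only shows a result once the first has had time to finish.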

I decided to probe further. The tool’s output includes a table of attributes that can serve as metrics of a drive’s health:

This is a rather large and nasty table, and Plextor doesn’t seem to have implemented any meaningful threshold values. Raw counts were available, though. The first alarm bell was attribute 184, End-to-End Error (which sounds terrible); that had apparently happened over 26 million times. Some sources suggest this attribute is critical while others do not. I don’t have historical data on how this figure progressed, so it is difficult to draw conclusions as to how this happened or, if the sources calling it critical are correct, how the disk limped to this point in the first place.

There were other negative indicators as well: attributes 187 (Reported Uncorrectable Errors), 188 (Command Timeout) and 198 (Offline Uncorrectable), which have been associated with disk failures, were all notably above zero. Several other ATA errors appeared in the smartctl output too.

For comparison, I ran the diagnostic tools on my HDD:

Much better. I looked up 199 (the UDMA CRC error count) just in case, but it seemed fine (just one occurrence, and in any case the super important data has been backed up). Notice that 197, 198 and 5 (bad sector reallocations) are all zero.
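The manual check of raw counts described above could be scripted roughly as follows. This is a sketch under the assumption that the attribute table has the usual ten-column layout printed by `smartctl -A`, with the numeric ID first and the raw value in the tenth column; `WATCHED_IDS`, `parse_raw_values` and `nonzero_watched` are hypothetical names, and the watched set is simply the attributes discussed in this post, not an authoritative list.

```python
from typing import Dict, List

# Attribute IDs this post treats as worrying when their raw count is above zero.
WATCHED_IDS = {5, 184, 187, 188, 197, 198}


def parse_raw_values(table_text: str) -> Dict[int, int]:
    """Map attribute ID to raw value for each row of an attribute table."""
    raw_values: Dict[int, int] = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Attribute rows have at least ten columns and start with a numeric ID;
        # headers and blank lines are skipped.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        # The raw value is the tenth column; rows where it is not a plain
        # integer are ignored rather than guessed at.
        if fields[9].isdigit():
            raw_values[int(fields[0])] = int(fields[9])
    return raw_values


def nonzero_watched(raw_values: Dict[int, int]) -> List[int]:
    """Return the watched attribute IDs whose raw value is above zero, sorted."""
    return sorted(i for i in WATCHED_IDS if raw_values.get(i, 0) > 0)
```

Running both drives’ tables through `parse_raw_values` and comparing the results of `nonzero_watched` gives the same side-by-side picture as reading the tables by eye. Since vendors interpret raw values differently, a non-empty result is a prompt to look closer rather than a verdict.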

The current situation is fine (the slower startup times are a little unpleasant, though they would probably have been worse with Windows). I might look into replacing the disk and/or getting a new machine when my budget allows for it (this laptop is approaching three years old, so an upgrade would be nice). That said, my usage patterns don’t suggest a higher-end machine is necessary at all, so I might stick with it (Sims 4 isn’t that demanding; the most demanding thing I’ve played is probably Fallout 4, and while I could use a GPU upgrade, that’s clearly a want, not a need). I haven’t quite decided yet.

