Fighting an Instability Issue with a Windows Gaming PC
February 23, 2020
Recently, using the bones of a friend's gaming PC (namely the case, motherboard, PSU, and a couple of HDDs), I built a new gaming PC. Unfortunately, after putting it all together I've suffered from a number of strange instability issues that have been difficult to track down. This is a glimpse into the rabbit hole I've been down this past week or so.
I've been pulling apart and putting together computers for the best part of my life but this build was probably the trickiest I've ever done. A combination of poor planning, using parts I wasn't familiar with 1, and a CPU that wasn't compatible ‘out of the box’ with my motherboard 2 made this a hellishly lengthy and painful process.
Even after the build was complete, the fun and tinkering didn't stop. While playing games that were barely demanding (such as Heroes of the Storm) it crashed or hung a number of times 3, once even on a menu screen. I also had trouble with the computer becoming unresponsive when dragging or resizing windows, or just idling in Steam Big Picture mode 4.
I initially wondered if this was a temperature problem due to the rather awkward positioning of my case, so I installed some additional case fans 5 and started keeping an eye on temperatures with Libre Hardware Monitor 6. The temperatures I was seeing didn't really point to a cooling problem and, as you might expect, the additional cooling didn't seem to have much of an impact.
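For anyone who wants to do a similar sanity check with numbers rather than by eye, here is a minimal Python sketch that summarises a series of temperature readings against a limit. The sample readings, the 85°C limit, and the function name are all made up for illustration; none of them come from Libre Hardware Monitor or from the measurements described in this post.

```python
from statistics import mean


def summarise_temps(readings_c, limit_c=85.0):
    """Summarise temperature readings (in °C) against an assumed limit."""
    if not readings_c:
        raise ValueError("need at least one reading")
    peak = max(readings_c)
    return {
        "min": min(readings_c),
        "mean": round(mean(readings_c), 1),
        "max": peak,
        # How far the hottest reading stayed below the assumed limit.
        "headroom": round(limit_c - peak, 1),
        # Number of readings at or above the assumed limit.
        "over_limit": sum(1 for t in readings_c if t >= limit_c),
    }


# Hypothetical samples, not real measurements from this build.
sample = [41.0, 55.5, 68.0, 72.5, 70.0]
summary = summarise_temps(sample)
print(summary)
```

If the peak stays comfortably below whatever limit applies to your hardware and nothing is counted as over the limit, cooling is probably not the first place to look.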
At this point I started suspecting a hardware flaw, so I decided to run stress tests to identify which part might be failing. These tests push a system to its limits and check for instability or incorrect calculations. I'm not particularly clued up on which stress tests are most useful (there seem to be a lot of complaints about how useful Prime95 is, for instance), so I decided to run them all and see what they'd throw up.
The first candidate I aimed my guns at was the RAM, simply because it was relatively easy to isolate and there were a lot of tools for the job. I ran the Prime95 test, which stresses both RAM and CPU; the Windows Memory Diagnostic, which solely checks RAM; and MemTest86 7. I also threw in RealBench on the off chance it was the CPU, not that I considered it likely.
Given the error messages (pointing to an error communicating with the graphics card) and the lack of success running tests so far, I decided that the flaw was likely to be in the motherboard, the graphics card, or the PSU, and focused my efforts there. I moved on to using FurMark to test the GPU and bought a copy of 3DMark as well, given it was billed as a more realistic workload. Neither the Time Spy nor the Fire Strike benchmarks and stress tests proved helpful there. After pushing the GPU to its limit and thinking back on the errors, it seemed more likely that the issue was software related or a flaky motherboard, possibly a bad PCI Express lane.
I decided on a whim to check whether there was a BIOS update for my motherboard, the MSI Tomahawk B350, and lo and behold it appeared there was. There was a BETA BIOS marked as being released around Christmas, after I built my PC. However, I couldn't track down a BIOS release that seemed to match the version I had installed (7A34v1OM). The latest BIOS listed seemed to be newer (7A34v1OR) but I wasn't sure it was worth the risk. Updating a BIOS is very hard to undo if it goes wrong (typically your PC won't boot), so it's generally advised not to try it unless you think it's going to help. The notes for the new BIOS mention “Improved PCI-E device compatibility”, which sounded like it might help, if a bit of a long shot, so I figured I'd give it a go.
To update the BIOS I had to use MSI's M-Flash 8, which didn't feel like a rock solid procedure. When you enter flash mode it's meant to display something that resembles the name of the USB device you're using. Instead mine appeared as <null string>, which isn't the most reassuring display! I figured it was likely a small glitch with the BETA BIOS I was using and braved it, eventually flashing my new BIOS.
I tried to cause a crash with the new BIOS by leaving Heroes of the Storm running in the background but didn't have much luck. Looking back at the Heroes of the Storm error logs, there had been a few days between each incident, so I didn't feel 100% sure the BIOS update had fixed anything. On a whim, I decided to look at the Windows Event Log to see if any errors were being logged there. This hunch paid off and I spotted a recurring item from “WHEA-Logger” which seemed to be a PCI Express error. A result on Google suggested trying to match the error codes in the log to devices in Device Manager. Whilst I wasn't able to get a good match, it was clear from the tree view that my PCI devices seemed to be poorly arranged. On another hunch, I decided to see if there were new drivers for my motherboard, and there were! After I installed the AMD Chipset Driver and restarted, I haven't seen any of those log entries since.
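If you'd rather check for the WHEA-Logger entries described above from a script than click through Event Viewer, here is a minimal Python sketch that shells out to Windows' built-in wevtutil tool. It assumes the events are recorded in the System log under the provider name Microsoft-Windows-WHEA-Logger; the function names and the default of 20 events are my own choices for illustration, and this is not the method used in this post.

```python
import subprocess
import sys

# Assumed provider name for WHEA-Logger events in the System log.
WHEA_PROVIDER = "Microsoft-Windows-WHEA-Logger"


def build_whea_query(max_events=20):
    """Build a wevtutil command that lists the newest WHEA-Logger events."""
    xpath = f"*[System[Provider[@Name='{WHEA_PROVIDER}']]]"
    return [
        "wevtutil", "qe", "System",  # query events from the System log
        f"/q:{xpath}",               # filter by provider name
        f"/c:{max_events}",          # cap the number of events returned
        "/rd:true",                  # reverse direction: newest first
        "/f:text",                   # human-readable text output
    ]


def recent_whea_events(max_events=20):
    """Run the query (Windows only) and return wevtutil's raw text output."""
    result = subprocess.run(
        build_whea_query(max_events),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    if sys.platform == "win32":
        print(recent_whea_events() or "No WHEA-Logger events found.")
    else:
        print("wevtutil is only available on Windows.")
```

Running it before and after a driver or BIOS change gives a quick way to see whether the errors are still being logged.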
I'm not really sure what the lesson is here - possibly “Don't use incompatible motherboards that rely on a BETA BIOS”, or possibly “Don't assume it's a hardware fault without double checking the software side”, or even more likely “Don't imagine it's a cooling problem without data”.
At least it made for an interesting blog post!
A bunch of the parts I got from my friend were poorly or wrongly identified in the original message. For a long time I thought my case was an NZXT H440 Elite. In reality there is no such case; it was the NZXT S340 Elite. ↩︎
The Ryzen 3000 series (like my Ryzen 5 3600X) was physically supported by my motherboard, the MSI Tomahawk B350, but required a BETA BIOS update. In order to update it, I needed to use a CPU that was already supported so I could download and flash the BIOS. This theme comes back later to haunt me! ↩︎
If you're wondering (or trying to Google this problem), the Heroes of the Storm error logs are found in your user profile's Documents folder, for example arranf/Documents/Heroes Of the Storm/GameLogs/. Inside that there are date-stamped error folders. In there I found my error: e_gfxErrorAPIError- Graphics device reboot queued to d3d failure. ↩︎
This error was labelled as
I opted for be quiet! Silent Wings 3 fans. I had a stupid mix-up and now I have 2x 120mm and 2x 140mm case fans instead of the typical 1x 120mm and 3x 140mm. I don't think this will make a huge difference long term. If anything, looking at the temperatures, I need to get a radiator and better CPU cooling! ↩︎
Looking online, people were recommending Open Hardware Monitor for checking temperatures but I wasn't able to read the temperature of my CPU using it. It appears the maintainer no longer works on it; instead there's a newer fork, Libre Hardware Monitor, which supports my Ryzen CPU as well as a ton of other device information I didn't want or care about! ↩︎
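As a footnote to the footnotes: if you want to find which Heroes of the Storm log folders contain the graphics error mentioned above, a minimal Python sketch like the following can search them for you. The exact file names inside the date-stamped folders aren't described in this post, so the sketch simply walks every file under the GameLogs folder. The function name and the default search string are my own choices for illustration.

```python
from pathlib import Path


def find_error_logs(log_root, needle="e_gfxErrorAPIError"):
    """Return the files under log_root whose text contains needle."""
    root = Path(log_root)
    if not root.is_dir():
        return []
    matches = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            # Ignore undecodable bytes: log encodings are an unknown here.
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # skip files that cannot be read
        if needle in text:
            matches.append(path)
    return matches


if __name__ == "__main__":
    # Location as described in the footnote above; adjust for your profile.
    logs = Path.home() / "Documents" / "Heroes Of the Storm" / "GameLogs"
    for hit in find_error_logs(logs):
        print(hit)
```

Because the folders are date stamped, the list of matching paths also gives a rough idea of how far apart the incidents were.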