I put my computer to sleep every night so the next morning I can quickly resume work right where I left off.
But every 2-3 weeks the PC would reboot while waking from S3-sleep. My database and web-server VMs would not get a chance to shut down cleanly. It was a problem.
I tried the common advice from various forums. I spent weeks swapping out RAM, burning in DIMMs with memtest86, switching operating systems (various Ubuntus, Windows 7 and 8.1), toggling ACPI, tweaking voltages, underclocking, trying every memory profile permutation I could think of, changing drivers, and even contacting my motherboard’s (Gigabyte) tech support (a last-ditch effort for a programmer). Normally one of these methods is enough, as BSODs are often the result of unstable overclocking or bad memory timings. But this was something else.
After eliminating the obvious suspects, I resorted to desperate measures.
Disabling BSOD-induced Reboots
First thing, disable auto-reboot and enable coredumps (aka minidumps). This way there is something to analyze.
Go to System (Win + Break) > Advanced System Settings > Startup and Recovery >
Disable “Automatically restart”. Ensure minidumps are enabled by selecting either “Kernel memory dump” or “Small memory dump”.
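If you would rather script this than click through dialogs, the same two settings can be written out as a .reg file. This is only a rough sketch and not part of my original troubleshooting: it assumes the settings live under the CrashControl key as the DWORD values AutoReboot and CrashDumpEnabled, with 2 meaning kernel dump and 3 meaning small dump. The helper name crash_control_reg is made up for illustration.

```python
# Sketch: build .reg file text that turns off automatic restart and picks a dump type.
# Assumption: both settings are DWORD values under the CrashControl key.
CRASH_CONTROL_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl"

# Assumed CrashDumpEnabled values: 2 = kernel memory dump, 3 = small memory dump.
DUMP_TYPES = {"kernel": 2, "small": 3}


def crash_control_reg(dump_type: str = "small") -> str:
    """Return .reg file text that disables auto-restart and enables a dump."""
    values = {
        "AutoReboot": 0,  # 0 = stay on the blue screen instead of rebooting
        "CrashDumpEnabled": DUMP_TYPES[dump_type],
    }
    lines = ["Windows Registry Editor Version 5.00", "", f"[{CRASH_CONTROL_KEY}]"]
    lines += [f'"{name}"=dword:{value:08x}' for name, value in values.items()]
    return "\n".join(lines) + "\n"


# Print the text for review; nothing is written to the registry here.
print(crash_control_reg("small"))
```

Save the output to a .reg file and import it from an elevated prompt if it looks right. The dialog above remains the safer route if you are unsure.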
Wait until there is a crash or two, so you have something to look at. If your screens are blank and no BSOD appears, the memory dumps will provide the insight.
BlueScreenView, a quick and easy tool by NirSoft, reveals the general cause of the last few system crashes. In this case it’s Bug Check 0x116, aka TDR, aka my video card failing to recover from sleep.
*Optional* Alternatives to BlueScreenView
You could also upload a minidump to the wonderful online analyzer built by OSRonline. Or, if feeling particularly adventurous, perform a manual WinDbg analysis with the help of a guide.
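For a quick look without installing anything, the bug check code can also be pulled straight out of the dump header. The sketch below is an assumption-heavy illustration, not what any of the tools above do internally: it assumes a 64-bit dump that starts with the signature PAGEDU64 and stores the bug check code at offset 0x38, followed by four 64-bit parameters at 0x40. The function names and the tiny lookup table are mine.

```python
import struct

# Assumed layout of a 64-bit Windows crash dump header:
# "PAGEDU64" at offset 0, BugCheckCode (32-bit) at 0x38,
# then four 64-bit bug check parameters starting at 0x40.
BUGCHECK_OFFSET = 0x38
PARAMS_OFFSET = 0x40
HEADER_BYTES = PARAMS_OFFSET + 4 * 8

KNOWN_CODES = {
    0x116: "VIDEO_TDR_FAILURE",
    0x117: "VIDEO_TDR_TIMEOUT_DETECTED",
}


def read_bugcheck(header: bytes):
    """Return (code, name, params) parsed from the first bytes of a 64-bit dump."""
    if header[0:8] != b"PAGEDU64":
        raise ValueError("not a 64-bit Windows crash dump")
    (code,) = struct.unpack_from("<I", header, BUGCHECK_OFFSET)
    params = struct.unpack_from("<4Q", header, PARAMS_OFFSET)
    return code, KNOWN_CODES.get(code, "unknown"), params


def read_bugcheck_file(path: str):
    """Read just enough of a .dmp file to parse its header."""
    with open(path, "rb") as dump:
        return read_bugcheck(dump.read(HEADER_BYTES))
```

Point read_bugcheck_file at one of the .dmp files Windows wrote (small dumps usually land in a Minidump folder under the Windows directory). If the signature check fails, fall back to the real tools.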
Results
I learned 3 things:
- The reboot coming out of S3-sleep was caused by Windows 8.1’s default option of “Automatically restart on crash.” We can disable it.
- The memory dumps revealed the source of the crash to be “Bug Check 0x116” — aka the video card hung and failed to recover.
- We weren’t seeing the BSODs because the video card crashed!
We can configure Windows to allow the video-card to fail unrecoverably without resulting in a BSOD or reboot. You won’t have a desktop to work with, but that’s okay.. we can remote in.. :)
Disabling BSOD after GPU TDR failure
Once you know TDR is at fault, tweak the TDR registry keys (under HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers) and give your PC a chance to survive the hardware or driver failure:
- TdrDebugMode: TDR_DEBUG_MODE_RECOVER_UNCONDITIONAL
- TdrDdiDelay: 25
- TdrDelay: 7
- TdrLevel: 3
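Here is a small sketch that turns those four settings into reg.exe commands you can review before running. Two assumptions to flag: the values are stored as REG_DWORD under the GraphicsDrivers key, and TDR_DEBUG_MODE_RECOVER_UNCONDITIONAL corresponds to the number 3. The helper name reg_add_commands is hypothetical.

```python
# Sketch: print `reg add` commands for the TDR settings listed above.
# Assumption: all four are REG_DWORD values under the GraphicsDrivers key.
GRAPHICS_DRIVERS_KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"

TDR_SETTINGS = {
    "TdrDebugMode": 3,  # assumed numeric value of TDR_DEBUG_MODE_RECOVER_UNCONDITIONAL
    "TdrDdiDelay": 25,
    "TdrDelay": 7,
    "TdrLevel": 3,
}


def reg_add_commands(key, settings):
    """Return one `reg add` argument list per setting (nothing is executed)."""
    return [
        ["reg", "add", key, "/v", name, "/t", "REG_DWORD", "/d", str(value), "/f"]
        for name, value in settings.items()
    ]


# Dry run: show the commands instead of touching the registry.
for command in reg_add_commands(GRAPHICS_DRIVERS_KEY, TDR_SETTINGS):
    print(" ".join(command))
```

Paste the printed lines into an elevated command prompt, then reboot so the graphics stack picks up the new values.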
After these tweaks, my ATI Radeon HD5850 didn’t cause the system to reboot. It merely corrupted the screen.
The webpage I was designing continued to function on my phone. This meant the web-server VM inside the PC was functioning. Though my screens were corrupt, the OS was still running!
New lesson learned: these crashes were isolated to the GPU.
Taking back the PC
GNU/Linux is easy to work with. It treats you like a king. I sshed into the web-server guest in my PC and shut it down.
Windows machines, on the other hand, aren’t that easy to work with. By default all meaningful access is revoked. Even with your admin-level account, there is no ssh server to connect to, the samba shares only let you write files, and RPC/WMI is locked down. You are treated like you don’t own the machine.
But we can change that with some instructions borrowed from MSFT:
1) Disable Remote UAC Restrictions.
By default, your user account loses Administrative privileges when connecting remotely.
Override this by creating a DWORD registry value named LocalAccountTokenFilterPolicy, set to 1, in HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\Policies\System
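The same change can be made from Python’s standard winreg module. Treat this as a sketch under assumptions: the function name and the dry_run switch are mine, and the write path only works on Windows from an elevated process. By default it only reports what it would do.

```python
# Sketch: set LocalAccountTokenFilterPolicy, with a dry run as the default.
POLICY_KEY = r"Software\Microsoft\Windows\CurrentVersion\Policies\System"
VALUE_NAME = "LocalAccountTokenFilterPolicy"


def allow_full_remote_admin_token(enabled: bool, dry_run: bool = True) -> str:
    """Set the policy value to 1 (enabled) or 0; return a description of the action."""
    data = 1 if enabled else 0
    description = f"HKLM\\{POLICY_KEY}\\{VALUE_NAME} = {data} (REG_DWORD)"
    if dry_run:
        return "would set " + description
    import winreg  # Windows-only, so imported only when actually writing

    with winreg.CreateKeyEx(
        winreg.HKEY_LOCAL_MACHINE, POLICY_KEY, 0, winreg.KEY_SET_VALUE
    ) as key:
        winreg.SetValueEx(key, VALUE_NAME, 0, winreg.REG_DWORD, data)
    return "set " + description


print(allow_full_remote_admin_token(True))
```

Call it with dry_run=False once you are happy with the printed target. Passing enabled=False puts the default remote restriction back.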
2) Option A: Enable Remote Desktop
Enable RDP using these instructions.
RDPing into a GPU-crashed PC is possible. After you RDP in, try to open up the task manager and kill any running DWM.exe instances and restart explorer.exe. I’ve seen this bring the PC desktop back to life!
Some applications will not function properly; Photoshop may have to be killed, but OneNote should work after restarting explorer.
or Option B: Enable Remote Console via Built-in Telnet Server
There are several options for connecting to your graphics-card-disabled Windows PC via console, mostly awful. You can try compiling WEF or using impacket’s psexec.py to “break” into your machine using your existing login to remotely execute commands and create a new Windows Service that allows System-Level terminal-like access to your machine. You will want to clean up afterwards by removing the created service and deleting temporarily created executables (typically in your c:\Windows folder). This is not for the faint-hearted.
The slightly easier option is setting up a legitimate service that comes with Windows:
Enabling Telnet Server
Open up Windows Features, check the Telnet Server box:
Grant your user account login-access to Telnet Server
Open up Computer Management (Win + X, G), go to Local Users and Groups > Groups, double-click on TelnetClients, and add your main user.
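If you prefer the command line for these two steps, they can be sketched as commands too. Assumptions worth double-checking on your own machine: that the optional feature is named TelnetServer for DISM, and that the group is called TelnetClients as in the GUI. The username "alice" and the function name are placeholders.

```python
# Sketch: commands to enable the Telnet Server feature and add a user to TelnetClients.
# Assumption: DISM knows the feature as "TelnetServer" on this Windows version.
def telnet_setup_commands(username: str) -> list[list[str]]:
    """Return the argument lists for both setup steps (nothing is executed)."""
    return [
        ["dism", "/online", "/Enable-Feature", "/FeatureName:TelnetServer"],
        ["net", "localgroup", "TelnetClients", username, "/add"],
    ]


# "alice" is a placeholder account name; substitute your own login.
for command in telnet_setup_commands("alice"):
    print(" ".join(command))
```

Both commands need an elevated prompt. If DISM does not recognise the feature name, use the Windows Features dialog instead.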
Grant Telnet System-Level Access
Allow your user account to perform more adminy things within Telnet by giving Telnet Server itself more privileges.
Add the string “SeTcbPrivilege” to the multi-string value RequiredPrivileges within HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\TlntSvr
Go to Services > Telnet > Log On tab, and make sure “Local System account” is toggled on.
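Because RequiredPrivileges is a multi-string value, the new entry has to be appended without clobbering whatever is already there. A minimal sketch of that read-modify-write, assuming the service key is named TlntSvr as above; the function names are mine, and the registry part only runs on Windows from an elevated process.

```python
# Sketch: append SeTcbPrivilege to the Telnet service's RequiredPrivileges list.
SERVICE_KEY = r"SYSTEM\CurrentControlSet\Services\TlntSvr"

# Assumed command-line equivalent of picking "Local System account" on the Log On tab.
LOGON_COMMAND = ["sc", "config", "TlntSvr", "obj=", "LocalSystem"]


def with_privilege(existing, privilege="SeTcbPrivilege"):
    """Return a copy of the list with `privilege` present exactly once."""
    if any(item.lower() == privilege.lower() for item in existing):
        return list(existing)
    return list(existing) + [privilege]


def grant_tcb_privilege():
    """Read the current multi-string, append the privilege, write it back."""
    import winreg  # Windows-only, so imported only when actually writing

    access = winreg.KEY_QUERY_VALUE | winreg.KEY_SET_VALUE
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, SERVICE_KEY, 0, access) as key:
        try:
            current, _value_type = winreg.QueryValueEx(key, "RequiredPrivileges")
        except FileNotFoundError:
            current = []  # value does not exist yet
        updated = with_privilege(current or [])
        winreg.SetValueEx(key, "RequiredPrivileges", 0, winreg.REG_MULTI_SZ, updated)
    return updated


print(" ".join(LOGON_COMMAND))
```

Running grant_tcb_privilege() twice is harmless since the helper skips duplicates. Restart the Telnet service afterwards so it picks up the change.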
This combination of Local-System account for Telnet + LocalAccountTokenFilterPolicy=1 restores Full Access to your user account while connecting remotely.
Now you can telnet into your GPU-hung Windows box (Windows 8.1 for sure, possibly 7 too) and perform admin tasks like tasklist, taskkill, shutdown, etc. This will let you close programs and shut down the machine in a way that reduces the risk of data loss compared to the default system crash.
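To make that concrete, here is a sketch of the kind of command sequence I mean, generated as plain text so you can type or paste it into the telnet session. The process name Photoshop.exe and the 30-second delay are only examples, and the function name is hypothetical.

```python
# Sketch: build the commands for a tidy remote shutdown over the telnet session.
def graceful_shutdown_plan(process_names, delay_seconds=30):
    """Return commands: list processes, ask each named one to close, then shut down."""
    plan = [["tasklist"]]
    # Without /F, taskkill requests termination instead of forcing it;
    # /T also targets child processes.
    plan += [["taskkill", "/IM", name, "/T"] for name in process_names]
    # /s = shut down, /t = delay in seconds before it happens.
    plan.append(["shutdown", "/s", "/t", str(delay_seconds)])
    return plan


# Example only: the process name is an assumption about what is running.
for command in graceful_shutdown_plan(["Photoshop.exe"]):
    print(" ".join(command))
```

If a program ignores the polite request, repeat its taskkill line with /F added. Shut down the guest VMs first so they get their clean exit.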
If ATI (AMD) provided source-code to its video card firmware and drivers, we could simply fix the bugs at the source and stop these crashes from happening. Let’s keep on dreaming.