Can a faulty graphics card cause OS corruption?

Overview / Preliminary Discussion

RAM is almost certainly what should be blamed in this case.

(In theory, a bad bus (a communication pathway on the motherboard) or a bad CPU could cause such things. In practice, however, bad RAM is far more common than either. The only way to test for those would be to try different RAM chips and find that the same hardware keeps reporting known-good RAM chips as bad. A bad PSU could also lead to certain types of trouble.)

It is not surprising at all that some software may trigger problems more than other software. This is often due to factors such as how much parallel threading a program uses in its design. Games commonly use hardware heavily, which makes them particularly prone to exposing real problems. The problems are often exacerbated by the internal design of the software, and different software creators may use different technical processes, so it is not uncommon for one game to show problems while another similar-looking game doesn't. (What the game looks like, e.g. whether it is a "first person shooter", can be a good basis for guessing whether certain types of problems are likely to be similar, but it is not always a good basis.)

So, other than historical trends of RAM being more likely, why should we be prone to blame bad RAM? We have two reasons.

It matches the experienced problems (very well)

Bad RAM can affect what the computer understands when it reads from files. Worse, bad RAM can affect what the computer thinks should be written to disk, leading to corrupted system files. So this explains your second symptom.
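To make the write-to-disk point concrete, here is a minimal sketch of how a single flipped bit in a memory buffer changes the data that would end up in a file. The `flip_bit` helper and the file contents are hypothetical and used only for illustration; this is not a test for bad RAM, just a simulation of its effect.

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted (simulated RAM error)."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def sha256_hex(data: bytes) -> str:
    """Hex digest used to compare the intended and the actual contents."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical stand-in for a system file held in RAM before being written out.
original = b"hypothetical boot file contents" * 4
corrupted = flip_bit(original, 100)

# Only one byte differs, yet the file no longer matches its known-good hash.
differing_bytes = sum(a != b for a, b in zip(original, corrupted))
hashes_match = sha256_hex(original) == sha256_hex(corrupted)
print(differing_bytes, hashes_match)  # prints: 1 False
```

A tool that compares files against known-good hashes will flag such a file as corrupt even though only one bit is wrong, and which file is hit depends on what happened to be in the bad memory region at the time.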

Bad RAM could also affect what the video card thinks should be drawn, and explains your first symptom.

So RAM is highly suspect, but the clincher is this:

You have evidence that RAM is the culprit

You may be leaning against the idea of trusting this evidence. I disagree. I believe this evidence should be trusted.

"RAM is pretty resilient, and it passes everything but the extreme hammer test in memtest"

When I've had bad RAM (unfortunately for me, I have), Memtest86 usually picks up on it on the first pass. In some cases, it doesn't pick it up until the 3rd or 4th pass. Rarely, it has picked up RAM errors on larger pass numbers, like 78 or 81 or 133. If Memtest86 picks up any errors, I do consider the RAM to be bad. If I'm on a machine that stores any files with data that I care about, then I consider the bad RAM unsuitable. (I don't want my files to have incorrect data.) In theory, I might use a machine with bad RAM for something like a media server, a print server, etc., where stability is less important to me and where I don't store any data that I would mind losing. In practice, this limitation ends up meaning that I have no real use for bad RAM.
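As a rough illustration of why an intermittent fault can take many passes to show up, here is a small sketch. It assumes each pass independently has the same chance of catching the fault, which is a simplification, and the per-pass probabilities are made-up values, not measurements.

```python
def chance_caught_by_pass(p_per_pass: float, passes: int) -> float:
    """Probability that at least one of `passes` independent passes reports the error."""
    return 1.0 - (1.0 - p_per_pass) ** passes


# Hypothetical per-pass detection chances: an obvious fault vs. a rare, intermittent one.
for p in (0.5, 0.01):
    for n in (1, 4, 100):
        print(f"p={p}, passes={n}: {chance_caught_by_pass(p, n):.3f}")
```

Under these assumptions, a fault that shows up half the time is nearly certain to be caught within a handful of passes, while a fault with a 1% per-pass chance can easily survive dozens of passes before appearing.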

However, I hadn't read Memtest86 documentation for a while, and wasn't familiar with this "extreme hammer test in memtest". So I checked it out.

Memtest86.com: Troubleshooting FAQ: "Why am I only getting errors during Test 13 Hammer Test?"

The text there is a bit of a lengthy answer (multiple screens), but I suggest reading it since it looks like this affects you. Most notably, I point out this sentence: "The errors detected during Test 13, albeit exposed only in extreme memory access cases, are most certainly real errors."

Author by

Chris Le Sueur

Graduate student in the areas of set theory and logic at the University of Bristol.

Updated on September 18, 2022

Comments

  • Chris Le Sueur
    Chris Le Sueur over 1 year

    I have a machine that has been damaged by some careless couriers, and want to replace the damaged parts efficiently. I have limited opportunities to test components in other computers, so I'm trying to find out what is broken in other ways.

    I have two main issues:

    1. Graphical artifacts. These take the form of small grid-aligned squares which usually appear and then flicker from position to position. If the display driver doesn't crash, they often settle down to a final position, and sometimes the contents of the squares themselves change. This says to me that VRAM is being corrupted. Occasionally there are other artifacts, like polygon spikes, in games.

      This is affected by physically pushing on the graphics card: in particular, with the computer on its side, it usually goes away, which strongly suggests a graphics card error. However, it could also be the PCI-e slot or some part of the motherboard.

    2. Twice since the problems started, Windows has somehow been rendered unbootable and unsalvageable: each time some boot files were corrupt and SFC could not fix them (different errors and files each time) so I had to reinstall. The first time this occurred was following a BSOD which occurred after graphical artifacts while playing a game. The second time, the computer BSODed while I wasn't doing anything, but I wonder if it was installing updates.

    The thing is, I would quite like it to be the case, for the sake of my wallet, that these are caused by the same underlying phenomenon. So, my question is: is it reasonable to believe that graphics card damage could somehow cause system corruption (presumably by the display driver doing something wacky in kernel mode?) and/or is it reasonable to believe that some other kind of system damage, presumably to the motherboard, could cause very specific graphical artifacts and occasionally more general breakage?

    I should say that I strongly doubt the RAM is to blame (since we're talking about physical damage and RAM is pretty resilient, and it passes everything but the extreme hammer test in memtest).

    I have disabled the graphics card and tested with on-board graphics. This gets rid of the graphical artifacts but does not rule out the slot, or motherboard circuitry related to the card, of course.

    I have checked for SMART errors on the disks but there are none. Of course that's not the be-all and end-all. Temperatures are all quite reasonable (CPU gets a bit toasty but it always has) and definitely not correlated to the artifacts or BSODs. I can run furmark/prime95 quite happily for ages with no ill effects. Specific games are more likely to trigger artifacts and driver crashes, presumably because they use the faulty circuitry more.

    • Admin
      Admin almost 8 years
      Couriers have insurance. They should be paying for it.
    • Admin
      Admin almost 8 years
      Is there any damage to the case? If not then you cannot blame the couriers for any damage. What makes you think the damage was caused by moving the machine?
    • Admin
      Admin almost 8 years
      I don't just make such accusations blindly. When the machine arrived, the case was junk, the cardboard box also being trashed. I already got the courier to pay for everything they would, which is the case - they don't cover electronics (unless they lose it) which I knew in the beginning. That said, even if I didn't have such obvious evidence, if a system is working, gets shipped and then develops problems that clearly stem from physical damage (i.e. fluctuate with manipulating the hardware), the courier is almost surely to blame.
    • Admin
      Admin almost 8 years
      @ChrisLeSueur - While I won't disagree with you, I have had a computer fail after 2 weeks of not being plugged into the wall: a memory module just randomly stopped working. Typically a bad graphics card would cause graphical artifacts; system files becoming corrupt wouldn't be caused by a bad graphics card.
    • Admin
      Admin over 7 years
      If your graphics card is causing a driver crash or some more sinister electrical problem, then your computer could shut down unexpectedly. Depending on what it was doing last, it could cause file corruption in the boot files or registry that may prevent Windows from booting properly. You will probably have to replace the card first to see if the other problems go away.
  • Chris Le Sueur
    Chris Le Sueur over 7 years
    This is unlikely. About 75% of current RAM fails the hard hammer test (I'd already read the docs) and, while it is a real error, it's very unlikely to be a real-world problem - so yes, that can be dismissed. And under no circumstances can bad RAM explain the entire problem: if it were the only cause, the RAM would be so bad as to render the machine unbootable, because of the frequency of visual artifacts. There's no doubt that something affecting only the graphics is damaged.
  • TOOGAM
    TOOGAM over 7 years
    Even if lots of RAM sticks fail the hard hammer test, that still doesn't invalidate the remainder of the conclusion. There is still the very good possibility that a portion of the RAM chip has become unreliable, and the operating system may be typically using a different portion of the RAM chip, enough that the system appears stable but the (often larger) portion that the game may be using is more affected. You may also want to try re-seating the RAM chips if you haven't tried that yet. (Even if probability is low, it can possibly be a free and rather fast fix.)
  • Chris Le Sueur
    Chris Le Sueur over 7 years
    If 75% of RAM fails the test, but few systems experience problems because of it, failing the test ceases to be indicative of the cause of my problem.
  • Chris Le Sueur
    Chris Le Sueur over 7 years
    Also, as far as I know, there is not even any reason to believe that memory used for the screen buffer etc. will even use the same physical addresses from boot to boot - whereas VRAM is VRAM. At the same time, I can physically manipulate the graphics card slightly and observe artifacts immediately. If this were due to a mis-seated DIMM, there would be more errors in memtest: test 13 detects charge leakage in the RAM itself; without other errors, that's likely the only problem with the RAM. If there's not a problem with the card itself, there is certainly a problem with the slot or something.
  • TOOGAM
    TOOGAM over 7 years
    "there is not even any reason to believe that memory used for the screen buffer etc. will even use the same physical addresses from boot to boot" -- very often, it does. Although some things can be random, so you shouldn't count on the same addresses, many things often end up working out the same (or similar). If many memory locations are randomized, boot code may be the least likely to be randomized, as it would need to be moved after booting. Despite all I've said so far, I admit that your results when moving the graphics card are a compelling reason to suspect something other than the RAM chips.