How to blacklist the correct bad RAM sector according to the MemTest86+ error indication?

linux kernel memory ram

36,200

Solution 1

memmap

There is a tutorial titled Bad Memory HowTo that discusses disabling memory through kernel boot arguments such as memmap. According to the howto, you have 2 options:

Turn off everything after the bad memory - (mem=###M option)
Turn off just the memory around the bad memory - (memmap=#M$###M option)

With the first option, if memtest reports that there is bad memory at 600M then you could disable the RAM from that point up until the end of RAM with this:

 mem=595M

If there's bad RAM at 802M and 807M, you can disable a 10M section of RAM starting at 800M like this:

memmap=10M$800M

NOTE: This will blacklist the 10M after the 800M base address. You should run memtest86+ afterwards to confirm that this argument is correct.
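As a rough sketch of the arithmetic behind the two options above, the following Python helpers build the argument strings from failing offsets given in megabytes. The function names, the 5M safety margin and the 10M alignment are assumptions chosen to reproduce the examples in this answer, not values the kernel requires.

```python
def mem_arg(first_bad_mb: int, margin_mb: int = 5) -> str:
    """Option 1: cut off all RAM from slightly below the first bad offset."""
    return f"mem={first_bad_mb - margin_mb}M"


def memmap_arg(bad_mbs: list[int], align_mb: int = 10) -> str:
    """Option 2: reserve one aligned window that covers every bad offset."""
    start = (min(bad_mbs) // align_mb) * align_mb    # round down to the alignment
    end = (max(bad_mbs) // align_mb + 1) * align_mb  # round up past the last bad MB
    return f"memmap={end - start}M${start}M"


print(mem_arg(600))            # mem=595M
print(memmap_arg([802, 807]))  # memmap=10M$800M
```

When such an argument is written into a boot loader config file, the `$` may need escaping; that is outside the scope of this sketch.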

BadRAM

There is a patch available for Ubuntu called BadRAM. It's covered very well in a post titled BadRAM on the Ubuntu Community site.

After applying the patch to the kernel using the details from that page, you make modifications to your Grub2 setup:

excerpt from that site for Grub2

The GRUB2 config file in Natty has a line for configuring kernel bad ram exclusions. So, I will assume that is the preferred way of mapping out a section of memory that is showing errors. The line I set was

GRUB_BADRAM="0x7DDF0000,0xffffc000"

The suggested way on every web site I could find to set this was to run memtest86 and let it show you the BadRAM settings. memtest86 gave me a page of stuff I would have had to enter. I could see that all the addresses were in one 16K block, so I just wanted to map that 16K block out of action. Here is how I generated the correct entry.

The first parameter is easy. That is the base address of the bad memory. In my case, I could see that all the bad addresses were greater than 0x7DDF0000 and less than 0x7DDF4000. So, I took the beginning of the 16K block as my starting address.

The second parameter is a mask. You put 1s where the address range you want shares the same values and 0s where it will vary. This means you need to pick your address range such that only the low order bits vary. Looking at my address, the first part of the mask is easy. You want to start with 0xffff. For the next nibble, I will explain with bit maps. I want to range from 0000 to 0011. So, the mask for badram would be 1100 or a hex c. The last 3 nibbles need to be all 0s in the mask, since we want the entire range mapped out. So, we get a total result of 0xffffc000.

After setting this line in /etc/default/grub, I ran sudo update-grub and rebooted and my bad memory was no longer being used. No kernel patches are needed to map out bad memory using this method.
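The mask derivation in the excerpt can be sketched in Python as follows. This is only an illustration: `badram_entry` is a hypothetical helper, and it assumes 32-bit addresses and a power-of-two block whose base is aligned to the block size, which matches the 16K example above.

```python
def badram_entry(base: int, block_size: int) -> str:
    """Build a GRUB_BADRAM line for one aligned, power-of-two block."""
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block_size must be a power of two")
    if base % block_size:
        raise ValueError("base must be aligned to block_size")
    # 1 bits where every address in the block agrees, 0 bits where it varies.
    mask = 0xFFFFFFFF & ~(block_size - 1)
    return f'GRUB_BADRAM="0x{base:08X},0x{mask:08x}"'


print(badram_entry(0x7DDF0000, 16 * 1024))
# GRUB_BADRAM="0x7DDF0000,0xffffc000"
```

The alignment checks are there because a single address/mask pair can only describe a block in which just the low-order bits vary.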

Follow up #1

Looking through the wikipedia page for memtest86+ it states as follows:

excerpt from Memtest86 wikipedia page

Starting from Memtest86 2.3 and Memtest86+ 1.60, the program can output a list of bad RAM regions in the format expected by the BadRAM patch for the Linux kernel; using this information, a Linux system can reliably use a RAM module even if it has a few bad bits. Grub2 is able to supply this same information to an unpatched kernel, negating the need for the BadRAM patch.

Also I came across this Gentoo page which specified the memmap=... using a hex address, so you could specify it like this:

memmap=5M$0x2f796c48

The 5M is just a guess; you could adjust it lower or higher depending on how much RAM around that region you want or need to omit.

Finally you can specify the size in hex as well:

memmap=0x10000$0x2f796c48

This would ignore 64 KB starting at address 0x2f796c48.
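To connect this to the MemTest86+ report from the question (`Failing address: 002f796c48`), here is a minimal Python sketch that converts the reported hex address to megabytes and builds a hex memmap argument. The helper name, the 4 KiB page size and the choice to start the window at the page containing the address are assumptions for illustration. Whether a given kernel accepts hex values here is worth checking on your own system.

```python
PAGE = 4096  # assumed page size in bytes


def memmap_for_address(fail_addr: int, size: int = 0x10000) -> str:
    """Reserve `size` bytes starting at the page that contains fail_addr."""
    base = fail_addr & ~(PAGE - 1)  # round down to a page boundary
    return f"memmap={size:#x}${base:#x}"


addr = int("002f796c48", 16)       # address exactly as MemTest86+ printed it
print(f"{addr / 2**20:.2f} MiB")   # 759.59 MiB
print(memmap_for_address(addr))    # memmap=0x10000$0x2f796000
```

So the reported address is a byte offset into physical memory, and it falls within the 759th to 760th megabyte.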

Solution 2

Memtest86+ (I used 4.20) can output the BadRAM format directly.

Press 'c' to reach the configuration dialogue
Then '4' for "Error Report Mode"
Then '3' for "BadRAM Patterns"

The output will change from a list of individual test failures to a series of badram= lines, each containing one more new bad sector. Because the lines append and coalesce adjacent segments you can just run the test headless overnight and use the final printed line (though if you have a really bad dimm the less-accurate "5 megs around this point" format will likely be quite a bit shorter).
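To sanity-check a badram= line before copying it into the boot configuration, a short Python sketch like the one below can test whether a known failing address is covered. It assumes the line is a list of address,mask pairs and that an address matches when it agrees with the pattern on every bit set in the mask, which is the behaviour described for the mask in Solution 1. The sample line is made up from the values in that solution and is not real Memtest86+ output.

```python
def parse_badram(line: str) -> list[tuple[int, int]]:
    """Split 'badram=addr,mask[,addr,mask...]' into (addr, mask) pairs."""
    values = [int(v, 16) for v in line.split("=", 1)[1].split(",")]
    if len(values) % 2:
        raise ValueError("expected address,mask pairs")
    return list(zip(values[0::2], values[1::2]))


def is_excluded(addr: int, patterns: list[tuple[int, int]]) -> bool:
    """True if addr agrees with some pattern on all bits set in its mask."""
    return any((addr & mask) == (base & mask) for base, mask in patterns)


patterns = parse_badram("badram=0x7ddf0000,0xffffc000")  # hypothetical sample
print(is_excluded(0x7DDF1234, patterns))  # True: inside the 16K block
print(is_excluded(0x7DDF4000, patterns))  # False: first byte after the block
```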

Final result: (screenshot of Memtest86+ showing badram output)

Solution 3

A very dirty but very nice workaround: run a user-space memtester and wait until it finds an error. Suppose it is, for example, at 0xfce2ea31.

Then run memtester again, but on that physical address:

memtester -p 0xfce20000 64k 128

To be sure, it is better to sacrifice more than just the page containing the problematic address. Here we sacrificed 64 KB around the faulty address.
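The base address in the command above is the faulty address rounded down to a 64 KB boundary. A small Python sketch of that calculation follows; `memtester_command` is a hypothetical helper, and the window size and loop count are simply the values used in this answer.

```python
def memtester_command(fail_addr: int, window: int = 64 * 1024, loops: int = 128) -> str:
    """Build a memtester call that pins the aligned window around fail_addr."""
    base = fail_addr & ~(window - 1)  # round down to a multiple of the window
    return f"memtester -p {base:#x} {window // 1024}k {loops}"


print(memtester_command(0xFCE2EA31))  # memtester -p 0xfce20000 64k 128
```

Because the window is a power of two and the base is aligned to it, the faulty address is guaranteed to lie inside the tested range.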

If all went well, it will find the faulty memory location again, and far more quickly.

Then suspend the memtester process with Ctrl+Z.

Consequence: as long as the memtester process stays suspended, it won't consume any more resources, but no other process will be able to access the faulty memory, because it remains allocated by memtester.

This is particularly useful on big, remote servers. The suspended process can stay until the new RAM is shipped, or maybe until next Christmas, when downtime won't be such a big problem.

Solution 4

Starting with Linux 2.6 you can build your kernel with CONFIG_MEMTEST=y

Booting with "memtest" on the kernel command line will then run a quick test of system RAM on every bootup and automatically exclude obvious bad spots.

This won't catch everything since memtest86 is a lot more thorough and runs multiple passes, but it will likely work in the majority of situations, and has the advantage of not requiring manual intervention should you lose another few sectors.

Ivan

Updated on September 18, 2022

Comments

Ivan over 1 year
MemTest86+ (the version included with Ubuntu 13.04) says
```
Failing address: 002f796c48 -    759.5 MB
```
What should I specify in the memmap kernel parameter to bypass this area?

I've tried running memtester 770MB and it says everything is OK, so it doesn't look like MemTest's indication means an error in the 759.5th MB from the start.

How do I interpret this MemTest indication to configure memmap?

I have no money to buy new RAM now and there seems to be only a single error, so I hope I can just work around it.
- Bratchley about 11 years
  
  FWIW, the kernel will mark certain pages as being "reserved" if it detects a bad segment but is able to recover. Does the output of "free -m" show powers of two for the totals? I mention this as a way of explaining why memtester can't see the bad RAM but memtest86+ can.
- Ivan about 11 years
  
  Doesn't look like powers of two actually: i.stack.imgur.com/l86L1.png
- psusi about 11 years
  
  By the time an error is detected (if you even have ECC RAM), it is generally too late. Also, free -m never reports an even power of two, as the BIOS and kernel both reserve some RAM.
- Bratchley about 11 years
  
  Looks like the kernel also printk's when it finds a bad page (line 264-265).
- frostschutz about 11 years
  
  How much RAM do you have in total? memtester 770MB doesn't test the first 770MB, but any 770MB it could allocate. Whatever other RAM is still free then isn't tested. The address provided by memtest86+ should be reliable so memmap that if anything.
- Ivan about 11 years
  
  The total is 4 GiB, @frostschutz . I've already located the bad DIMM and replaced it actually but the answer to the question still seems interesting to know so I am not going to delete the question.
- slm about 11 years
  
  Looks like this question was cross posted on SU: superuser.com/questions/592870/…
- Hauke Laging about 11 years
  
  Does slm's solution work for you?
- Ivan over 10 years
  
  I ask for Windows solutions there and for Linux solutions here, @slm
- Ivan over 10 years
  
  I had already got rid of the bad RAM by the time the answer was submitted, @HaukeLaging, so I couldn't check it out (I only approve answers I have tested). I shall test it next time I find a PC with corrupt RAM.
Hauke Laging almost 11 years

"800M to 804M" is supposed to be "800M to 810M" I assume...
slm almost 11 years

It can be but what I wrote is OK too, even though it's throwing away more memory than the 4M between 800M to 810M.
Ivan almost 11 years

1. I know about the memmap option but the question is more about how to interpret the memtest86+ output. I have given a specific example of memtest86+ output and ask for help in configuring memmap accordingly in this particular case. 2. "You should run memtest86+ afterwards to confirm that this argument is correct." - memtest86+ runs before an OS kernel so I seriously doubt the memmap Linux kernel option can affect it.
slm almost 11 years

@Ivan, 1. I thought it was obvious given the examples I included, but you'd need to say something like: memmap=5M$759M for your particular case, given memtest86+ is failing at 759.5MB. 2. I meant that you should pass the memmap=... option to memtest86+ as well. That was untested/unconfirmed by me but something that you may be able to do with memtest86+.
slm almost 11 years

This SU Q&A explains the output better than I could: superuser.com/a/326099/20568
Ehtesh Choudhury over 10 years

Now if I didn't have to copy that over by hand and instead hand it over to GRUB without retyping errors, that would be fantastic.
eMPee584 almost 10 years

What I did is take a photo of it (camera phone), load it up into GIMP, => grayscale => invert => contrast/gamma then hand it to tesseract ${IMG} stdout .. then verified and corrected the line before inserting into /etc/default/grub ... Probably took just as long as manually entering it straight away^^
Petr over 8 years

memmap only accepts megabytes; providing an address results in a parser error and kernel lockup.
Petr over 8 years

Also, memtest doesn't seem to accept Linux parameters. Giving it memmap doesn't do anything; it still tests the memory as if it had been told nothing.
flying sheep over 5 years

Definitely more fun than doing it manually though
TooTea over 3 years

Instead of this trick, you can also use the chmem tool in util-linux to tell the kernel to take a particular memory range offline (moving the data elsewhere and then never reusing the pages).
peterh over 3 years

@TooTea I tried this tool on multiple machines, and it could not deactivate a single memory block.
K-att- over 2 years

Thanks a lot. It's nice.