On system memory... specifically the difference between `tmpfs`, `shm`, and `hugepages`...


Solution 1

There is no difference between tmpfs and shm. tmpfs is the new name for shm. shm stands for SHared Memory.

See: Linux tmpfs.

The main reason tmpfs is even used today is this comment in my /etc/fstab on my Gentoo box. BTW, Chromium won't build with the line missing:

# glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for 
# POSIX shared memory (shm_open, shm_unlink). 
shm                     /dev/shm        tmpfs           nodev,nosuid,noexec     0 0 

which came out of the Linux kernel documentation.

Quoting:

tmpfs has the following uses:

1) There is always a kernel internal mount which you will not see at
all. This is used for shared anonymous mappings and SYSV shared
memory.

This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not set, the user visible part of tmpfs is not built. But the internal
mechanisms are always present.

2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
POSIX shared memory (shm_open, shm_unlink). Adding the following
line to /etc/fstab should take care of this:

tmpfs /dev/shm tmpfs defaults 0 0

Remember to create the directory that you intend to mount tmpfs on if necessary.

This mount is not needed for SYSV shared memory. The internal
mount is used for that. (In the 2.3 kernel versions it was
necessary to mount the predecessor of tmpfs (shm fs) to use SYSV
shared memory)

3) Some people (including me) find it very convenient to mount it
e.g. on /tmp and /var/tmp and have a big swap partition. And now
loop mounts of tmpfs files do work, so mkinitrd shipped by most
distributions should succeed with a tmpfs /tmp.

4) And probably a lot more I do not know about :-)

tmpfs has three mount options for sizing:

size: The limit of allocated bytes for this tmpfs instance. The default is half of your physical RAM without swap. If you oversize your tmpfs instances the machine will deadlock since the OOM handler will not be able to free that memory.
nr_blocks: The same as size, but in blocks of PAGE_CACHE_SIZE.
nr_inodes: The maximum number of inodes for this instance. The default is half of the number of your physical RAM pages, or (on a machine with highmem) the number of lowmem RAM pages, whichever is the lower.
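
To make the quoted documentation concrete, here is a minimal shell sketch; the mount point /mnt/scratch and the limits chosen are only examples, not values taken from the question:

    # confirm that /dev/shm is the tmpfs mount glibc expects for shm_open()/shm_unlink()
    findmnt /dev/shm

    # mount a tmpfs instance with explicit limits (the size/nr_inodes options described above)
    mkdir -p /mnt/scratch
    mount -t tmpfs -o size=1G,nr_inodes=100k tmpfs /mnt/scratch

    # the limits can be adjusted later without unmounting
    mount -o remount,size=2G /mnt/scratch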

From the Transparent Hugepage Kernel Doc:

Transparent Hugepage Support maximizes the usefulness of free memory if compared to the reservation approach of hugetlbfs by allowing all unused memory to be used as cache or other movable (or even unmovable entities). It doesn't require reservation to prevent hugepage allocation failures to be noticeable from userland. It allows paging and all other advanced VM features to be available on the hugepages. It requires no modifications for applications to take advantage of it.

Applications however can be further optimized to take advantage of this feature, like for example they've been optimized before to avoid a flood of mmap system calls for every malloc(4k). Optimizing userland is by far not mandatory and khugepaged already can take care of long lived page allocations even for hugepage unaware applications that deals with large amounts of memory.
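
If you want to see what THP is doing on a running system, the sysfs knobs below should exist on any kernel built with transparent hugepage support; "madvise" is just one possible policy value:

    # current policy; the active one is shown in brackets: [always] madvise never
    cat /sys/kernel/mm/transparent_hugepage/enabled

    # how much anonymous memory THP is currently backing with huge pages
    grep AnonHugePages /proc/meminfo

    # change the policy at runtime (as root); madvise limits THP to madvise(MADV_HUGEPAGE) regions
    echo madvise > /sys/kernel/mm/transparent_hugepage/enabled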


New Comment after doing some calculations:

HugePage Size: 2 MB
HugePages Used: None/Off, as evidenced by the all-zero counters, but enabled as per the 2 MB above.
DirectMap4k: 8.03 GB
DirectMap2M: 16.5 GB
DirectMap1G: 2 GB

Using the paragraph above regarding optimization in THS, it looks as though 8 GB of your memory is being used by applications that operate using mallocs of 4k, and 16.5 GB has been requested by applications using mallocs of 2M. The applications using mallocs of 2M are mimicking HugePage Support by offloading the 2M sections to the kernel. This is the preferred method, because once the malloc is released by the kernel, the memory is released to the system, whereas mounting tmpfs using hugepage wouldn't result in a full cleaning until the system was rebooted. Lastly, the easy one: you had 2 programs open/running that requested a malloc of 1 GB.

For those of you reading who don't know, malloc is the standard C function for Memory ALLOCation. These calculations serve as proof that the OP's correlation between DirectMapping and THS may be correct. Also note that mounting a HUGEPAGE ONLY fs would only result in a gain in increments of 2 MB, whereas letting the system manage memory using THS occurs mostly in 4k blocks, meaning in terms of memory management every malloc call saves the system 2044k (2048 - 4) for some other process to use.
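
For contrast with THS, the HugePages_* counters in the question only move when the static (non-transparent) pool is populated. A sketch, where 128 pages is an arbitrary example:

    # reserve 128 huge pages (128 x 2 MB = 256 MB) in the static pool, as root
    echo 128 > /proc/sys/vm/nr_hugepages        # or: sysctl vm.nr_hugepages=128

    # the counters that were all 0 in the question's tail now reflect the reservation
    grep HugePages /proc/meminfo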

Solution 2

To address the "DirectMap" issue: the kernel has a linear ("direct") mapping of physical memory, separate from the virtual mappings allocated to each user process.

The kernel uses the largest possible pages for this mapping to cut down on TLB pressure.

DirectMap1G is visible if your CPU supports 1 GB pages (Barcelona onwards; some virtual environments disable them), and if they are enabled in the kernel - the default is on for 2.6.29+.
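
A quick way to check both halves of that statement on your own machine (output will obviously vary):

    # 1 GB pages require the pdpe1gb CPU flag; no output means the CPU (or hypervisor) hides it
    grep -o pdpe1gb /proc/cpuinfo | sort -u

    # how the kernel's direct mapping is split across 4k, 2M and 1G pages
    grep DirectMap /proc/meminfo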

Solution 3

There's no difference between shm and tmpfs (actually, tmpfs is only the new name of the former shmfs). hugetlbfs is a tmpfs-based filesystem that allocates its space from kernel huge pages and needs some additional configuration effort (how to use this is explained in Documentation/vm/hugetlbpage.txt).
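
A minimal sketch of that extra configuration, following Documentation/vm/hugetlbpage.txt; the mount point and the gid/mode values are illustrative:

    # huge pages must be reserved first (see the nr_hugepages example above)
    mkdir -p /mnt/huge
    mount -t hugetlbfs none /mnt/huge

    # optionally restrict access and, on kernels with multiple huge page sizes, pick one:
    # mount -t hugetlbfs -o gid=1001,mode=0770,pagesize=2M none /mnt/huge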



Comments

  • mikeserv
    mikeserv over 1 year

    I've been curious lately about the various Linux kernel memory based filesystems.

    Note: As far as I'm concerned, the questions below should be considered more or less optional when compared with a better understanding of that posed in the title. I ask them below because I believe answering them can better help me to understand the differences, but as my understanding is admittedly limited, it follows that others may know better. I am prepared to accept any answer that enriches my understanding of the differences between the three filesystems mentioned in the title.

    Ultimately I think I'd like to mount a usable filesystem with hugepages, though some light research (and still lighter tinkering) has led me to believe that a rewritable hugepage mount is not an option. Am I mistaken? What are the mechanics at play here?

    Also regarding hugepages:

         uname -a
    3.13.3-1-MANJARO \
    #1 SMP PREEMPT \
    x86_64 GNU/Linux
    
        tail -n8 /proc/meminfo
    HugePages_Total:       0
    HugePages_Free:        0
    HugePages_Rsvd:        0
    HugePages_Surp:        0
    Hugepagesize:       2048 kB
    DirectMap4k:     8223772 kB
    DirectMap2M:    16924672 kB
    DirectMap1G:     2097152 kB
    

    (Here are full-text versions of /proc/meminfo and /proc/cpuinfo )

    What's going on in the above? Am I already allocating hugepages? Is there a difference between DirectMap memory pages and hugepages?

    Update: After a bit of a nudge from @Gilles, I've added 4 more lines above and it seems there must be a difference, though I'd never heard of DirectMap before pulling that tail yesterday... maybe DMI or something?

    Only a little more...

    Failing any success with the hugepages endeavor, and assuming harddisk backups of any image files, what are the risks of mounting loops from tmpfs? Is my filesystem being swapped the worst-case scenario? I understand tmpfs is mounted filesystem cache - can my mounted loopfile be pressured out of memory? Are there mitigating actions I can take to avoid this?

    Last - exactly what is shm, anyway? How does it differ from or include either hugepages or tmpfs?

    • Gilles 'SO- stop being evil'
      Gilles 'SO- stop being evil' about 10 years
      What about the previous lines in /proc/meminfo that contain HugePage (or does your kernel version not have these)? What architecture is this on (x86_64 I suppose)?
    • mikeserv
      mikeserv about 10 years
      I'll add them. I was just worried about it being too long.
    • mikeserv
      mikeserv about 10 years
      @Gilles - I've linked to plain text above. I hope that's ok. Thanks for asking - I should have included it in the first place - I don't know how I missed that.
  • mikeserv
    mikeserv about 10 years
    This was a good try, and I had read those docs, of course. Or maybe not of course - but I think I'm going to put this out for a 100rep bounty, but before I do, I will offer it to you if you can expand on this. So far you've yet to enrich my understanding - I already knew most of it, except that the two were merely synonyms. In any case, if you can make this a better answer by tomorrow morning the 100rep bounty is yours. Especially interesting to me is that I find no mention of DirectMap at all in the procfs man page. How come?
  • mikeserv
    mikeserv about 10 years
    This is really good- is the THS my DirectMap?
  • eyoung100
    eyoung100 about 10 years
    That I can't answer, as I googled DirectMapping and found nothing related to tmpfs etc. The only thing I could find was how to configure HugeMem Support for Oracle Databases running on their flavor of Linux, which means they are using HugePages instead of the THS I referred to. All kernels in the 2.6 branch support THS, though. As a hunch, see my new comment above.
  • mikeserv
    mikeserv about 10 years
    Yeah I turned up very little as well. I have done some reading on HP, THP. I'm pretty intrigued by your comment. This is really shaping up, man. This last part - HP only - should I interpret this to mean that I can mount a read/write filesystem atop a hugepage mount? Like, an image file loop-mounted from a hugepage mount? Writable?
  • eyoung100
    eyoung100 about 10 years
    Yes, and it is writable when mounted properly, but be aware: 1. Since you mounted it, you're in charge of cleanup. 2. It's wasteful: using your example, let's say that your loop only contained a text file with the characters: Hello, my name is Mike. Assuming each character is 1k, that file will save as 23k. You've wasted 2025k, as the Hugepage gave you 2 MB. That wasteful behavior is why memory management was built into the kernel. It also prevents us from needing a wrapper DLL like kernel32
  • eyoung100
    eyoung100 about 10 years
    and lastly 3. You lose your mount upon reboot or crash.
  • mikeserv
    mikeserv about 10 years
    But what about the possible gains that can be had in using larger blocksizes - particularly with btrfs? If we're limited to 4k node sizes it restricts possible performance gains - especially for smaller images in mixed-metadata mode. This isn't a test or anything - your response won't negatively affect my judgement of your already excellent answer in any way. This is just discussion. But look at this: unix.stackexchange.com/questions/123250/…
  • mikeserv
    mikeserv about 10 years
    Don't you think that a tail-packing fs like btrfs and/or squash could stand to gain from larger page sizes?
  • eyoung100
    eyoung100 about 10 years
    You miss the idea behind page size. Take a look at the BTRFS Wiki. I don't know what tail packing is, but I don't need to for this to make sense. The purpose of optimizing a file system is AN INVERSE RELATIONSHIP. I can cram more crap into a smaller area if my page size is smaller, while optimizing space on a huge disk. i.e. that 32k file takes 16 pages to write to disk with no wasted page size if the PageSize is 2k, which is bullet 2 on the wiki. At the same time, I've optimized for huge disks. 2TB for example is divisible by 2k with no remainder
  • mikeserv
    mikeserv about 10 years
  • eyoung100
    eyoung100 about 10 years
    My comment above and your tail packing wiki are saying the same thing. Think of it this way: if the 2 of us lived in a huge house and we wanted to talk to each other effectively, we would each reduce the distance between us by coming closer together. In this example, the distance is page size. The size of the house has no bearing on how effectively we minimize the distance inside it, because for each house size we would minimize the page size by collapsing the distance in between
  • mikeserv
    mikeserv about 10 years
    Weird. I'm getting the opposite: For example, if a 38 KiB file is to be stored in a file system using 32 KiB blocks, the file would normally span two blocks, or 64 KiB, for storage; the remaining 26 KiB of the second block becomes unused slack space. With an 8 KiB block suballocation, however, the file would occupy just 6 KiB of the second block, leaving 2 KiB (of the 8 KiB suballocation block) slack and freeing the other 24 KiB of the block for other files. And I think the data in the first link points to optimal pages at around 16 or 32 KiB.
  • eyoung100
    eyoung100 about 10 years
    You aren't getting the opposite. We are still saying the same thing. File storage is an inverse relationship. In order for 8kb to be an optimal suballocation, the files using it must be small, as your example shows. The bigger the files get, the less efficient the suballocation becomes. In order to overcome the less efficient suballocation you must increase the page size. 16 or 32 is the optimum page size for a disk with both large and small files, as it's balanced. I know that inverse relationship is hard to grasp.
  • mikeserv
    mikeserv about 10 years
    Now I think we're coming together. Especially the btree type filesystems - their biggest drawback is their amount of metadata. Most attempt to compensate for this by inlining the smaller files into the metadata tree - but when the page size is too small this happens less frequently. And this is why I'm curious about using a hugepage mount - I think that, given the right filesystem, it could yield significant advantages. Obviously I'm not talking 2 GB pages here, but my AMD should handle 64kb page sizes, I think.
  • eyoung100
    eyoung100 about 10 years
    Read all 5 of these, starting here