How does stack allocation work in Linux?
Solution 1
It appears that the memory up to the stack limit is not allocated up front (and it could not be with an unlimited stack anyway). https://www.kernel.org/doc/Documentation/vm/overcommit-accounting says:
The C language stack growth does an implicit mremap. If you want absolute guarantees and run close to the edge you MUST mmap your stack for the largest size you think you will need. For typical stack usage this does not matter much but it's a corner case if you really really care
However, mmapping the stack would be the job of the compiler (if it has an option for that).
EDIT: After some tests on an x86_64 Debian machine, I've found that the stack grows without any system call (according to strace). So, this means that the kernel grows it automatically (this is what the "implicit" means above), i.e. without explicit mmap/mremap from the process.
It was quite hard to find detailed information confirming this. I recommend Understanding The Linux Virtual Memory Manager by Mel Gorman. I suppose that the answer is in Section 4.6.1 Handling a Page Fault, with the exception "Region not valid but is beside an expandable region like the stack" and the corresponding action "Expand the region and allocate a page". See also D.5.2 Expanding the Stack.
Other references about Linux memory management (but with almost nothing about the stack):
- Memory FAQ
- What every programmer should know about memory by Ulrich Drepper
EDIT 2: This implementation has a drawback: in corner cases, a stack-heap collision may not be detected, even in the case where the stack would be larger than the limit! The reason is that a write to a variable on the stack may end up in allocated heap memory, in which case there is no page fault and the kernel cannot know that the stack needed to be extended. See my example in the discussion Silent stack-heap collision under GNU/Linux I started in the gcc-help list. To avoid that, the compiler needs to add some code at each function call; this can be done with -fstack-check for GCC (see Ian Lance Taylor's reply and the GCC man page for details).
Solution 2
Linux kernel 4.2
- mm/mmap.c#acct_stack_growth decides if it will segfault or not. It uses rlim[RLIMIT_STACK], which corresponds to the POSIX getrlimit(RLIMIT_STACK).
- arch/x86/mm/fault.c#do_page_fault is the interrupt handler that starts a chain which ends up calling acct_stack_growth.
- arch/x86/entry/entry_64.S sets up the page fault handler. You need to know a bit about paging to understand that part: How does x86 paging work? | Stack Overflow
Minimal test program
We can then test it with a minimal NASM 64-bit program:
global _start
_start:
    ; move the stack pointer 0x7FF000 bytes (one page under 8 MiB) down and write there
    sub rsp, 0x7FF000
    mov [rsp], rax
    ; exit(0)
    mov rax, 60
    mov rdi, 0
    syscall
Make sure that you turn off ASLR and remove environment variables as those will go on the stack and take up space:
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
env -i ./main.out
The limit is somewhere slightly below my ulimit -s (8MiB for me). Looks like this is because of extra System V specified data initially put on the stack in addition to the environment: Linux 64 command line parameters in Assembly | Stack Overflow
If you are serious about this, TODO make a minimal initrd image that starts writing from the stack top and goes down, and then run it with QEMU + GDB. Put a dprintf on the loop printing the stack address, and a breakpoint at acct_stack_growth. It will be glorious.
Related:
- https://softwareengineering.stackexchange.com/questions/207386/how-are-the-size-of-the-stack-and-heap-limited-by-the-os
- Where is the stack memory allocated from for a Linux process? | Stack Overflow
- What is the Linux Stack? | Stack Overflow
- What is the maximum recursion depth in Python, and how to increase it? on Stack Overflow
Solution 3
By default, the maximal stack size is configured to be 8MB per process, but it can be changed using ulimit:
Showing the default in kB:
$ ulimit -s
8192
Set to unlimited:
ulimit -s unlimited
This affects the current shell and subshells and their child processes (ulimit is a shell builtin command).
You can show the actual stack address range in use with:
cat /proc/$PID/maps | grep -F '[stack]'
on Linux.
Amos
Updated on September 18, 2022
Comments
-
Amos almost 2 years
Does the OS reserve the fixed amount of valid virtual space for stack or something else? Am I able to produce a stack overflow just by using big local variables?
I wrote a small C program to test my assumption. It's running on X86-64 CentOS 6.5.
#include <string.h>
#include <stdio.h>
int main() {
    int n = 10240 * 1024;
    char a[n];
    memset(a, 'x', n);
    /* note: %x with a pointer is not portable; here it shows only the low 32 bits */
    printf("%x\n%x\n", &a[0], &a[n-1]);
    getchar();
    return 0;
}
Running the program gives &a[0] = f0ceabe0 and &a[n-1] = f16eabdf
The proc maps shows the stack: 7ffff0cea000-7ffff16ec000. (10248 * 1024B)
Then I tried to increase n = 11240 * 1024
Running the program gives &a[0] = b6b36690 and &a[n-1] = b763068f
The proc maps shows the stack: 7fffb6b35000-7fffb7633000. (11256 * 1024B)
ulimit -s prints 10240 on my PC.
As you can see, in both cases the stack size is bigger than what ulimit -s gives, and the stack grows with a bigger local variable. The top of the stack is somehow 3-5kB past &a[0] (AFAIK the red zone is 128B).
So how does this stack map get allocated?
-
Amos almost 10 years
So when a program is loaded by the current shell, the OS will make a memory segment of ulimit -s KB valid for the program. In my case it's 10240KB. But when I declare a local array char a[10240*1024] and set a[0]=1, the program exits correctly. Why?
-
vinc17 almost 10 years
Try to set the last element too. And make sure that they are not optimized away.
-
Volker Siegel almost 10 years
@amos I think what vinc17 means is that you named a memory region that would not fit on the stack in your program, but as you do not actually access it in the part that does not fit, the machine never notices that - it does not even get that information.
-
goldilocks almost 10 years
@amos Try int n = 10240*1024; char a[n]; memset(a,'x',n); ...seg fault.
-
Amos almost 10 years
@vinc17 I've set both the first and last element and it still works correctly.
-
Amos almost 10 years
@goldilocks I've tested both memset and for loop. No seg fault happened.
-
goldilocks almost 10 years
You may have to bump the number up slightly. The stack grows down and the OS doesn't police it for you -- the problem will occur when something that's outside that space is overwritten. Keep adding to n and it will happen.
-
goldilocks almost 10 years
I.e. The limit on stack space is not enforced. It's just a caveat -- go beyond this and risk trouble.
-
Volker Siegel almost 10 years
@amos You can show the actual stack address range with cat /proc/$PID/maps while the program is running.
-
vinc17 almost 10 years
@amos By setting a[0] and a[10240*1024-1] with the default 8MB stack size, I get a segfault under GNU/Linux (Debian/unstable).
-
Amos almost 10 years
That seems to be the correct answer to my question. But it confuses me more. When will the mremap call get triggered? Will it be a syscall built into the program?
-
vinc17 almost 10 years
@amos I assume that the mremap call will be triggered if need be at a function call or when alloca() is called.
-
Amos almost 10 years
@VolkerSiegel The stack map shows: 7fffae0b3000-7fffaeab5000.
-
Volker Siegel almost 10 years
@amos Pasting the range into bash and adding some chars: echo $(((0x7fffae0b3000-0x7fffaeab5000)/-1024)) says it's 10248 KB
-
Amos almost 10 years
@VolkerSiegel I set n = 11248 * 1024 and it still works correctly. It's more confusing now.
-
Volker Siegel almost 10 years
@amos [mild cynicism ahead]: that's why we work on virtual machines, not on bare metal, usually this century ;)
-
vinc17 almost 10 years
@amos Could you output the address of a[0]? Some systems have discontinuous stacks (see the -fsplit-stack GCC option).
-
Amos almost 10 years
@vinc17 It's 0x293d2480
-
vinc17 almost 10 years
@amos So, as you can see, a[] has not been allocated in your 10MB stack. The compiler might have seen that there couldn't be a recursive call and has done special allocation, or something else like a discontinuous stack or some indirection.
-
Volker Siegel almost 10 years
@amos Try to disable all optimisations when compiling - should give more 'naively expected' behaviour.
-
Amos almost 10 years
@vinc17 Sorry, I re-executed the program and messed up the maps. Here is the correct view: /proc/pid/maps shows: 7fff55f30000-7fff56a5c000 and a[0]'s addr is 55f30da0, a[n-1]'s addr is 56a5a42f. The stack just grows.
-
terdon almost 10 years
Please try to avoid long discussions in the comments, that's what the Unix & Linux Chat is for.
-
Alen Milakovic almost 10 years
It would probably be a good idea to mention what mmap is, for people who don't know.
-
vinc17 almost 10 years
@FaheemMitha I've added some information. For those who don't know what mmap is, see the memory FAQ mentioned above. Here, for the stack, it would have been "anonymous mapping" so that unused space wouldn't take any physical memory, but as explained by Mel Gorman, the kernel does the mapping (virtual memory) and the physical allocation at the same time.
-
max over 7 years
If I understood you correctly, the kernel doesn't allocate physical memory for the maximum stack size, and instead silently remaps virtual stack to a new physical location as the process uses more and more stack space. This explains why the stack can end up in different physical memory locations, but why does the kernel happily allocate more stack space than the stack limit (given by ulimit -s)?
-
vinc17 over 7 years
@max A bug on your system? With the default 8MB maximum stack size, int main (void) { volatile char buffer[1UL << 23]; buffer[0] = 0; return 0; } yields a SIGSEGV as expected on my Debian/unstable x86_64 machine.
-
max over 7 years
I meant why it happened in OP's situation? Didn't he end up with a stack larger than the limit reported by ulimit?
-
vinc17 over 7 years
@max I've tried the OP's program with ulimit -s giving 10240, like under the OP's conditions, and I get a SIGSEGV as expected (this is what is required by POSIX: "If this limit is exceeded, SIGSEGV shall be generated for the thread."). I suspect a bug in the OP's kernel.
-
Max Power over 2 years
-s only gives the soft stack limit; you need -Hs to get the hard limit (-s -H is also ok, but not -sH, as -s can be used to set a limit and H is not a valid value). On my Debian Bullseye, -S -s gives 8192 KiB but -H -s gives "unlimited". From the bash manual page: "A hard limit cannot be increased by a non-root user once it is set; a soft limit may be increased up to the value of the hard limit. ... If limit is omitted, the current value of the soft limit of the resource is printed, unless the -H option is given."