How does stack allocation work in Linux?


Solution 1

It appears that the stack is not allocated up front to the full limit (it could not be, with an unlimited stack). https://www.kernel.org/doc/Documentation/vm/overcommit-accounting says:

The C language stack growth does an implicit mremap. If you want absolute guarantees and run close to the edge you MUST mmap your stack for the largest size you think you will need. For typical stack usage this does not matter much but it's a corner case if you really really care

However, mmapping the stack would be the compiler's job (if it has an option for that).

EDIT: After some tests on an x86_64 Debian machine, I've found that the stack grows without any system call (according to strace). This means that the kernel grows it automatically (this is what "implicit" means above), i.e. without an explicit mmap/mremap from the process.

It was quite hard to find detailed information confirming this. I recommend Understanding The Linux Virtual Memory Manager by Mel Gorman. I suppose that the answer is in Section 4.6.1 Handling a Page Fault, with the exception "Region not valid but is beside an expandable region like the stack" and the corresponding action "Expand the region and allocate a page". See also D.5.2 Expanding the Stack.

Other references about Linux memory management (but with almost nothing about the stack):

EDIT 2: This implementation has a drawback: in corner cases, a stack-heap collision may go undetected, even when the stack would exceed the limit! The reason is that a write to a variable on the stack may land in allocated heap memory, in which case there is no page fault and the kernel cannot know that the stack needed to be extended. See my example in the discussion "Silent stack-heap collision under GNU/Linux" that I started on the gcc-help list. To avoid this, the compiler needs to add some code at function calls; with GCC this can be done with -fstack-check (see Ian Lance Taylor's reply and the GCC man page for details).

Solution 2

Linux kernel 4.2

Minimal test program

We can then test it with a minimal NASM 64-bit program:

global _start
_start:
    sub rsp, 0x7FF000
    mov [rsp], rax
    mov rax, 60
    mov rdi, 0
    syscall

Make sure that you turn off ASLR and remove environment variables as those will go on the stack and take up space:

echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
env -i ./main.out

The limit is somewhere slightly below my ulimit -s (8MiB for me). Looks like this is because of extra System V specified data initially put on the stack in addition to the environment: Linux 64 command line parameters in Assembly | Stack Overflow

If you are serious about this, TODO make a minimal initrd image that starts writing from the stack top and goes down, and then run it with QEMU + GDB. Put a dprintf on the loop printing the stack address, and a breakpoint at acct_stack_growth. It will be glorious.


Solution 3

By default, the maximal stack size is configured to be 8MB per process,
but it can be changed using ulimit:

Showing the default in kB:

$ ulimit -s
8192

Set to unlimited:

ulimit -s unlimited

affecting the current shell and subshells and their child processes.
(ulimit is a shell builtin command)

You can show the actual stack address range in use with:
cat /proc/$PID/maps | grep -F '[stack]'
on Linux.

Author by

Amos

Updated on September 18, 2022

Comments

  • Amos
    Amos almost 2 years

    Does the OS reserve the fixed amount of valid virtual space for stack or something else? Am I able to produce a stack overflow just by using big local variables?

    I've written a small C program to test my assumption. It's running on x86-64 CentOS 6.5.

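    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    int main(void)
    {
        int n = 10240 * 1024;   /* 10 MiB, equal to my ulimit -s */
        char a[n];              /* variable-length array on the stack */
        memset(a, 'x', n);      /* touch every page of the array */
        /* %x takes an unsigned int, so only the low 32 bits of each
           address are printed; use %p to see the full pointer. */
        printf("%x\n%x\n", (unsigned)(uintptr_t)&a[0],
               (unsigned)(uintptr_t)&a[n-1]);
        getchar();              /* pause so /proc/PID/maps can be inspected */
        return 0;
    }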
    

    Running the program gives &a[0] = f0ceabe0 and &a[n-1] = f16eabdf

    The proc maps shows the stack: 7ffff0cea000-7ffff16ec000. (10248 * 1024B)

    Then I tried to increase n = 11240 * 1024

    Running the program gives &a[0] = b6b36690 and &a[n-1] = b763068f

    The proc maps shows the stack: 7fffb6b35000-7fffb7633000. (11256 * 1024B)

    ulimit -s prints 10240 in my PC.

    As you can see, in both cases the stack is bigger than the limit that ulimit -s reports, and the stack grows with the size of the local variable. The top of the stack is somehow 3-5 kB away from &a[0] (AFAIK the red zone is only 128 B).

    So how does this stack map get allocated?

  • Amos
    Amos almost 10 years
    So when a program is loaded by the current shell, the OS makes a memory segment of ulimit -s KB valid for the program. In my case that's 10240 KB. But when I declare a local array char a[10240*1024] and set a[0]=1, the program exits correctly. Why?
  • vinc17
    vinc17 almost 10 years
    Try to set the last element too. And make sure that they are not optimized away.
  • Volker Siegel
    Volker Siegel almost 10 years
    @amos I think what vinc17 means is that you named a memory region that would not fit on the stack in your program, but as you do not actually access it in the part that does not fit, the machine never notices that - it does not even get that information.
  • goldilocks
    goldilocks almost 10 years
    @amos Try int n = 10240*1024; char a[n]; memset(a,'x',n); ...seg fault.
  • Amos
    Amos almost 10 years
    @vinc17 I've set both the first and last element and it still works correctly.
  • Amos
    Amos almost 10 years
    @goldilocks I've tested both memset and for loop. No seg fault happened.
  • goldilocks
    goldilocks almost 10 years
    You may have to bump the number up slightly. The stack grows down and the OS doesn't police it for you -- the problem will occur when something that's outside that space is overwritten. Keep adding to n and it will happen.
  • goldilocks
    goldilocks almost 10 years
    I.e. The limit on stack space is not enforced. It's just a caveat -- go beyond this and risk trouble.
  • Volker Siegel
    Volker Siegel almost 10 years
    @amos You can show the actual stack address range with cat /proc/$PID/maps - while running.
  • vinc17
    vinc17 almost 10 years
    @amos By setting a[0] and a[10240*1024-1] with the default 8MB stack size, I get a segfault under GNU/Linux (Debian/unstable).
  • Amos
    Amos almost 10 years
    That seems the correct answer to my question. But it confuses me more. When will the mremap call get triggered? Will it be a syscall built into the program?
  • vinc17
    vinc17 almost 10 years
    @amos I assume that the mremap call will be triggered if need be at a function call or when alloca() is called.
  • Amos
    Amos almost 10 years
    @VolkerSiegel The stack map shows :7fffae0b3000-7fffaeab5000.
  • Volker Siegel
    Volker Siegel almost 10 years
    @amos Pasting the range into bash and adding some chars: echo $(((0x7fffae0b3000-0x7fffaeab5000)/-1024)) says it's 10248 KB
  • Amos
    Amos almost 10 years
    @VolkerSiegel I set n = 11248 * 1024 it still works correctly. It's more confusing now.
  • Volker Siegel
    Volker Siegel almost 10 years
    @amos [mild cynicism ahead]: that's why we work on virtual machines, not on bare metal, usually this century ;)
  • vinc17
    vinc17 almost 10 years
    @amos Could you output the address of a[0]? Some systems have discontinuous stacks (see the -fsplit-stack GCC option).
  • Amos
    Amos almost 10 years
    @vinc17 It's 0x293d2480
  • vinc17
    vinc17 almost 10 years
    @amos So, as you can see, a[] has not been allocated in your 10MB stack. The compiler might have seen that there couldn't be a recursive call and has done special allocation, or something else like a discontinuous stack or some indirection.
  • Volker Siegel
    Volker Siegel almost 10 years
    @amos Try to disable all optimisations when compiling - should give more 'naively expected' behaviour.
  • Amos
    Amos almost 10 years
    @vinc17 Sorry, I re-executed the program and messed up the maps. Here is the correct view: /proc/pid/maps shows 7fff55f30000-7fff56a5c000, a[0]'s address is 55f30da0 and a[n-1]'s address is 56a5a42f. The stack just grows.
  • terdon
    terdon almost 10 years
    Please try to avoid long discussions in the comments, that's what the Unix & Linux Chat is for.
  • Alen Milakovic
    Alen Milakovic almost 10 years
    It would probably be a good idea to mention what mmap is, for people who don't know.
  • vinc17
    vinc17 almost 10 years
    @FaheemMitha I've added some information. For those who don't know what mmap is, see the memory FAQ mentioned above. Here, for the stack, it would have been "anonymous mapping" so that unused space wouldn't take any physical memory, but as explained by Mel Gorman, the kernel does the mapping (virtual memory) and the physical allocation at the same time.
  • max
    max over 7 years
    If I understood you correctly, the kernel doesn't allocate physical memory for the maximum stack size, and instead silently remaps virtual stack to a new physical location as the process uses more and more stack space. This explains why the stack can end up in different physical memory locations, but why does the kernel happily allocate more stack space than the stack limit (given by ulimit -s)?
  • vinc17
    vinc17 over 7 years
    @max A bug on your system? With the default 8MB maximum stack size, int main (void) { volatile char buffer[1UL << 23]; buffer[0] = 0; return 0; } yields a SIGSEGV as expected on my Debian/unstable x86_64 machine.
  • max
    max over 7 years
    I meant why it happened in OP's situation? Didn't he end up with a stack larger than the limit reported by ulimit?
  • vinc17
    vinc17 over 7 years
    @max I've tried the OP's program with ulimit -s giving 10240, like under the OP's conditions, and I get a SIGSEGV as expected (this is what is required by POSIX: "If this limit is exceeded, SIGSEGV shall be generated for the thread."). I suspect a bug in the OP's kernel.
  • Max Power
    Max Power over 2 years
    -s only gives the soft stack limit; you need -Hs to get the hard limit (-s -H is also OK, but not -sH, as -s can be used to set a limit and H is not a valid value). On my Debian Bullseye, -S -s gives 8192 KiB but -H -s gives "unlimited". From the bash manual page: "A hard limit cannot be increased by a non-root user once it is set; a soft limit may be increased up to the value of the hard limit. ... If limit is omitted, the current value of the soft limit of the resource is printed, unless the -H option is given."