Increase of virtual memory without increase of VmSize

(The possible solution to your problem is in the last paragraph.)

Memory allocation on most modern operating systems with virtual memory is a two-phase process. First, a portion of the virtual address space of the process is reserved and the virtual memory size of the process (VmSize) increases accordingly. This creates entries in the so-called process page table. Pages are initially not associated with physical memory frames, i.e. no physical memory is actually used. Whenever some part of this allocated portion is actually read from or written to, a page fault occurs and the operating system installs (maps) a free page from physical memory. This increases the resident set size of the process (VmRSS). When some other process needs memory, the OS might store the content of some infrequently used page (the definition of "infrequently used page" is highly implementation-dependent) to persistent storage (the hard drive in most cases, or more generally the swap device) and then unmap it. This decreases the RSS but leaves VmSize intact. If that page is later accessed, a page fault occurs again and it is brought back. The virtual memory size only decreases when virtual memory allocations are freed. Note that VmSize also accounts for memory-mapped files (i.e. the executable file and all shared libraries it links to, or other explicitly mapped files) and shared memory blocks.
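
To see the two phases separately, here is a minimal Linux-only sketch (my illustration, not something from the question): it reserves a large anonymous mapping, then touches every page, and prints VmSize and VmRSS from /proc/self/status at each step. VmSize jumps right after mmap(), while VmRSS only grows once the pages are written to:

    /* Minimal sketch: reserve a large anonymous mapping, then touch it,
     * and watch VmSize vs. VmRSS in /proc/self/status (Linux-specific). */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    static void show(const char *when)
    {
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");
        if (!f) return;
        printf("--- %s ---\n", when);
        while (fgets(line, sizeof(line), f))
            if (!strncmp(line, "VmSize:", 7) || !strncmp(line, "VmRSS:", 6))
                fputs(line, stdout);
        fclose(f);
    }

    int main(void)
    {
        const size_t len = 256UL * 1024 * 1024;   /* 256 MiB */

        show("before mmap");

        /* Phase 1: only virtual address space is reserved -> VmSize jumps. */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return 1;
        show("after mmap (reserved, not yet touched)");

        /* Phase 2: writing each page triggers page faults -> VmRSS grows. */
        for (size_t i = 0; i < len; i += 4096)
            p[i] = 1;
        show("after touching every page");

        munmap(p, len);
        return 0;
    }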

There are two generic types of memory in a process - statically allocated memory and heap memory. The statically allocated memory holds all constants and global/static variables. It is part of the data segment, whose size is shown by the VmData metric. The data segment also hosts part of the program heap, where dynamic memory is allocated. The data segment is contiguous, i.e. it starts at a certain location and grows upwards towards the stack (which starts at a very high address and then grows downwards). The problem with the heap in the data segment is that it is managed by a special heap allocator that takes care of subdividing the contiguous data segment into smaller memory chunks. On the other hand, in Linux dynamic memory can also be allocated by directly mapping virtual memory. This is usually done only for large allocations in order to conserve memory, since it only allows memory in multiples of the page size (usually 4 KiB) to be allocated.
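
A rough sketch of that split, assuming the glibc allocator with its default mmap threshold of 128 KiB (malloc_stats() is a glibc extension that reports how much was obtained via the brk-managed heap versus via anonymous mappings):

    /* Rough illustration, assuming glibc's malloc: small requests are carved
     * out of the brk-managed heap (counted in VmData), while requests above
     * the mmap threshold are served by private anonymous mappings. */
    #include <stdlib.h>
    #include <malloc.h>   /* malloc_stats() is a glibc extension */

    int main(void)
    {
        /* Many small chunks -> the program break (data segment) grows. */
        for (int i = 0; i < 1000; ++i)
            (void)malloc(1024);

        /* One big chunk -> glibc typically uses mmap instead of brk. */
        char *big = malloc(8UL * 1024 * 1024);   /* 8 MiB > 128 KiB threshold */

        malloc_stats();   /* prints arena (brk) vs. mmap usage to stderr */

        free(big);
        return 0;
    }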

The stack is also an important source of heavy memory usage, especially if big arrays are allocated in automatic (stack) storage. The stack starts near the very top of the usable virtual address space and grows downwards. In some cases it could reach the top of the data segment or it could reach the end of some other virtual allocation. Bad things happen then. The stack size is accounted for in the VmStack metric and also in VmSize (see the sketch after the list below). One can summarise it as follows:

  • VmSize accounts for all virtual memory allocations (file mappings, shared memory, heap memory, whatever memory) and grows almost every time new memory is being allocated. Almost, because if the new heap memory allocation is made in place of a freed old allocation in the data segment, no new virtual memory would be allocated. It decreases whenever virtual allocations are freed. VmPeak tracks the maximum value of VmSize - it can only increase over time.
  • VmRSS grows as memory is being accessed and decreases as memory is paged out to the swap device.
  • VmData grows as the data segment part of the heap is being utilised. It almost never shrinks as current heap allocators keep the freed memory in case future allocations need it.
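
Regarding the stack, a tiny Linux-specific sketch (the 16 MiB size is just an example) that queries the soft stack limit, the one ulimit -s reports, and then touches a large automatic array; with the common 8 MiB default limit this is expected to die with SIGSEGV:

    /* Sketch: a large automatic array consumes stack space; exceeding the
     * soft RLIMIT_STACK limit typically kills the process with SIGSEGV. */
    #include <stdio.h>
    #include <sys/resource.h>

    static void work(void)
    {
        volatile char big[16 * 1024 * 1024];   /* 16 MiB in automatic storage */
        for (size_t i = 0; i < sizeof(big); i += 4096)
            big[i] = 1;                        /* touch every page */
        printf("touched %zu bytes on the stack\n", sizeof(big));
    }

    int main(void)
    {
        struct rlimit rl;
        if (getrlimit(RLIMIT_STACK, &rl) == 0 && rl.rlim_cur != RLIM_INFINITY)
            printf("soft stack limit: %llu KiB\n",
                   (unsigned long long)rl.rlim_cur / 1024);

        work();   /* with the default 8 MiB soft limit this will likely crash */
        return 0;
    }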

If you are running on a cluster with InfiniBand or other RDMA-based fabrics, another kind of memory comes into play - the locked (registered) memory (VmLck). This is memory which is not allowed to be paged out. How it grows and shrinks depends on the MPI implementation. Some never unregister an already registered block (the technical details about why are too complex to be described here), others do so in order to play better with the virtual memory manager.
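
The registration details are implementation-specific, but the effect on VmLck can be mimicked with plain mlock(); an illustrative sketch (not tied to any particular MPI library):

    /* Sketch: pinning a buffer the way RDMA registration does makes it
     * non-pageable, shows up under VmLck and is subject to `ulimit -l`. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
        const size_t len = 4UL * 1024 * 1024;   /* 4 MiB communication buffer */
        char *buf = malloc(len);
        if (!buf) return 1;

        /* Pin the buffer: it can no longer be paged out; VmLck grows by len
         * (rounded to whole pages). Fails with ENOMEM beyond `ulimit -l`. */
        if (mlock(buf, len) != 0) {
            perror("mlock");
            return 1;
        }

        printf("pinned %zu bytes; check VmLck in /proc/self/status\n", len);

        munlock(buf, len);                      /* VmLck drops back */
        free(buf);
        return 0;
    }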

In your case you say that you are running into a virtual memory size limit. This could mean that the limit is set too low or that you are running into an OS-imposed limit. First, Linux (and most Unixes) have means to impose artificial restrictions through the ulimit mechanism. Running ulimit -v in the shell tells you what the limit on the virtual memory size is, in KiB. You can set the limit with ulimit -v <value in KiB>. This only applies to processes spawned by the current shell and to their children, grandchildren and so on. You need to instruct mpiexec (or mpirun) to propagate this value to all other processes, if they are to be launched on remote nodes. If you are running your program under the control of some workload manager like LSF, Sun/Oracle Grid Engine, Torque/PBS, etc., there are job parameters which control the virtual memory size limit. And last but not least, 32-bit processes are usually restricted to 2 GiB of usable virtual memory.
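
For completeness, the programmatic counterpart of ulimit -v is RLIMIT_AS, which caps the total virtual address space (what VmSize is checked against). A small sketch, with the 2 GiB value being purely an example:

    /* Sketch: query and lower the virtual address space limit (RLIMIT_AS),
     * i.e. the same knob that `ulimit -v` exposes in the shell. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_AS, &rl) == 0) {
            if (rl.rlim_cur == RLIM_INFINITY)
                printf("virtual memory size: unlimited\n");
            else
                printf("virtual memory limit: %llu KiB\n",
                       (unsigned long long)rl.rlim_cur / 1024);
        }

        /* Lower the soft limit to 2 GiB (same as `ulimit -v 2097152`);
         * a process may lower it freely, raising it past the hard limit
         * requires privileges. */
        rl.rlim_cur = 2ULL * 1024 * 1024 * 1024;
        if (setrlimit(RLIMIT_AS, &rl) != 0)
            perror("setrlimit");

        return 0;
    }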

Comments

  • vovo almost 2 years

    I searched for my problem on Google and on this site, but I still don't understand the solution.

    I have a piece of an MPI program which RECVs some data. The program crashes on big arrays with an insufficient virtual memory error, so I started to look at the /proc/self/status file.

    Before MPI_RECV it was:

    Name:   model.exe                                                               
    VmPeak:   841640 kB
    VmSize:   841640 kB
    VmHWM:     15100 kB
    VmRSS:     15100 kB
    VmData:   760692 kB
    

    And after:

    Name:   model.exe                                                            
    VmPeak:   841640 kB
    VmSize:   841640 kB
    VmHWM:    719980 kB
    VmRSS:    719980 kB
    VmData:   760692 kB
    

    I tested it on Ubuntu and through the System Monitor I saw this memory increase. But I was confused that there were no changes in the VmSize (and VmPeak) parameters.

    And the question is - what is the indicator of real memory usage?

    Does it mean that the true indicator is VmRSS? (And that VmSize is only allocated but still unused memory?)

  • vovo over 11 years
    So it means that VmSize is the maximum of all allocated data (used or unused). I am confused because the admin of our supercomputer wrote to me that they have 12 GB per 8 cores. I look at the VmSize parameter (it is the maximum of all memory types) and all of my 8 processes use about 800-1000 MB. Does it mean that there is some other memory somewhere that is not shown by /proc/self/status?
  • Hristo Iliev over 11 years
    @vovo, what MPI library do you use? Typically, for intranode communication (i.e. between processes that reside on the same physical node) shared memory is used, and for internode communication over fast networks (e.g. InfiniBand) memory is locked. Besides, not all 12 GiB are available to user processes - the OS kernel and the supporting userland tools need memory too.
  • vovo over 11 years
    I use the Intel compiler and impi over InfiniBand. Of course, I run on thousands of cores - 8 cores per node (12 GB per node). Is there any way to know the real limit of my task (without the OS, tools, etc.)?
  • Hristo Iliev over 11 years
    @vovo, put ulimit -a in the job script that you submit to the batch system on the cluster and examine the output. If using Sun/Oracle Grid Engine, note that it sets bigger limits on the initial task, i.e. the output from ulimit -a might show way bigger limits than the real MPI processes would get (never understood if this is a bug or a feature of SGE).