CPU and memory usage of jemalloc as compared to glibc malloc


Solution 1

A speaker at CppCon once said that you should never guess about performance; you have to measure it instead.

I tried jemalloc with a multithreaded Linux application: a custom application-level protocol server (over TCP/IP). This C++ application used some Java code via JNI (roughly 5% of the time it ran Java and 95% of the time C++ code). I ran 2 application instances in production mode, each with 150 threads.

After 72 hours of running, the glibc instance used 900 MB of memory and the jemalloc instance used 2.2 GB. I did not see a significant difference in CPU usage, and actual performance (average client request serving time) was nearly the same for both instances.

So, in my test glibc was much better than jemalloc. Of course, this is specific to my application.

Conclusion: If you have reason to think that your application's memory management is inefficient because of fragmentation, you have to run a test similar to the one I described. It is the only reliable source of information for your specific needs. If jemalloc were always better than glibc, glibc would make jemalloc its official allocator. If glibc were always better, jemalloc would cease to exist. When competitors coexist for a long time, it means each one has its own usage niche.
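As a rough illustration, here is a minimal sketch of such an A/B test: a small multithreaded allocation workload that reports its wall time and resident set size, which you could build once against glibc malloc and once against jemalloc (for example by linking with -ljemalloc or preloading the jemalloc shared library) and compare. All thread counts, block sizes, and iteration counts below are arbitrary placeholders, not numbers taken from the answer above.

    // Minimal sketch of an allocator A/B test: build/run this once against glibc
    // malloc and once against jemalloc, then compare the printed numbers.
    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <string>
    #include <thread>
    #include <vector>

    // Read the resident set size (VmRSS, in kB) of the current process.
    static long rss_kb() {
        std::FILE* f = std::fopen("/proc/self/status", "r");
        if (!f) return -1;
        char line[256];
        long kb = -1;
        while (std::fgets(line, sizeof(line), f))
            if (std::sscanf(line, "VmRSS: %ld kB", &kb) == 1) break;
        std::fclose(f);
        return kb;
    }

    // Each worker churns through variable-sized allocations while keeping a
    // bounded set of blocks alive, loosely imitating a multithreaded server.
    static void worker(std::size_t iterations) {
        std::mt19937 rng(std::random_device{}());
        std::uniform_int_distribution<std::size_t> size_dist(16, 4096);
        std::vector<std::string> live;
        for (std::size_t i = 0; i < iterations; ++i) {
            live.emplace_back(size_dist(rng), 'x');
            if (live.size() > 1000)
                live.erase(live.begin());   // free the oldest block
        }
    }

    int main() {
        const unsigned nthreads = 8;        // illustrative; the test above used 150
        const auto start = std::chrono::steady_clock::now();

        std::vector<std::thread> threads;
        for (unsigned t = 0; t < nthreads; ++t)
            threads.emplace_back(worker, 500'000);
        for (auto& t : threads)
            t.join();

        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        std::printf("elapsed: %.2f s, VmRSS: %ld kB\n", elapsed.count(), rss_kb());
        return 0;
    }

Letting the two builds run side by side for a long period, as in the 72-hour comparison above, is what makes the numbers trustworthy for your own workload.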

Solution 2

At Aerospike we integrated jemalloc into our NoSQL database and publicly released that implementation about a year ago with v3.3.x. Just today Psi Mankoski published an article on High Scalability about why and how we did it, and the performance improvement it gave compared to glibc malloc.

We actually saw a decrease in RAM utilization because of the way we were able to use jemalloc's debugging capability to minimize RAM fragmentation. In the production environment, server % free memory was often a "spiky graph," spiking as high as 54% prior to the switch to jemalloc. After the switch, you can see the decrease in RAM utilization over the 4-month analysis period: % free memory began to "flatline" and became far more predictable, hovering between roughly 22% and 40% depending on the server node.

As Preet says, there was a lot less fragmentation over time, which means lower RAM utilization. Psi's article provides the "proof in the pudding" behind that statement.
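For reference, jemalloc exposes its internal statistics through a non-standard API, which is one way to watch allocated versus resident memory (and hence fragmentation) at runtime. The sketch below is not Aerospike's actual instrumentation; it only illustrates the kind of introspection capability mentioned above, and it assumes a jemalloc build whose API is exposed without a name prefix (builds configured with a prefix export je_mallctl and friends instead).

    // Sketch of reading jemalloc's own statistics, assuming an unprefixed build.
    #include <jemalloc/jemalloc.h>
    #include <cstdint>
    #include <cstdio>

    static void print_jemalloc_usage() {
        // Refresh the statistics snapshot.
        uint64_t epoch = 1;
        size_t sz = sizeof(epoch);
        mallctl("epoch", &epoch, &sz, &epoch, sizeof(epoch));

        size_t allocated = 0, active = 0, resident = 0;
        sz = sizeof(size_t);
        mallctl("stats.allocated", &allocated, &sz, nullptr, 0); // bytes handed to the app
        mallctl("stats.active",    &active,    &sz, nullptr, 0); // bytes in active pages
        mallctl("stats.resident",  &resident,  &sz, nullptr, 0); // bytes physically resident

        // The gap between active/resident and allocated is a rough fragmentation signal.
        std::printf("allocated=%zu active=%zu resident=%zu\n",
                    allocated, active, resident);

        // Full human-readable dump (can be large):
        // malloc_stats_print(nullptr, nullptr, nullptr);
    }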

Solution 3

This question might not belong here, since for real-world decisions it should be irrelevant what other people found on their different hardware, environments, and usage scenarios. You should test on the target system and see what suits you.

As for the higher memory footprint, one of the most classical performance optimizations in computer science is the time-memory tradeoff: caching certain results for instant lookup later instead of recalculating them frequently. Also, since jemalloc is presumably a lot more complex, there is probably a lot more internal bookkeeping. This kind of tradeoff should be more or less expected, especially when picking between variants of such low-level and widely used core modules. You have to match the performance characteristics to your usage characteristics, since there is usually no silver bullet.

You might also want to look at Google's TCMalloc, which is quite similar, although I believe jemalloc is slightly more performant in general and also creates less heap fragmentation over time.

Solution 4

I am developing a simple NoSQL database (https://github.com/nmmmnu/HM4).

jemalloc vs standard malloc

When I use jemalloc, performance decreases, but memory "fragmentation" decreases as well. jemalloc also seems to use less memory at the peak, but the difference is only 5-6%.

What I mean by memory fragmentation is as follows:

  • First, I allocate lots of key-value pairs (5-7 GB of memory).
  • Then I look at the memory usage.
  • Then I deallocate all the pairs and any other memory my executable uses. The order of deallocation is the same as the order of allocation.
  • Finally I check memory usage again.

With standard malloc, usage stays almost at the peak level. (I specifically checked for mmap-ed memory, and there is none.)

With jemalloc, usage is minimal.
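A minimal sketch of this kind of fragmentation check is shown below. The container and sizes are illustrative placeholders, not HM4's actual data structures: it allocates a large number of key/value pairs, prints VmRSS at the peak, frees everything, then prints VmRSS again.

    // Sketch of the peak-vs-after-free check described above (not HM4 code).
    #include <cstddef>
    #include <cstdio>
    #include <string>
    #include <utility>
    #include <vector>

    static long rss_kb() {                  // VmRSS in kB from /proc/self/status
        std::FILE* f = std::fopen("/proc/self/status", "r");
        if (!f) return -1;
        char line[256];
        long kb = -1;
        while (std::fgets(line, sizeof(line), f))
            if (std::sscanf(line, "VmRSS: %ld kB", &kb) == 1) break;
        std::fclose(f);
        return kb;
    }

    int main() {
        {
            std::vector<std::pair<std::string, std::string>> store;
            store.reserve(5'000'000);
            for (std::size_t i = 0; i < 5'000'000; ++i)      // roughly a few GB
                store.emplace_back("key:" + std::to_string(i), std::string(512, 'v'));
            std::printf("VmRSS at peak:       %ld kB\n", rss_kb());
        }   // store goes out of scope here and every pair is deallocated

        std::printf("VmRSS after freeing: %ld kB\n", rss_kb());
        // Per the observation above, with glibc malloc the second number tends to
        // stay close to the peak; with jemalloc it drops much further.
        return 0;
    }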


bonus information - tcmalloc

Last time I checked tcmalloc, it was really very fast - probably a 10% improvement over standard malloc.

At the peak, it consumes less memory than standard malloc, but more than jemalloc.

I do not remember the exact memory fragmentation, but it was far from jemalloc's result.

Solution 5

This paper investigates the performance of different memory allocators.

Some of its conclusions are shared here:

[Figure 1 from the paper: memory consumption and execution time for TPC-DS (scale factor 100) under different allocators]

Figure 1 shows the effects of different allocation strategies on TPC-DS with scale factor 100. We measure memory consumption and execution time with our multi-threaded database system on a 4-socket Intel Xeon server. In this experiment, our DBMS executes the query set sequentially using all available cores. Even this relatively simple workload already results in significant performance and memory usage differences. Our database linked with jemalloc can reduce the execution time to 1/2 in comparison to linking it with the standard malloc of glibc 2.23.


Comments

  • deb
    deb almost 2 years

    I recently learnt about jemalloc, the memory allocator used by Firefox. I have tried integrating jemalloc into my system by overriding the new and delete operators and calling the jemalloc equivalents of malloc and free, i.e. je_malloc and je_free (a minimal sketch of such an override appears after these comments). I have written a test application that does 100 million allocations and run it with both glibc malloc and jemalloc. While running with jemalloc takes less time for these allocations, the CPU utilization is quite high, and the memory footprint is also larger compared to malloc. After reading this document on jemalloc analysis, it seemed that jemalloc might have a larger footprint than malloc because it employs techniques that favor speed over memory. However, I have not found any pointers regarding CPU usage with jemalloc. I should note that I am working on a multiprocessor machine, the details of which are given below.

        processor       : 11
        vendor_id       : GenuineIntel
        cpu family      : 6
        model           : 44
        model name      : Intel(R) Xeon(R) CPU X5680 @ 3.33GHz
        stepping        : 2
        cpu MHz         : 3325.117
        cache size      : 12288 KB
        physical id     : 1
        siblings        : 12
        core id         : 10
        cpu cores       : 6
        apicid          : 53
        fpu             : yes
        fpu_exception   : yes
        cpuid level     : 11
        wp              : yes
        flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx pdpe1gb rdtscp lm constant_tsc ida nonstop_tsc arat pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
        bogomips        : 6649.91
        clflush size    : 64
        cache_alignment : 64
        address sizes   : 40 bits physical, 48 bits virtual
        power management: [8]

    I am using top -c -b -d 1.10 -p 24670 | awk -v time=$TIME '{print time,",",$9}' to keep track of the CPU usage.

    Has anyone had similar experiences while integrating jemalloc?

    Thanks!

  • deb
    deb over 11 years
    Thanks for your comments, Preet. I was trying to figure out whether someone had made similar observations on another multiprocessor machine. I completely agree that the exact performance is hardware dependent; however, I was wondering whether the pattern of CPU utilization would be the same, i.e. higher for jemalloc in a multiprocessor environment.
  • Myst
    Myst almost 6 years
    I'm curious to compare your results with a (semi-)lockless allocator I wrote (it's only 3 files: a short header, a source file, and a spin lock). It's less generic, but it could easily be used as a drop-in malloc replacement.
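For completeness, here is a minimal sketch of the integration described in the question above: replacing the global operator new/delete so that C++ allocations go through jemalloc. The je_malloc/je_free names follow the question (a jemalloc build configured with the "je_" prefix); an unprefixed build simply exports malloc/free and needs no override at all, just linking or preloading.

    // Minimal sketch: route global new/delete to jemalloc's prefixed entry points.
    #include <cstddef>
    #include <new>

    extern "C" {
        void* je_malloc(std::size_t size);
        void  je_free(void* ptr);
    }

    void* operator new(std::size_t size) {
        if (void* p = je_malloc(size ? size : 1))
            return p;
        throw std::bad_alloc();             // contract of the throwing operator new
    }

    void operator delete(void* ptr) noexcept {
        if (ptr) je_free(ptr);
    }

    // Array forms simply forward to the scalar forms above.
    void* operator new[](std::size_t size)      { return operator new(size); }
    void  operator delete[](void* ptr) noexcept { operator delete(ptr); }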