How does Intel TBB's scalable_allocator work?


Solution 1

There is a good paper on the allocator: The Foundations for Scalable Multi-core Software in Intel Threading Building Blocks

My limited experience: I overloaded the global new/delete with tbb::scalable_allocator for my AI application, but there was little change in the time profile. I didn't compare the memory usage, though.
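
For reference, a minimal sketch of that kind of global overload, assuming the C entry points scalable_malloc/scalable_free from <tbb/scalable_allocator.h> and linking against the tbbmalloc library. The nothrow and aligned overloads and the std::new_handler retry loop are left out for brevity:

    #include <tbb/scalable_allocator.h>  // scalable_malloc / scalable_free (link with tbbmalloc)
    #include <cstddef>
    #include <new>

    // Route every global allocation through the TBB scalable allocator.
    void* operator new(std::size_t size) {
        if (size == 0) size = 1;          // operator new must return a unique pointer even for 0 bytes
        if (void* p = scalable_malloc(size))
            return p;
        throw std::bad_alloc();
    }

    void* operator new[](std::size_t size) {
        return ::operator new(size);
    }

    void operator delete(void* p) noexcept {
        if (p) scalable_free(p);
    }

    void operator delete[](void* p) noexcept {
        ::operator delete(p);
    }

Recent TBB releases also ship a tbbmalloc_proxy library that can replace malloc/free (and the C++ operators) process-wide, either at link time or via LD_PRELOAD, with no source changes.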

Solution 2

The solution you mention is optimized for Intel CPUs and incorporates CPU-specific mechanisms to improve performance.

Some time ago I found another very useful solution: Fast C++11 allocator for STL containers. It speeds up STL containers on VS2017 (~5x) as well as on GCC (~7x). It uses a memory pool for element allocation, which makes it extremely effective on all platforms.
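
For flavour, here is a minimal free-list pool allocator in the same spirit (not the linked article's actual code): it carves fixed-size chunks out of larger blocks, which suits node-based containers such as std::list or std::map that allocate one element at a time. It is single-threaded, treats all instances as interchangeable, and keeps its blocks until destruction:

    #include <cstddef>
    #include <list>
    #include <memory>
    #include <new>
    #include <vector>

    template <class T>
    class PoolAllocator {
    public:
        using value_type = T;

        PoolAllocator() = default;
        PoolAllocator(const PoolAllocator&) noexcept {}     // copies start with their own empty pool
        template <class U>
        PoolAllocator(const PoolAllocator<U>&) noexcept {}  // rebound copies likewise

        T* allocate(std::size_t n) {
            if (n != 1)                                      // the pool only serves single objects (container nodes)
                return static_cast<T*>(::operator new(n * sizeof(T)));
            if (!free_list_) refill();
            Node* node = free_list_;
            free_list_ = node->next;
            return reinterpret_cast<T*>(node);
        }

        void deallocate(T* p, std::size_t n) noexcept {
            if (n != 1) { ::operator delete(p); return; }
            Node* node = reinterpret_cast<Node*>(p);         // push the chunk back onto the free list
            node->next = free_list_;
            free_list_ = node;
        }

    private:
        union Node {
            Node* next;                                      // used while the chunk sits on the free list
            alignas(T) unsigned char storage[sizeof(T)];     // used while the chunk holds a T
        };
        static const std::size_t kBlockSize = 4096;          // chunks carved out per refill

        void refill() {
            std::unique_ptr<Node[]> block(new Node[kBlockSize]);
            for (std::size_t i = 0; i + 1 < kBlockSize; ++i)
                block[i].next = &block[i + 1];               // thread the fresh chunks together
            block[kBlockSize - 1].next = nullptr;
            free_list_ = block.get();
            blocks_.push_back(std::move(block));
        }

        Node* free_list_ = nullptr;
        std::vector<std::unique_ptr<Node[]>> blocks_;        // owns every block; freed on destruction
    };

    // Simplification: all instances are treated as interchangeable.
    template <class T, class U>
    bool operator==(const PoolAllocator<T>&, const PoolAllocator<U>&) { return true; }
    template <class T, class U>
    bool operator!=(const PoolAllocator<T>&, const PoolAllocator<U>&) { return false; }

    int main() {
        std::list<int, PoolAllocator<int>> xs;               // node allocations come from the pool
        for (int i = 0; i < 100000; ++i)
            xs.push_back(i);
    }

The speedup comes from replacing one heap round-trip per node with a pointer pop/push on the free list; a production pool would also need to handle thread safety and shared state between allocator copies.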



Comments

  • timday
    timday over 3 years

    What does the tbb::scalable_allocator in Intel Threading Building Blocks actually do under the hood?

    It can certainly be effective. I've just used it to take 25% off an app's execution time (and seen CPU utilization increase from ~200% to 350% on a 4-core system) by changing a single std::vector<T> to std::vector<T,tbb::scalable_allocator<T> >. On the other hand, in another app I've seen it double an already large memory consumption and send things to swap city.

    Intel's own documentation doesn't give a lot away (e.g. a short section at the end of this FAQ). Can anyone tell me what tricks it uses before I go and dig into its code myself?

    UPDATE: Just using TBB 3.0 for the first time, and I've seen my best speedup from scalable_allocator yet. Changing a single vector<int> to a vector<int,scalable_allocator<int> > reduced the runtime of something from 85s to 35s (Debian Lenny, Core2, with TBB 3.0 from testing). A sketch of this kind of change is shown after the comments below.

  • timday
    timday about 15 years
    Thanks! The article contains exactly the sort of information I was looking for.
  • Arto Bendiken
    Arto Bendiken about 11 years
    The original link is now defunct, but CiteSeer has the PDF: citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.71.8289
  • Adam
    Adam almost 10 years
    To add a data point: in my particular app, allocator contention capped the speedup at around 15 threads; beyond that it killed all speedup, and by 40 threads the run was much slower than single-threaded. With scalable_allocator used in the inner per-thread kernels, the bottleneck disappeared and the expected scaling came back. (The machine has 40 physical cores.)
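
To make the kind of change described in these comments concrete, here is a minimal sketch: a per-thread scratch vector whose allocator is swapped from the default to tbb::scalable_allocator. The threading with std::thread, the kernel function, and the sizes are illustrative assumptions; the comments above don't say how their applications spawn work.

    #include <tbb/scalable_allocator.h>  // tbb::scalable_allocator (link with tbbmalloc)
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Before: std::vector<int>            -- every grow hits the shared global heap.
    // After:  the scratch vector below allocates through the TBB scalable allocator instead.
    using ScratchVector = std::vector<int, tbb::scalable_allocator<int>>;

    void kernel(int id, std::size_t n) {
        ScratchVector scratch;             // the only change from the stock container is the allocator argument
        scratch.reserve(n);
        for (std::size_t i = 0; i < n; ++i)
            scratch.push_back(static_cast<int>(i) ^ id);
        // ... per-thread work on scratch ...
    }

    int main() {
        std::vector<std::thread> workers;
        for (int t = 0; t < 8; ++t)
            workers.emplace_back(kernel, t, std::size_t(1) << 20);
        for (auto& w : workers) w.join();
    }

Whether this helps (as in the 85s to 35s case) or hurts (as in the doubled memory footprint case) is workload-dependent, so it is worth measuring both time and memory.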