Linux C++: how to profile time wasted due to cache misses?

Solution 1

You could try Cachegrind and its front end, KCachegrind.

Solution 2

Linux supports this with perf from kernel 2.6.31 on. It allows you to do the following:

  • compile your code with -g to have debug information included
  • run your code, e.g. using the last-level cache miss counters: perf record -e LLC-loads,LLC-load-misses yourExecutable
  • run perf report
    • after acknowledging the initial message, select the LLC-load-misses line,
    • then e.g. the first function and
    • then annotate. You should see the lines (in assembly, interleaved with the original source code) where cache misses occurred, each with a number indicating its fraction of the last-level cache misses. A small test program to try this on is sketched below.

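If you want something concrete to profile, here is a minimal sketch of a deliberately cache-unfriendly program (the file name, matrix size and traversal order are just illustration choices, not from the original answer):

// cache_miss_demo.cpp
// Build:   g++ -O2 -g cache_miss_demo.cpp -o cache_miss_demo
// Profile: perf record -e LLC-loads,LLC-load-misses ./cache_miss_demo
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const int n = 8192;  // 8192*8192 ints = 256 MiB, far larger than any LLC
    std::vector<int> m(static_cast<std::size_t>(n) * n, 1);
    long long sum = 0;
    // Column-major walk over a row-major array: consecutive accesses are
    // n*sizeof(int) bytes apart, so nearly every load touches a new cache line.
    for (int col = 0; col < n; ++col)
        for (int row = 0; row < n; ++row)
            sum += m[static_cast<std::size_t>(row) * n + col];
    std::printf("%lld\n", sum);
}
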
Solution 3

You could find a tool that accesses the CPU performance counters. There is a register in each core that counts L1, L2, etc. misses. Alternatively, Cachegrind simulates the cache hierarchy instead of reading hardware counters.
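
On Linux, those counters are exposed through the perf_event_open system call. A minimal sketch of reading one directly (the counter choice is just an example, and error handling is trimmed):

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

int main() {
    perf_event_attr attr{};
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CACHE_MISSES;  // last-level cache misses
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    // Measure this process on any CPU. glibc has no wrapper, hence syscall().
    int fd = (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    // ... run the code you want to measure here ...
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    std::uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));
    std::printf("LLC misses: %llu\n", (unsigned long long)misses);
    close(fd);
}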

However, I don't think that would be insightful. Your proxy objects are presumably modified by their own methods. A conventional profiler will tell you how much time those methods are taking. No profiling tool will tell you how much performance would improve if that source of cache pollution were removed; that depends on the size and structure of the program's working set, which is hard to extrapolate.

A quick Google search turned up boost::intrusive_ptr, which might interest you. It doesn't appear to support something like weak_ptr, but converting your program might be trivial, and then you would know for sure the cost of the non-intrusive ref counts.
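
For reference, the intrusive scheme looks roughly like this: the count lives inside the object, so reaching it costs no extra indirection (Obj and its members are made up for illustration):

#include <boost/intrusive_ptr.hpp>
#include <atomic>

struct Obj {
    std::atomic<int> refs{0};  // the count lives in the object itself
    int field = 0;
};

// boost::intrusive_ptr finds these two free functions by name.
void intrusive_ptr_add_ref(Obj* p) { p->refs.fetch_add(1, std::memory_order_relaxed); }
void intrusive_ptr_release(Obj* p) {
    if (p->refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
        delete p;
}

int main() {
    boost::intrusive_ptr<Obj> p(new Obj);  // count -> 1
    p->field = 42;  // one pointer chase; no separate control block to miss on
}                   // count -> 0, Obj deleted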

Solution 4

Continuing along the lines of @Mike_Dunlavey's answer:

First, obtain a time-based profile, using your favorite tool: VTune or PTU or OProf.

Then, obtain a cache miss profile. L1 cache misses, or L2 cache misses, or ...

I.e. the first profile associates a "time spent" with each program counter. The second associates a "number of cache misses" value with each program counter.

Note: I often "reduce" the data, summing it up by function, or (if I have the technology) by loop. Or by bins of, say, 64 bytes. Comparing individual program counters is often not useful, because the performance counters are fuzzy - the place where a cache miss gets reported is often several instructions away from where it actually happened. A reduction along those lines is sketched below.
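
Here is a sketch of such a reduction (the Sample layout is hypothetical; adapt it to whatever your profiler emits):

#include <cstdint>
#include <map>
#include <vector>

struct Sample { std::uint64_t pc; double time; std::uint64_t misses; };
struct Bin    { double time = 0; std::uint64_t misses = 0; };

// Sum samples into 64-byte PC bins so counter "skid" of a few
// instructions doesn't scatter what is really one hot spot.
std::map<std::uint64_t, Bin> reduce(const std::vector<Sample>& samples) {
    std::map<std::uint64_t, Bin> bins;
    for (const Sample& s : samples) {
        Bin& b = bins[s.pc & ~std::uint64_t(63)];  // mask low 6 bits
        b.time   += s.time;
        b.misses += s.misses;
    }
    return bins;
}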

OK, so now graph these two profiles to compare them. Here are some graphs that I find useful:

"Iceberg" charts: X axis is PC, positive Y axis is time, negative Y access is cache misses. Look for places that go both up and down.

("Interleaved" charts are also useful: same idea, X axis is PC, plot both time and cache misseson Y axis, but with narrow vertical lines of different colors, typically red and blue. Places where is a lot of both time and cache misses spent will have finely interleaved red and blue lines, almost looking purple. This extends to L2 and L3 cache misses, all on the same graph. By the way, you probably want to "normalize" the numbers, either to %age of total time or cache misses, or, even better, %age of the maximum data point of time or cache misses. If you get the scale wrong, you won't see anything.)

XY charts: for each sampling bin (PC, or function, or loop, or...) plot a point whose X coordinate is the normalized time, and whose Y coordinate is the normalized cache misses. If you get a lot of data points in the upper right hand corner - large %age time AND large %age cache misses - that is interesting evidence. Or, forget number of points - if the sum of all percentages in the upper corner is big...
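
The normalization step is simple but easy to get wrong; here is a standalone sketch (inputs are per-bin totals from any reduction, assumed non-empty, printed as gnuplot-ready points):

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

void emit_xy(const std::vector<double>& time, const std::vector<double>& misses) {
    double max_t = *std::max_element(time.begin(), time.end());
    double max_m = *std::max_element(misses.begin(), misses.end());
    for (std::size_t i = 0; i < time.size(); ++i)
        // Points near (100, 100) are both hot and miss-heavy: the interesting ones.
        std::printf("%f %f\n", 100.0 * time[i] / max_t, 100.0 * misses[i] / max_m);
}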

Note, unfortunately, that you often have to roll these analyses yourself. Last I checked, VTune did not do it for you. I have used gnuplot and Excel. (Warning: Excel dies above 64 thousand data points.)


More advice:

If your smart pointer is inlined, you may get the counts all over the place. In an ideal world you would be able to trace back PCs to the original line of source code. In this case, you may want to defer the reduction a bit: look at all individual PCs; map them back to lines of source code; and then map those into the original function. Many compilers, e.g. GCC, have symbol table options that allow you to do this.

By the way, I suspect that your problem is NOT the smart pointer causing cache thrashing - unless you are doing smart_ptr<int> all over the place. If you are doing smart_ptr<Obj>, and sizeof(Obj) is greater than, say, 4*sizeof(Obj*) (and the smart_ptr itself is not huge), then the overhead is not that much.

More likely it is the extra level of indirection that the smart pointer adds that is causing your problem.

Coincidentally, I was talking to a guy at lunch who had a reference counted smart pointer that was using a handle, i.e. a level of indirection, something like this:

template<typename T> class refcnt_handle;  // forward declaration

template<typename T> class refcntptr {
    refcnt_handle<T>* handle;  // every copy of the pointer shares this heap node
public:
    refcntptr(T* obj) {
        this->handle = new refcnt_handle<T>();
        this->handle->ptr = obj;
        this->handle->count = 1;
    }
    // copy constructor, destructor, etc. omitted for exposition
};

template<typename T> class refcnt_handle {
    T* ptr;
    int count;
    friend class refcntptr<T>;
};

(I wouldn't code it this way, but it serves for exposition.)

The double indirection this->handle->ptr can be a big performance problem. Or even a triple indirection, this->handle->ptr->field. At the least, on a machine with 5 cycle L1 cache hits, each this->handle->ptr->field would take 10 cycles. And be much harder to overlap than a single pointer chase. But, worse, if each is an L1 cache miss, even if it were only 20 cycles to the L2... well, it is much harder to hide 2*20=40 cycles of cache miss latency, than a single L1 miss.

In general, it is good advice to avoid levels of indirection in smart pointers. Instead of having every smart pointer point to a shared handle, which itself points to the object, you might make the smart pointer bigger by having it point both to the object and to the handle. (Which then is no longer what is commonly called a handle, but is more like an info object.)

E.g.

template<typename T> class refcnt_info;  // forward declaration

template<typename T> class refcntptr {
    refcnt_info<T>* info;
    T* ptr;  // direct pointer: reaching the object is one load, not two
public:
    refcntptr(T* obj) {
        this->ptr = obj;
        this->info = new refcnt_info<T>();
        this->info->ptr = obj;
        this->info->count = 1;
    }
    // copy constructor, destructor, etc. omitted for exposition
};

template<typename T> class refcnt_info {
    T* ptr; // perhaps not necessary, but useful.
    int count;
    friend class refcntptr<T>;
};

Anyway - a time profile is your best friend.


Oh, yeah - Intel EMON hardware can also tell you how many cycles you waited at a PC. That can distinguish a large number of L1 misses from a small number of L2 misses.

Solution 5

It depends on what OS and CPU you are using. E.g. for Mac OS X and x86 or ppc, Shark will do cache miss profiling. Ditto for Zoom on Linux.

Comments

  • anon
    anon over 3 years

    I know that I can use gprof to benchmark my code.

    However, I have this problem -- I have a smart pointer that has an extra level of indirection (think of it as a proxy object).

    As a result, I have this extra layer that affects pretty much all functions, and screws with caching.

    Is there a way to measure the time my CPU wastes due to cache misses?

  • Potatoswatter
    Potatoswatter about 14 years
    Problem is, cache pollution will cause misses all over the place. What pattern is there to look for?
  • Matthieu M.
    Matthieu M. about 14 years
    Indeed it's not possible to use a weak_ptr with an intrusive counter, as the counter is destroyed with the object... and so the weak_ptr has no way to check whether or not the object is valid without actually accessing it.
  • Potatoswatter
    Potatoswatter about 14 years
    @Matthieu: If the dependency graph is known to be a single cycle, I think you can use each object's link (there must be only one) as a validity flag. For the purpose of destruction anyway. Traversing a random graph would require thread-local storage, but that's not impossible.
  • Fabien Hure
    Fabien Hure about 14 years
    First thing you need to find out is: is there really a problem in your particular application? Profile your application as a user would use it, then check in the report where your bottlenecks are located. You might find a high number of L2 cache line misses, but it might be caused by other parts of your application, and you may discover other issues that you were not worried about. It does not mean that you do not have a problem with your smart pointers, only that it is hidden behind more pressing bottlenecks. Tell me how it goes - I'd be happy to help on any performance issue.
  • federal
    federal over 12 years
    Unfortunately, Shark isn't supported by Apple anymore, so it won't work for much longer as new OSX/iOS versions come out. Zoom is still actively developed/updated and got a major release just a few months ago.
  • Paul R
    Paul R over 12 years
    @federal: indeed - it's sad that Shark has now lapsed into a state of neglect - Instruments just doesn't cut it for the kind of tasks that Shark excelled at. I've decided to stick with Snow Leopard and Xcode 3.2.6 for as long as I realistically can - maybe in a year or two Xcode 4.x and Instruments will be more useful.
  • federal
    federal over 11 years
    UPDATE: Zoom (v3.0) now profiles on Mac OS X and Linux. Shark is officially gone.
  • Paul R
    Paul R over 11 years
    Also Instruments has improved somewhat, but I think it still doesn't do cache miss profiling.
  • Krazy Glew
    Krazy Glew almost 11 years
    However, spending 50% of the time with the program counter at retirement stalled pointing to a cache miss is NOT the same as saying that the program is spending 50% of its time in cache misses, or that the program would double in speed if all cache misses were removed. In a speculative machine a cache miss may not stall retirement, but may stall another instruction that itself stalls retirement.
  • Mike Dunlavey
    Mike Dunlavey almost 11 years
    @KrazyGlew: I'm sure you're right that the chip architects, in their laudable quest for speed, have made it less than easy to track down just which code is causing them headaches :)
  • Krazy Glew
    Krazy Glew almost 11 years
    @Mike_Dunlavey: mea culpa. Both as a CPU architect, and as the CPU architect responsible for making it possible to profile on cache misses (EMON profiling). But, I'm also a performance analyst, was a perf analyst before I became a CPU architect. If I knew of a cheap way to measure exactly how much time is being lost to cache misses, I would have built it. In the mean time, the best I can think of doing is to describe what is being measured.
  • Mike Dunlavey
    Mike Dunlavey almost 11 years
    @KrazyGlew: The last time I worked directly with chip-level guys (BCE) they wanted to help me out with measurement timers of one sort or another. I said thanks but no thanks. Measurement doesn't tell me what I need to know. What I need to know is, at a random time of my choosing, What Is It Waiting For Right Now (WIIWFRN :) See the difference? If I ask it that question, at 10 random times, and it tells me on 3 of those times it's in cache wait due to IP=foo, I know exactly what code needs to be fixed to pick up 30% (roughly). An ICE sufficed, back then.
  • Mike Dunlavey
    Mike Dunlavey almost 11 years
    + As long as we're on the subject, here's what I wish the CPU would do: When it enters a cache-wait (L1 or L2) it records the PC that triggered it. Then if an interrupt has been requested, it should allow it to be answered, either during or just after the cache-wait, and let that information be queried. (It's OK if it causes the wait to restart.) Then if software wants to get lots of those and sum them up, it can. Personally I recommend against the summing-by-region part, because all that does is smear out the precise locations that nail the problem.
  • Krazy Glew
    Krazy Glew almost 11 years
    Unfortunately, on many machines the PC (IP in x86 land) of the instruction that caused the cache miss is not available. The PC (IP) of the instruction that is stalling retirement waiting for the cache miss to come back is available, but not the PC that it is waiting on. We don't even have, easy to hand, what the stalling instruction is waiting on - all it knows is that it is waiting on an input register. // Now, some more recent processors route the PC all the way to the memory units, to do stuff like STLF prediction. If I were still at Intel I'd have done this already.
  • Krazy Glew
    Krazy Glew almost 11 years
    AMD IBS (Instruction Based Sampling) is relevant here.
  • Mike Dunlavey
    Mike Dunlavey almost 11 years
    I looked that up - impressive. It amazes me how chip jockeys tune their stuff to ever higher performance (though they may be hitting a limit), while software developers go the other way. (Of course it's amusing how they always end up using matrix multiplication as the canonical speed test, when real mega-liners waste tons of time simply with not-really-necessary function calls.)
  • Richard
    Richard almost 7 years
    Where can one find a list of events such as this?
  • Andre Holzner
    Andre Holzner almost 7 years
    perf list prints the list of available events. The list of events depends on the machine you are running on. On a virtual machine I logged on I get 10 software events and two 'Raw hardware event descriptor' events. On a physical machine I just logged on (Xeon E5), I get 26 'Hardware cache event' types, 10 'Hardware event' types, 28 'Kernel PMU event' types, 10 'Software event' types, two 'Raw hardware event descriptor' events and one 'Hardware breakpoint' type.
  • Krazy Glew
    Krazy Glew over 5 years
    @Potatoswatter - noticed comment years later. WRT patterns, I look for frequency patterns. E.g. if you have data addresses, histogram by associative sets in L1/L2/TLB. Try a Fourier transform to look for frequencies. Those are spatial - if you have the time of each miss, you can look for both time and spatial frequencies.