Quick way to count number of instructions executed in a C program

c linux profile

14,094

Solution 1

Linux perf_event_open system call with config = PERF_COUNT_HW_INSTRUCTIONS

This Linux system call is a cross-architecture interface to performance events, covering both hardware performance counters from the CPU and software events from the kernel.

Here's an example adapted from the perf_event_open man page:

perf_event_open.c

#define _GNU_SOURCE
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#include <inttypes.h>
#include <sys/types.h>

static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    long ret;

    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                    group_fd, flags);
    return ret;
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;

    uint64_t n;
    if (argc > 1) {
        n = strtoull(argv[1], NULL, 0);
    } else {
        n = 10000;
    }

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    // Don't count hypervisor events.
    pe.exclude_hv = 1;

    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        /* perror adds the errno text, e.g. when access is not permitted. */
        perror("perf_event_open");
        fprintf(stderr, "Error opening leader %llx\n", pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

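    /* Loop n times: two instructions (sub, jne) per iteration.
     * volatile keeps the compiler from dropping the loop when optimizing. */
    __asm__ __volatile__ (
        "1:;\n"
        "sub $1, %[n];\n"
        "jne 1b;\n"
        : [n] "+r" (n)
        :
        : "cc"
    );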

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        perror("read");
        exit(EXIT_FAILURE);
    }

    printf("Used %lld instructions\n", count);

    close(fd);
}

Compile and run:

g++ -ggdb3 -O0 -std=c++11 -Wall -Wextra -pedantic -o perf_event_open.out perf_event_open.c
./perf_event_open.out

Output:

Used 20016 instructions

So the result is pretty close to the expected value of 20000: 10,000 iterations times two instructions per iteration of the __asm__ block (sub, jne).

If I vary the argument, even to low values such as 100:

./perf_event_open.out 100

it gives:

Used 216 instructions

The excess stays constant at 16 instructions, so accuracy seems to be pretty high. Those 16 are presumably just the instructions that set up the ioctl calls around our little loop.
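
The arithmetic behind the two measurements above can be written out as a small helper. This is only a sketch: the names loop_instructions and counter_overhead are made up for illustration, and the figure of two instructions per iteration applies only to the sub/jne loop in this example.

```c
#include <stdint.h>

/* Instructions retired by the test loop: one sub and one jne per iteration. */
uint64_t loop_instructions(uint64_t n)
{
    return 2 * n;
}

/* Instructions counted beyond the loop itself, i.e. the measurement overhead. */
int64_t counter_overhead(uint64_t measured, uint64_t n)
{
    return (int64_t)(measured - loop_instructions(n));
}
```

With the numbers from the two runs, counter_overhead(20016, 10000) and counter_overhead(216, 100) both evaluate to 16.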

Now you might also be interested in:

prevent reordering of the syscalls: Enforcing statement order in C++
prevent the test loop from being optimized out: How to prevent GCC from optimizing out a busy wait loop?

Other events of interest that can be measured by this system call:

cycle counts: How to get the CPU cycle count in x86_64 from C++?
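
Switching to another hardware event should only require a different config value. A minimal sketch, assuming a hypothetical helper make_hw_counter_attr (not part of any API) that fills the attribute struct the same way as the example above:

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: build the attribute struct for one user-space-only
 * hardware counter, mirroring the settings used in the example above. */
struct perf_event_attr make_hw_counter_attr(uint64_t config)
{
    struct perf_event_attr pe;

    memset(&pe, 0, sizeof(pe));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(pe);
    pe.config = config;       /* e.g. PERF_COUNT_HW_CPU_CYCLES */
    pe.disabled = 1;          /* start stopped; enable later with ioctl */
    pe.exclude_kernel = 1;    /* count user space only */
    pe.exclude_hv = 1;        /* don't count hypervisor events */
    return pe;
}
```

Passing PERF_COUNT_HW_CPU_CYCLES instead of PERF_COUNT_HW_INSTRUCTIONS, and handing the result to perf_event_open as before, would then count cycles rather than instructions.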

Tested on Ubuntu 20.04 amd64, GCC 9.3.0, Linux kernel 5.4.0, Intel Core i7-7820HQ CPU.

Solution 2

You can easily count the number of executed instructions using the hardware performance counters (HPC). To access them you need an interface, and I recommend PAPI, the Performance API.

Solution 3

Intel Pin's `inscount`

You can use Intel's binary instrumentation tool Pin. I would avoid a simulator, since they are often extremely slow. Pin does most of what a simulator can do without recompiling the binary, and at close to normal execution speed (depending on the Pin tool you use).

To count the number of instructions with Pin:

Download the latest (or 3.10 if this answer gets old) pin kit from here.
Extract everything and go to the directory: cd pin-root/source/tools/ManualExample/
Make all the tools in the directory: make all
Run the tool called inscount0.so using the command: ../../../pin -t obj-intel64/inscount0.so -- your-binary-here
Read the instruction count from the file inscount.out: cat inscount.out

The output would be something like:

➜ ../../../pin -t obj-intel64/inscount0.so -- /bin/ls
buffer_linux.cpp       itrace.cpp
buffer_windows.cpp     little_malloc.c
countreps.cpp          makefile
detach.cpp         makefile.rules
divide_by_zero_unix.c  malloc_mt.cpp
isampling.cpp          w_malloctrace.cpp
➜ cat inscount.out
Count 716372

Solution 4

Although it may not be "quick" depending on the program, this may already have been answered in this question. There, Mark Plotnick suggests using gdb to single-step the program and count the steps until the program counter reaches a stopping address:

# instructioncount.gdb
set pagination off
# Start the program and stop at main so that $pc is valid.
break main
run
set $count=0
while ($pc != 0xyourstoppingaddress)
    stepi
    set $count++
end
print $count
quit

Then, start gdb on your program:

gdb --batch --command instructioncount.gdb --args ./yourexecutable with its arguments

To get the end address 0xyourstoppingaddress, you can use the following script:

# stopaddress.gdb
break main
run
info frame
quit

which puts a breakpoint on the function main, and gives:

$ gdb --batch --command stopaddress.gdb --args ./yourexecutable with its arguments
...
Stack level 0, frame at 0x7fffffffdf70:
 rip = 0x40089d in main (main_aes.c:33); saved rip 0x7ffff7a66d20
 source language c.
 Arglist at 0x7fffffffdf60, args: argc=3, argv=0x7fffffffe048
...

The important part here is saved rip 0x7ffff7a66d20. On x86-64, rip is the instruction pointer, and the saved rip is the "return address", as stated by pepero in this answer.

So in this case, the stopping address is 0x7ffff7a66d20, the return address of the main function, which is reached once main has finished executing.
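
If you want to script the step of picking out that address, a small parser is enough. This is a rough sketch that assumes the "saved rip 0x..." wording shown in the info frame output above; parse_saved_rip is a hypothetical helper name, not something gdb provides.

```c
#include <stdlib.h>
#include <string.h>

/* Extract the address that follows "saved rip " in one line of
 * `info frame` output. Returns 0 on success, -1 if the line has no such field. */
int parse_saved_rip(const char *line, unsigned long long *addr)
{
    static const char key[] = "saved rip ";
    const char *p = strstr(line, key);
    char *end;

    if (p == NULL)
        return -1;
    p += sizeof(key) - 1;            /* skip past the key itself */
    *addr = strtoull(p, &end, 16);   /* base 16 accepts the 0x prefix */
    return end == p ? -1 : 0;
}
```

Fed the rip line from the output above, it stores 0x7ffff7a66d20 in addr, which can then be substituted for 0xyourstoppingaddress in instructioncount.gdb.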

Solution 5

Probably a duplicate of this question

I say probably because you asked for the assembler instructions, but that question handles the C-level profiling of code.

My question to you would be, however: why would you want to profile the actual machine instructions executed? As a very first issue, this would differ between various compilers, and their optimization settings. As a more practical issue, what could you actually DO with that information? If you are in the process of searching for/optimizing bottlenecks, the code profiler is what you are looking for.

I might miss something important here, though.

Author by

Jean

Updated on July 25, 2022

Comments

Jean almost 2 years

Is there an easy way to quickly count the number of instructions executed (x86 instructions: which ones, and how many of each) while executing a C program?

I use gcc version 4.7.1 (GCC) on an x86_64 GNU/Linux machine.
- TJD over 11 years
  
  I agree with Doness' answer that typically people want to profile execution time per function. However, if you really want to get exact counts of each instruction executed, then you need to run your code on an instruction set simulator, such as simplescalar.com
- newpxsn over 11 years
  
  Can you elaborate on what you are trying to accomplish? On x86, instruction execution performance depends far, far more on context than it does on the actual instruction -- virtually all instructions can optionally be loads or stores, for example. And purely register-to-register instructions are going to depend in complex ways on the pipeline state on modern CPUs. This doesn't sound like useful information to me.
- Basile Starynkevitch over 11 years
  
  Why do you ask? Usually profiling means something different... Eg compile with gcc -pg -Wall -O and use gprof or perhaps oprofile !!
- Jean over 11 years
  
  I am implementing a complex mathematical algorithm and I wanted to count the number of multiplications (and divisions) which happen during its execution. I was looking for an easy way other than looking at the high level code and inferring the numbers. Maybe I should use a custom multiply function and insert a counter in it.
- Basile Starynkevitch over 11 years
  
  Memory accesses, notably with cache misses, cost much more than divisions. Arithmetic is essentially free on recent processors; what matters is memory accesses and cache misses. When the processor gets a cache miss and has to fetch data from your RAM modules, it loses many hundreds of clock cycles (enough to compute dozens of divisions with register operands).
- Jean over 11 years
  
  I agree,but this application is finally going to be run on a custom hardware with zero wait memory where 32bit/64bit multiplication/division is going to be costly. I wanted to get an estimate of math overhead involved before hand during the prototyping. Math operations are essentially going to remain same during porting to the real platform.
- newpxsn over 11 years
  
  I'm not sure I believe "zero wait memory", even L1 cache on modern CPUs is 4 cycles! But regardless: look into tricks like building your app in C++ using a custom operator*() implementation. Note that on modern compilers even "multiplication" may not be implemented in an easy to detect way (consider the classic tricks played with the LEA instruction).
- Peter Cordes over 5 years
  
  Related How do I determine the number of x86 instructions executed in a C program?
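
The idea Jean mentions in the comments above, routing arithmetic through wrapper functions that bump a counter, could look roughly like this. It is only a sketch: counted_mul, counted_div, mean_of_squares and the two counters are hypothetical names, and it counts source-level operations rather than the machine instructions the compiler actually emits.

```c
#include <stdint.h>

/* Hypothetical global tallies of source-level operations. */
uint64_t mul_count = 0;
uint64_t div_count = 0;

/* Multiply and record that one multiplication happened. */
int64_t counted_mul(int64_t a, int64_t b)
{
    mul_count++;
    return a * b;
}

/* Divide and record that one division happened (caller ensures b != 0). */
int64_t counted_div(int64_t a, int64_t b)
{
    div_count++;
    return a / b;
}

/* Example workload: integer mean of the squares 1..n, using the wrappers. */
int64_t mean_of_squares(int64_t n)
{
    int64_t sum = 0;

    for (int64_t i = 1; i <= n; i++)
        sum += counted_mul(i, i);
    return counted_div(sum, n);
}
```

After calling mean_of_squares(10), mul_count is 10 and div_count is 1. Because the counts come from the source rather than from the CPU, they stay the same when the code is ported to other hardware, which is what the prototype estimate needs.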
mpen over 8 years

Number of CPU instructions executed would be an easy way to compare algorithms without worrying about hiccups or competing for resources with other programs, independently of processing power although still dependent on instruction set.
Paul R over 7 years

@mpen: not necessarily, e.g. if you have one algorithm which uses large lookup tables, and another which does the same thing using a more computational approach, then the first may have a lot more load instructions, each of which could potentially stall for > 100 cycles due to cache misses. Similarly you might have one algorithm which uses a lot of expensive instructions, e.g. FSQRT, and another algorithm which avoids such expensive instructions and maybe uses a few more adds/multiplies. The second may well be faster even though it executes more instructions.
user2316602 over 5 years

Could you expand the answer? While a good pointer, for someone who does not know these technologies, it is difficult to know what exactly it is.
husin alhaj ahmade over 5 years

@user2316602, today's processors are equipped with special registers called hardware performance counters, also known as the hardware performance monitoring unit. These registers can be configured to count micro-architectural events such as cache misses, the number of store and load instructions, and the number of executed instructions, also called retired instructions. Some operating systems provide an interface to access these counters directly. I have run many experiments that access and use these counters. The best way is to use the PAPI infrastructure.
Alex Spurling about 3 years

When I run this I get: "Error opening leader 1". Does this require root privilege? I checked the documentation for perf_event_open and this doesn't seem to be the case but I might be missing something.
Ciro Santilli OurBigBook.com about 3 years

@AlexSpurling I have just re-run on Ubuntu 20.10 + same hardware as mentioned in the answer now and it worked without sudo. Therefore, either you're missing some kernel config, or there's some hardware support issue. What's your distro + exact CPU model? Dedicated discussion at: stackoverflow.com/questions/38442839/…
Adham Zahran about 2 years

you did not answer the question