Global variables' performance effect (C, C++)

10,888

Solution 1

It really depends on your compiler, platform, and other details. However, I can describe one scenario where global variables are faster.

In many cases, a global variable is at a fixed offset. This allows the generated instructions to simply use that address directly. (Something along the lines of MOV AX,[MyVar].)

However, if you have a variable that's relative to the current stack pointer, or a member of a class or array, some math is required to combine the base address (the stack pointer, object, or array) with the offset and determine the address of the actual variable.

Obviously, if you need to place some sort of mutex on your global variable in order to keep it thread-safe, then you'll almost certainly more than lose any performance gain.

Solution 2

Creating local variables is essentially free if they are POD (plain old data) types. More likely, you are overflowing a cache line with too many stack variables, or hitting a similar alignment-related effect that is very specific to your piece of code. In my experience, non-local variables usually decrease performance significantly.

Solution 3

It's hard to beat static allocation for speed, and while the 10% is a pretty small difference, it could be due to address calculation.

But if you're looking for speed, your example in a comment while(p<end)stats[*p++]++; is an obvious candidate for unrolling, such as:

static int stats[M];
static int index_array[N];
int *p = index_array, *pend = p+N;
// ... initialize the arrays ...
while (p <= pend - 8) {
  stats[p[0]]++;
  stats[p[1]]++;
  stats[p[2]]++;
  stats[p[3]]++;
  stats[p[4]]++;
  stats[p[5]]++;
  stats[p[6]]++;
  stats[p[7]]++;
  p += 8;
}
while(p<pend) stats[*p++]++;

Don't count on the compiler to do it for you. It might or might not be able to figure it out.

Other possible optimizations come to mind, but they depend on what you're actually trying to do.
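For a byte-frequency count like the one in the comments, one such optimization worth measuring (a common technique, not proposed in the answers above) is to keep several partial histograms so that back-to-back increments of the same byte value don't serialize on a single counter, and merge them at the end:

```c
#include <stddef.h>

/* Count byte values using four partial tables, merged at the end.
 * This avoids consecutive read-modify-write dependencies on the
 * same counter when the input repeats a byte value. */
void histogram4(const unsigned char *p, size_t n, int stats[256])
{
    int h0[256] = {0}, h1[256] = {0}, h2[256] = {0}, h3[256] = {0};
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        h0[p[i + 0]]++;
        h1[p[i + 1]]++;
        h2[p[i + 2]]++;
        h3[p[i + 3]]++;
    }
    for (; i < n; i++)         /* leftover bytes */
        h0[p[i]]++;

    for (int v = 0; v < 256; v++)
        stats[v] = h0[v] + h1[v] + h2[v] + h3[v];
}
```

Whether this beats the single-table version depends on the input's value distribution, so it needs the same careful benchmarking as everything else here.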

Solution 4

If you have something like

int stats[256]; while (p<end) stats[*p++]++;

versus

static int stats[256]; while (p<end) stats[*p++]++;

you are not really comparing the same thing, because in the first case you never initialize your array. Written out explicitly, the second line is equivalent to

static int stats[256] = { 0 }; while (p<end) stats[*p++]++;

So for a fair comparison, the first version should read

 int stats[256] = { 0 }; while (p<end) stats[*p++]++;

Your compiler might be able to deduce much more if it sees the variables in a known state.

Now then, there could still be a runtime advantage in the static case, since the initialization is done at compile time (or at program startup) rather than on every call.

To test whether this accounts for your difference, run the same function, with the static declaration and with the loop, many times, and see whether the difference vanishes as the number of invocations grows.

But as others have already said, the best approach is to inspect the assembly your compiler produces and see what the effective differences are in the generated code.

Author: Cyan — interested in compression algorithms
Updated on June 03, 2022

Comments

  • Cyan
    Cyan about 2 years

    I'm currently developing a very fast algorithm, one part of which is an extremely fast scanner and statistics function. In this quest, I'm after any performance benefit. Therefore, I'm also interested in keeping the code "multi-thread" friendly.

    Now for the question: I've noticed that putting some very frequently accessed variables and arrays into "global" or "static local" storage (which amounts to the same thing) gives a measurable performance benefit (in the range of +10%). I'm trying to understand why, and to find a solution, since I would prefer to avoid these types of allocation. Note that I don't think the difference comes from "allocation", since allocating a few variables and a small array on the stack is almost instantaneous. I believe the difference comes from "accessing" and "modifying" data.

    In this search, I've found this old post from Stack Overflow: C++ performance of global variables

    But I'm very disappointed by the answers there. Very little explanation, mostly ranting about "you should not do that" (hey, that's not the question!) and very rough statements like "it doesn't affect performance", which is obviously incorrect, since I'm measuring it with precise benchmark tools.

    As said above, I'm looking for an explanation and, if it exists, a solution to this issue. So far, I have the feeling that calculating the memory address of a local (dynamic) variable costs a bit more than that of a global (or local static) one. Maybe something like one extra ADD operation. But that doesn't help in finding a solution...

  • Cyan
    Cyan over 13 years
    I fully agree with your statement: a mutex just slows everything down. I even tried allocating a few tables and selecting a "free" one on entering the function; it works, but has the same performance penalty as local variables, so it is useless complexity.
  • Cyan
    Cyan over 13 years
    Your statement is correct in most "normal" circumstances, but here I'm benchmarking differences in favor of non-local variables. So it is not a matter of "if it happens", because it does. The question is "why".
  • Steve Jessop
    Steve Jessop over 13 years
    "some math is required" - generally the difference would be that for the global or local static, the address could be a constant fixup. For an automatic variable the address would be a constant offset applied to the current stack pointer. Depending on the addressing modes offered by the CPU, this may or may not actually affect performance.
  • Jonathan Wood
    Jonathan Wood over 13 years
    @Steve: Yes, that's why I said it depends on the compiler and platform. Note that I was thinking in terms of non-static local variables. I would assume most static local variables would be stored similar to how global variables are.
  • Steve Jessop
    Steve Jessop over 13 years
    @Jonathan: sorry, wasn't trying to disagree, just expand. Also, if the optimizer notices that it's doing sp+187 a lot, and this is likely to cost time, then it can cache the value of sp+187 in a register, or move the sp for a portion of the routine if the OS doesn't mind. These are the sorts of thing the questioner will need to check before concluding that your example applies in this case.
  • Jonathan Wood
    Jonathan Wood over 13 years
    @Steve: Yes, for this level of optimization, a dump of the assembly language being produced by the compiler should be in order.
  • Puppy
    Puppy over 13 years
    @Cyan: I believe that I may have suggested an answer, such as cache line overflowing. Please do read the whole answer.
  • ShinjiOno
    ShinjiOno over 13 years
    @Jens Gustedt: Plain Old Data, such as ints, etc.
  • Nils Pipenbrinck
    Nils Pipenbrinck over 13 years
    There is no math required. x86 can do mov reg, [esp +/- offset]. It can even do indirect addressing via mov reg, [esp + offset + 4*reg], etc.
  • Jonathan Wood
    Jonathan Wood over 13 years
    @Nils: esp + offset + 4 * reg looks like math to me. Are you suggesting there is no case where a fixed address would be faster?
  • Bo Persson
    Bo Persson over 13 years
    It really does depend a lot on the hardware. For example, on x86_64 the global address will add 8 bytes to the instruction size; the offset from a register will be smaller. I have seen cases where accessing members of a C++ class is faster than global variables, just because the address is loaded only once.
  • Cyan
    Cyan over 13 years
    OK, both versions are initialized on entering the function. I have not transcribed this part in the example, but it does start with a simple: " for (i=0; i<256; i++) stats[i]=0; " This is the same for both, so there is no "initialization" difference.
  • Cyan
    Cyan over 13 years
    Thanks Mike, good advice. In fact I'm already doing unrolling. That's why I said this was just an "example": the full code is a bit more complex and wouldn't add anything useful to this discussion.
  • Cyan
    Cyan over 13 years
    OK, I'm using the x86_32 MS Visual C++ compiler. I do understand that the effect mentioned may be different in another environment, such as x86_64.
  • Cyan
    Cyan over 13 years
    Don't feel offended! Yes, I've read your answer, and I already agree that no difference comes from allocation. Your comment about cache lines is not so clear to me, though. The amount of data created (~3K) on the stack is well below the L1 cache size (~32K), so it should fit entirely in L1. I don't know how to check whether it is cache-line aligned (i.e. an address that is a multiple of 64), nor whether that would make any difference. In fact, I've made a few tests on this (by making one of the tables a little bigger or smaller), and it does not seem to help.
  • Jonathan Wood
    Jonathan Wood over 13 years
    @Cyan: Do you not think this effect is the answer to your question?
  • Cyan
    Cyan over 13 years
    @Jonathan: yes I do. Your answer is the best candidate so far. Steve's comments are also very informative. However, it only provides good insight into the "why"; so far, no workaround has been proposed...
  • Jonathan Wood
    Jonathan Wood over 13 years
    @Cyan: Feel free to mark as the answer whatever answer you think is best. However, I'm not sure there is a workaround: some ways of addressing memory are simply more efficient than others. If you really want to narrow this down, export an assembly listing from your compiler and see exactly what is happening. Then see if you can tweak it a bit. But the relative speed of addressing modes cannot be worked around.
  • Mike Dunlavey
    Mike Dunlavey over 13 years
    @Cyan: The kinds of questions that come to mind are: Why the indirection array? How often does the setup change? Could this be a candidate for precompiling? Stuff like that.
  • Cyan
    Cyan over 13 years
    @Mike: Interesting points. What do you call an "indirection array"? Pre-compiling is a good optimization route. That's why I like the idea of templates, which are one way to pre-compile one of the arguments (as long as it has a finite number of values).
  • Mike Dunlavey
    Mike Dunlavey over 13 years
    @Cyan: Well in this case you're indexing stats by first indexing another array index_array and you're looping through that. That implies you want to only increment a subset of stats (because obviously the order doesn't matter). If setting up index_array is only done at low frequency, you might also do it by simply printing out a function to increment the desired members of stats, compile & link a dll on the fly, dynamically load it, and do your incrementing really fast. That's what I mean about what are you really trying to do?
  • Cyan
    Cyan over 13 years
    OK — index_array is specific to your example and does not reflect the code I'm working on. In my case, we start with a buffer, defined as a char* and a length. Then we go through the buffer, counting the char values in it; the values are therefore between 0 and 255.
  • Mike Dunlavey
    Mike Dunlavey over 13 years
    @Cyan: Sounds like you've got it about right. Now, I believe in a particularly aggressive form of performance tuning. Here's an example. Basically it involves an iteration of tuning steps of a kind that some people find natural, but most people find strange. Anyway, that's what I would do, in your shoes. You may already be close to optimal, but that would ensure it.
  • Cyan
    Cyan over 13 years
    That's very interesting reading, Mike. And yes, I'm after such gains. I can be very thorough in my testing and loop many times to improve on each iteration. This sometimes ends up in stories such as yours, which make very pleasant reading. Best regards.
  • Admin
    Admin over 2 years
    Vars are heavy, please don't joke like that