Coding Practices which enable the compiler/optimizer to make a faster program


Solution 1

Write to local variables and not output arguments! This can be a huge help for getting around aliasing slowdowns. For example, if your code looks like

void DoSomething(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
    for (int i=0; i<numFoo; i++)
    {
         barOut.munge(foo1, foo2[i]);
    }
}

the compiler doesn't know that foo1 != barOut, and thus has to reload foo1 each time through the loop. It also can't read foo2[i] until the write to barOut is finished. You could start messing around with restricted pointers, but it's just as effective (and much clearer) to do this:

void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
    Foo barTemp = barOut;
    for (int i=0; i<numFoo; i++)
    {
         barTemp.munge(foo1, foo2[i]);
    }
    barOut = barTemp;
}

It sounds silly, but the compiler can be much smarter dealing with the local variable, since it can't possibly overlap in memory with any of the arguments. This can help you avoid the dreaded load-hit-store (mentioned by Francis Boivin in this thread).

Solution 2

Here's a coding practice to help the compiler create fast code—any language, any platform, any compiler, any problem:

Do not use any clever tricks which force, or even encourage, the compiler to lay variables out in memory (including cache and registers) as you think best. First write a program which is correct and maintainable.

Next, profile your code.

Then, and only then, you might want to start investigating the effects of telling the compiler how to use memory. Make one change at a time and measure its impact.

Expect to be disappointed and to have to work very hard indeed for small performance improvements. Modern compilers for mature languages such as Fortran and C are very, very good. If you read an account of a 'trick' to get better performance out of code, bear in mind that the compiler writers have also read about it and, if it is worth doing, probably implemented it. They probably wrote what you read in the first place.

Solution 3

The order you traverse memory can have a profound impact on performance, and compilers aren't really good at figuring that out and fixing it. You have to be conscious of cache locality when you write code if you care about performance. For example, two-dimensional arrays in C are allocated in row-major order. Traversing such an array in column-major order tends to cause more cache misses, making your program memory bound rather than processor bound:

#define N 1000
int matrix[N][N] = { ... };

//awesomely fast
long sum = 0;
for(int i = 0; i < N; i++){
  for(int j = 0; j < N; j++){
    sum += matrix[i][j];
  }
}

//painfully slow
long sum = 0;
for(int i = 0; i < N; i++){
  for(int j = 0; j < N; j++){
    sum += matrix[j][i];
  }
}

Solution 4

Generic Optimizations

Here are some of my favorite optimizations. I have actually reduced execution times and program sizes by using these.

Declare small functions as inline or macros

Each call to a function (or method) incurs overhead, such as pushing variables onto the stack. Some functions incur overhead on return as well. A function or method whose body contains fewer statements than that combined overhead is a good candidate for inlining, whether as a #define macro or an inline function. (Yes, I know inline is only a suggestion, but in this case I consider it a reminder to the compiler.)

Remove dead and redundant code

If the code isn't used or does not contribute to the program's result, get rid of it.

Simplify design of algorithms

I once removed a lot of assembly code and execution time from a program by writing down the algebraic equation it was calculating and then simplified the algebraic expression. The implementation of the simplified algebraic expression took up less room and time than the original function.

Loop Unrolling

Each loop has an overhead of incrementing and termination checking. To get an estimate of the performance factor, count the number of instructions in the overhead (minimum 3: increment, check, goto start of loop) and divide by the number of statements inside the loop. The lower the number the better.

An example of loop unrolling. Before:

unsigned int sum = 0;
for (size_t i = 0; i < BYTES_TO_CHECKSUM; ++i)
{
    sum += *buffer++;
}

After unrolling:

unsigned int sum = 0;
size_t i = 0;
const size_t STATEMENTS_PER_LOOP = 8;
for (i = 0; i + STATEMENTS_PER_LOOP <= BYTES_TO_CHECKSUM; i += STATEMENTS_PER_LOOP)
{
    sum += *buffer++; // 1
    sum += *buffer++; // 2
    sum += *buffer++; // 3
    sum += *buffer++; // 4
    sum += *buffer++; // 5
    sum += *buffer++; // 6
    sum += *buffer++; // 7
    sum += *buffer++; // 8
}
// Handle the remainder:
for (; i < BYTES_TO_CHECKSUM; ++i)
{
    sum += *buffer++;
}

A secondary benefit is gained as well: more statements are executed before the processor has to reload the instruction cache.

I've had amazing results when I unrolled a loop to 32 statements. This was one of the bottlenecks since the program had to calculate a checksum on a 2GB file. This optimization combined with block reading improved performance from 1 hour to 5 minutes. Loop unrolling provided excellent performance in assembly language too, my memcpy was a lot faster than the compiler's memcpy. -- T.M.

Reduction of if statements

Processors hate branches, or jumps, since a mispredicted branch forces the processor to flush and refill its pipeline of instructions.

Boolean Arithmetic

Convert if statements into boolean assignments. Some processors can conditionally execute instructions without branching:

bool status = true;
status = status && /* first test */;
status = status && /* second test */;

The short circuiting of the Logical AND operator (&&) prevents execution of the tests if the status is false.

Example:

struct Writer_Interface
{
  virtual bool  write(unsigned int value) = 0;
};

struct Rectangle
{
  unsigned int origin_x;
  unsigned int origin_y;
  unsigned int height;
  unsigned int width;

  bool  write(Writer_Interface * p_writer)
  {
    bool status = false;
    if (p_writer)
    {
       status = p_writer->write(origin_x);
       status = status && p_writer->write(origin_y);
       status = status && p_writer->write(height);
       status = status && p_writer->write(width);
    }
    return status;
  }
};

Factor Variable Allocation outside of loops

If a variable is created on the fly inside a loop, move the creation / allocation to before the loop. In most instances, the variable doesn't need to be allocated during each iteration.

Factor constant expressions outside of loops

If a calculation or variable value does not depend on the loop index, move it outside (before) the loop.

I/O in blocks

Read and write data in large chunks (blocks). The bigger the better. For example, reading one octet at a time is less efficient than reading 1024 octets with one read.
Example:

static const char  Menu_Text[] = "\n"
    "1) Print\n"
    "2) Insert new customer\n"
    "3) Destroy\n"
    "4) Launch Nasal Demons\n"
    "Enter selection:  ";
static const size_t Menu_Text_Length = sizeof(Menu_Text) - sizeof('\0');
//...
std::cout.write(Menu_Text, Menu_Text_Length);

The efficiency of this technique can be visually demonstrated. :-)

Don't use printf family for constant data

Constant data can be output using a block write. Formatted write will waste time scanning the text for formatting characters or processing formatting commands. See above code example.

Format to memory, then write

Format to a char array using multiple sprintf, then use fwrite. This also allows the data layout to be broken up into "constant sections" and variable sections. Think of mail-merge.

Declare constant text (string literals) as static const

When variables are declared without the static, some compilers may allocate space on the stack and copy the data from ROM. These are two unnecessary operations. This can be fixed by using the static prefix.

Lastly, Code like the compiler would

Sometimes, the compiler can optimize several small statements better than one complicated version. Also, writing code to help the compiler optimize helps too. If I want the compiler to use special block transfer instructions, I will write code that looks like it should use the special instructions.

Solution 5

The optimizer isn't really in control of the performance of your program, you are. Use appropriate algorithms and structures and profile, profile, profile.

That said, you shouldn't call, in an inner loop, a small function defined in another file, as that stops it from being inlined.

Avoid taking the address of a variable if possible. Asking for a pointer isn't "free" as it means the variable needs to be kept in memory. Even an array can be kept in registers if you avoid pointers — this is essential for vectorizing.

Which leads to the next point, read the ^#$@ manual! GCC can vectorize plain C code if you sprinkle a __restrict__ here and an __attribute__( __aligned__ ) there. If you want something very specific from the optimizer, you might have to be specific.

Author: EvilTeach

Everything I published here is under CC-BY-SA 3.0. I do not give permission for it to be relicensed.

Updated on June 07, 2022

Comments

  • EvilTeach
    EvilTeach almost 2 years

    Many years ago, C compilers were not particularly smart. As a workaround, K&R invented the register keyword to hint to the compiler that it might be a good idea to keep this variable in an internal register. They also made the ternary operator to help generate better code.

    As time passed, the compilers matured. They became very smart: their flow analysis allowed them to make better decisions about what values to hold in registers than you could possibly do. The register keyword became unimportant.

    FORTRAN can be faster than C for some sorts of operations, due to alias issues. In theory with careful coding, one can get around this restriction to enable the optimizer to generate faster code.

    What coding practices are available that may enable the compiler/optimizer to generate faster code?

    • Identifying the platform and compiler you use, would be appreciated.
    • Why does the technique seem to work?
    • Sample code is encouraged.

    Here is a related question

    [Edit] This question is not about the overall process to profile, and optimize. Assume that the program has been written correctly, compiled with full optimization, tested and put into production. There may be constructs in your code that prohibit the optimizer from doing the best job that it can. What can you do to refactor that will remove these prohibitions, and allow the optimizer to generate even faster code?

    [Edit] Offset related link

  • Vatine
    Vatine over 14 years
    At least when it comes to self-calls. Generalised tail-call optimisation is hard in, for example, C, due to the calling conventions most platforms use.
  • Toad
    Toad over 14 years
    the question is about C. C does not remove tail-recursion, so tail or other recursion, the stack might blow if the recursion goes too deep.
  • EvilTeach
    EvilTeach over 14 years
    I have avoided the calling convention issue, by using a goto. There is less overhead that way.
  • Hogan
    Hogan over 14 years
    @reinier: what do you mean... a good c compiler could optimize tail end recursion.
  • Hogan
    Hogan over 14 years
    @vatine: true I was thinking self-calls.
  • Toad
    Toad over 14 years
    @hogan: this is new to me. Could you point to any compiler which does this? And how can you be sure it actually optimizes it? If it would do this one really needs to be sure it does it. It's not something you hope the compiler optimizer picks up on (like inlining which may or may not work)
  • Toad
    Toad over 14 years
    @hogan: I stand corrected. You are right that Gcc and MSVC both do tail recursion optimization.
  • Brian Young
    Brian Young over 14 years
    This example isn't tail recursion as its not the recursive call that is last, its the multiplication.
  • Potatoswatter
    Potatoswatter over 14 years
    Not much, rarely. It does improve actual correctness, though.
  • Hogan
    Hogan over 14 years
    @reinier: For grad school I wrote a compiler that did it in a C like language (about 70% of C) it is really easy.
  • dsimcha
    dsimcha over 14 years
    In C and C++ the compiler can't use const to optimize because casting it away is well-defined behavior.
  • LiraNuna
    LiraNuna over 14 years
    That is not always the case. There are still things that the compiler has insufficient data to optimize and perform necessary modifications. Here is an example - liranuna.com/sse-intrinsics-optimizations-in-popular-compilers
  • oz10
    oz10 over 14 years
    Putting functions that are used together in close physical proximity in source increases the likelihood that they will be near each other in object files and near each other in your executable. This improved locality of instructions can help avoid instruction cache misses while running.
  • EvilTeach
    EvilTeach over 14 years
    The AIX compiler has a compiler switch to encourage that behavior -qipa[=<suboptions_list>] | -qnoipa Turns on or customizes a class of optimizations known as interprocedural analysis (IPA).
  • Dave Jarvis
    Dave Jarvis over 14 years
    Compiler developers have finite time, just like everyone else. Not all optimizations will make it into the compiler. Like & vs. % for powers of two (seldom, if ever, optimized, but can have significant performance impacts). If you read a trick for performance, the only way to know if it works is to make the change and measure the impact. Never assume the compiler will optimize something for you.
  • Potatoswatter
    Potatoswatter over 14 years
    & and % is pretty much always optimized, along with most other cheap-as-free arithmetic tricks. What doesn't get optimized is the case of the right-hand operand being a variable that just happens always to be a power of two.
  • Hogan
    Hogan over 14 years
    @Roger+@Brian: My compiler was able to optimize the code as I coded it. This was back in 95, I'm guessing there is a modern language that does tail-end optimization requiring code like @Roger wrote so this is why you are all making crazy claims that the original code could not be optimized.
  • EvilTeach
    EvilTeach over 14 years
    Strictly speaking this is not an optimizer issue, but is an optimization issue.
  • High Performance Mark
    High Performance Mark over 14 years
    To clarify, I seem to have confused some readers: the advice in the coding practice I propose is to first develop a straightforward code which does not make use of memory-layout instructions to establish a baseline of performance. Then, try things one at a time and measure their impact. I have not offered any advice on the performance of operations.
  • Hogan
    Hogan over 14 years
    +1 : const is a good example of something that will directly impact the compiled code. re @dsimcha's comment -- a good compiler will test to see if this happens. Of course, a good compiler will "find" const elements that are not declared that way anyway...
  • Potatoswatter
    Potatoswatter over 14 years
    Yep, it's easy to shoot yourself in the foot with fine-grained memory layout, and never know it because the simple layout is too different. What do you mean by "memory layout instructions", vector pack/unpack?
  • Hogan
    Hogan over 14 years
    -1 Question Nullification. The whole point is to ask people on SO who have that experience. Many of us write compilers + optimizers.
  • Hogan
    Hogan over 14 years
    Best is to have a way to develop that does not require this. Using this fact as an excuse to write un-modular code will overall just result in code that is slow and has maintenance problems.
  • Mike DeSimone
    Mike DeSimone over 14 years
    The last time I laid stuff out in memory to optimize cache performance was for the Intel i860. The only reason I did that was because the OS docs said, "you will do this and you will use this buffer to do it or all our custom optimizations will fail." I hate that architecture to this day.
  • Porculus
    Porculus over 14 years
    For constant power-of-two n, gcc replaces % n with & (n-1) even when optimisation is disabled. That's not exactly "seldom, if ever" ...
  • user168715
    user168715 over 14 years
    I'm not sure this is sound advice, at least as currently written. How you lay out your data in memory, at least in the big picture --- such as whether you allocate on the stack or the heap, whether the memory being read inside your inner loop is spatially localized or not, etc --- is very much the programmer's responsibility, and has a huge effect on performance. If you do these things wrong no C++ compiler I know of can fix it for you.
  • Phil Miller
    Phil Miller over 14 years
    This is a good answer, but note that whole-program optimization is becoming more popular, and can in fact inline functions across translation units.
  • Phil Miller
    Phil Miller over 14 years
    Sure it's an optimizer issue. People have been writing papers about automatic loop interchange optimization for decades.
  • Aaronaught
    Aaronaught over 14 years
    @Hogan: Call a spade a spade, this is premature optimization at its finest. Absolutely none of the upvoted answers here actually answer the question directly; they all talk about things that the optimizer doesn't optimize, not how make things "easier" for the optimizer.
  • celion
    celion over 14 years
    +1 for load-hit-stores and different register types. I'm not sure how big of a deal it is on x86, but they're devastating on PowerPC (e.g. Xbox360 and Playstation3).
  • Hogan
    Hogan over 14 years
    @Aaronaught: I think my answer does address that. In theory any recursive routine can be made flat and thus optimized. It is easy for compilers to optimize true or close-to-true tail end recursion. This might change how you write code to help the optimizer.
  • Aaronaught
    Aaronaught over 14 years
    @Hogan, I frankly do not understand your fascination with tail recursion, and simply saying that compilers can optimize it isn't actually an answer to this question. Compilers can inline functions too and do a lot of other things; so what? Rarely will that affect how you actually need to write the code, and almost never will it translate into a real-world performance improvement. Why rely on the compiler to transform your tail recursion into a loop when some compilers might not do it? Just write a loop in the first place!
  • Michael Burr
    Michael Burr over 14 years
    This has the added benefit of often making things easier to read/understand for programmers as well, since they don't have to worry about possible non-obvious side-effects either.
  • Phil Miller
    Phil Miller over 14 years
    Most papers on compiler loop optimization techniques assume perfect nesting, which means that the body of each loop except the innermost is just another loop. These papers simply do not discuss the steps necessary to generalize such, even if it's very clear that they can be. Thus, I'd expect a lot of implementations to not actually support those generalizations, because of the extra effort entailed. Thus, the many algorithms for optimizing cache usage in loops might work a lot better on perfect nests than on imperfect nests.
  • Hogan
    Hogan over 14 years
    @Aaronaught - I don't understand it either. Mostly it is the only answer to this question I could think of -- I know lots of ways to speed up / optimize programs, but making your code TER friendly is the only one I can think of that addresses the question, as you pointed out.
  • ephemient
    ephemient over 14 years
    @Potatoswatter What are you talking about? The C compiler can do whatever it wants to as long as the same final result is observed, and indeed GCC 4.4 has -floop-interchange which will flip an inner and outer loop if the optimizer deems it profitable.
  • Potatoswatter
    Potatoswatter over 14 years
    Huh, well there you go. C semantics are often marred by aliasing issues. I guess the real advice here is to pass that flag!
  • EvilTeach
    EvilTeach over 14 years
    You are getting close to answering the question..... what sorts of things do you do to the code, to make it possible for the compiler to do those sorts of optimizations?
  • EvilTeach
    EvilTeach over 14 years
    Interesting can you provide an example where you got better code with a few small statements, instead of a larger one. Can you show an example of rewriting a if, using booleans. Generally, I would leave the loop unrolling to the compiler, as it probably has a better feel for cache size. I'm a bit surprised about the idea of sprintfing, then fwriting. I would think that fprintf actually does that under the hood. Can you give a bit more detail here?
  • EvilTeach
    EvilTeach over 14 years
    @Aaronaught Hogan directly addressed the question. Some progammers will read his answer, and ask "What is tail recursion", read up on it, and maybe learn something. It is a win win answer.
  • Admin
    Admin over 14 years
    @Hogan: comments were just saying it wasn't tail recursion (which it wasn't), not that it couldn't be optimized. Since that detracted from your point (even if not implemented by all compilers), I changed it; that's all.
  • Wyzard
    Wyzard over 14 years
    I believe a typical implementation of std::vector doesn't actually construct more objects when you reserve() more capacity. It just allocates pages. The constructors are called later, using placement new, when you actually add objects to the vector -- which is (presumably) just before you call init(), so you don't really need the separate init() function. Also remember that even if your constructor is "empty" in the source code, the compiled constructor may contain code to initialize things like virtual tables and RTTI, so the pages get touched at construction time anyway.
  • Thomas Matthews
    Thomas Matthews over 14 years
    There is no guarantee that fprintf formats to a separate buffer then outputs the buffer. A streamlined (for use of memory) fprintf would output all unformatted text, then format and output, and repeat until the entire format string is processed, thus making 1 output call for each type of output (formatted vs. unformatted). Other implementations would need to dynamically allocate memory for each call to hold the entire new string (which is bad in embedded systems environment). My suggestion reduces the number of outputs.
  • EvilTeach
    EvilTeach over 14 years
    Yep. In our case we use push_back for populating the vector. The objects don't have any virtual functions, so it is not a problem. The first time we tried it with constructor, we were astounded by the volume of page faults. I realized what happened, and we yanked the guts of the constructor, and the page fault problem vanished.
  • jf.
    jf. over 14 years
    Trying to write more in C-style (vs. in C++), e.g. avoiding virtual functions without absolute need, especially if they are going to be called often; avoid AddRefs and all that cool stuff (again, unless really needed). Write code that is easy to inline - fewer parameters, fewer "if"s. Don't use global variables unless absolutely needed. In a data structure, put wider fields first (double, int64 go before int) - so the compiler aligns the struct on the first field's natural size - alignment is good for perf.
  • jf.
    jf. over 14 years
    Data layout and access are absolutely critical for performance. So after profiling - I sometimes break a structure into several ones following locality of accesses. One more general trick - use int or size_t vs. char even when data values are small - it avoids various perf penalties: store-to-load blocking, partial register stalls. Of course this isn't applicable when you need big arrays of such data.
  • jf.
    jf. over 14 years
    One more - avoid system calls, unless there is a real need :) - they are VERY expensive
  • jf.
    jf. over 14 years
    Enabling vectorization is specific to a loop. Often vectorization could be enabled by providing compiler a hint to ignore vector dependencies - if this is a case. More on this: intel.com/software/products/compilers/docs/clin/main_cls/…
  • EvilTeach
    EvilTeach about 14 years
    Most IDEs display local variables by default, so there is less typing
  • Nils Pipenbrinck
    Nils Pipenbrinck about 14 years
    That used to be true; nowadays it isn't anymore. In fact, exactly the opposite is true. If you declare your arrays with powers of two, you will very likely run into the situation where you work on two pointers a power of two apart in memory. The problem is that the CPU cache is organized just like that, and you may end up with the two arrays fighting over one cache line. You get horrible performance that way. Having one of the pointers a couple of bytes ahead (e.g. non power of two) prevents this situation.
  • kriss
    kriss about 14 years
    @jf: I did +1 on your answer, but please could you move the answer from comments to answer body ? It will be easier to read.
  • Ben Voigt
    Ben Voigt about 14 years
    you can also enable that optimization by using restricted pointers
  • celion
    celion about 14 years
    @Ben - that's true, but I think this way is clearer. Also, if the input and output did overlap, I believe the result is unspecified with restricted pointers (probably get different behavior between debug and release), whereas this way will at least be consistent. Don't get me wrong, I like using restrict, but I like not needing it even more.
  • David Thornley
    David Thornley about 14 years
    I once got a significant performance improvement by rolling up a loop. Then I figured out how to roll it up tighter by using some indirection, and the program got noticeably faster. (Profiling showed this particular function to be 60-80% of the runtime, and I tested performance carefully before and after.) I believe the improvement was due to better locality, but I'm not completely sure about that.
  • David Thornley
    David Thornley about 14 years
    That rather surprises me. What C++ and STL implementations were you using?
  • Tom
    Tom about 14 years
    I agree with the others, this sounds like a bad implementation of std::vector. Even if your objects did have vtables, they wouldn't be constructed until your push_back. You should be able to test this by declaring the default constructor to be private, because all vector will need is the copy-constructor for push_back.
  • Tom
    Tom about 14 years
    Converting recursion to a stack is an assumed optimization on ompf.org, for people developing raytracers and writing other rendering algorithms.
  • Tom
    Tom about 14 years
    ...I should add to this, that the biggest overhead in my personal raytracer project is vtable-based recursion through a bounding-volume hierarchy using the Composite pattern. It's really just a bunch of nested boxes structured as a tree, but using the pattern causes data bloat (virtual table pointers) and reduces instruction coherency (what could be a small/tight loop is now a chain of function calls)
  • Tom
    Tom about 14 years
    That only works if the coupling "interfaces" themselves are amenable to optimization. An interface can be inherently "slow", e.g. by forcing redundant lookups or calculations, or forcing bad cache access.
  • Tom
    Tom about 14 years
    +1 Nils, and one specific occurrence of this is "64k aliasing" on Intel hardware.
  • Tom
    Tom about 14 years
    x86-64 bit also allow a lot of register-passed parameters, which can have a dramatic effect on function call overhead.
  • Tom
    Tom about 14 years
    @Aaronaught - You sound like you're advocating ignorance of optimization techniques to avoid related bad habits. That's a terrible trend in the community that leads to retardation of new members . But then again, I suppose that does make those of us with the knowledge more valuable, so at least it'll keep my salary high.
  • Aaronaught
    Aaronaught about 14 years
    @Tom: You seem to be confusing code/design performance optimizations with compiler optimizations. The subset of programs whose performance characteristics are CPU-bound is vanishingly small; if you write I/O bound programs and waste hours making these types of micro-optimizations, you don't deserve that high salary. The only "retardation" I see here is reliance on penny-wise pound-foolish compiler tricks. I invite you to show me a single, modern, well-designed program other than an operating system kernel whose performance is significantly improved by such things.
  • Tom
    Tom about 14 years
    OK, here's an example: Programs that rely on SIMD (e.g. graphics/rendering/image processing) tend to require aligned data structures in order to avoid copying data all over the place. This is definitely in the realm of compiler optimizations, as it has to be done with compiler-specific instructions (__attribute__((aligned(x))) for GCC).
  • EvilTeach
    EvilTeach about 14 years
    @David - The implementation was on AIX.
  • Skizz
    Skizz about 14 years
    You've just got to hope that Foo doesn't have a copy operation defined that copies a couple of meg of data ;-)
  • phkahler
    phkahler about 14 years
    @Aaronaught - Almost every program I've ever written is CPU bound. I have done database work, and have an appreciation for it, but I feel sorry for anyone who has never ventured out of that place. You're suffering from "all my experience is X, therefore everyone else's must also be X" syndrome.
  • Aaronaught
    Aaronaught about 14 years
    @phkahler: You're incorrect, I've written CPU bound code as well. These kinds of cheap tricks rarely make a difference. Remember, this question is not asking about code optimization, compiler switches, that sort of thing, it's asking about how you'd rely on the implementation details of a particular optimizer in order to shave nanoseconds off the execution time.
  • celion
    celion about 14 years
    @Skizz: True, the times that I've used this trick, "Foo" in question has been a plain-old-data struct, and "munge" was a fast, inlined method. In those cases, the few extra instructions spent copying were well worth avoiding the extra loads/stores that I avoided.
  • dash-tom-bang
    dash-tom-bang about 14 years
    This is something that is easily disproved by looking at the disassembly, by the way. I was amazed, years ago, at seeing how gcc would optimize all sorts of constant multiplications with shifts and adds. E.g. val * 7 turned into what would otherwise look like (val << 3) - val.
  • dash-tom-bang
    dash-tom-bang about 14 years
    I think the simple fact is that many solutions can be made faster by telling the computer to operate in the most efficient way. A good knowledge of what the compiler will do can help that, but the compiler won't be able to optimize everything. -- More importantly, imo, is that it is easier for a compiler to optimize simple code. I suspect the downvotes come in reaction to "don't try to outsmart..." Nobody's advocating tricking it, folks are saying, "work with the compiler."
  • R.. GitHub STOP HELPING ICE
    R.. GitHub STOP HELPING ICE almost 14 years
    The original question was about C, not C++, so both constructors and references are off-topic...
  • R.. GitHub STOP HELPING ICE
    R.. GitHub STOP HELPING ICE almost 14 years
    % CANNOT be optimized as & when the type is signed, due to C's idiotic rules for negative integer division (round towards 0 and have negative remainder, rather than rounding down and always having positive remainder). And most of the time, ignorant coders use signed types...
  • Ben Voigt
    Ben Voigt over 13 years
    @R: Unless of course, the compiler can prove that the variable is positive. Which actually is pretty often.
  • Adrian McCarthy
    Adrian McCarthy almost 13 years
    Many of these are programmer optimizations rather than ways for programmers to help the compiler optimize, which was the thrust of the original question. For example, loop unrolling. Yes, you can do your own unrolling, but I think it's more interesting to figure out what roadblocks there are to the compiler unrolling for you and to remove those.
  • Dietrich Epp
    Dietrich Epp almost 13 years
    @dsimcha: Changing a const and restrict qualified pointer, however, is undefined. So a compiler could optimize differently in such a case.
  • Tom
    Tom almost 13 years
    @Ben - the compiler can't prove that the variable is positive if the function has external linkage. Which is almost all functions.
  • Ben Voigt
    Ben Voigt almost 13 years
    @Tom: Not quite. If the function is defined in the same compilation unit, the compiler is allowed to perform range analysis on it or even inline it, linkage doesn't matter. (If the linker chooses some other definition for which the analysis isn't valid, that's an ODR violation and behavior is undefined anyway).
  • bdonlan
    bdonlan over 12 years
    This is a pessimization if munge is not inlinable; see stackoverflow.com/questions/7761810/…
  • casualcoder
    casualcoder over 12 years
    This code likely does not optimize well, if munge() is not an inline function. Using a local variable seems to be a good idea for good coding practice, though. There is an example that works at stackoverflow.com/questions/7761810/…
  • celion
    celion over 12 years
    People seem to be getting hung up on Foo and munge. Let's consider this instead: void sum(const float* vals, int numVals, float& sumOut){ sumOut = 0.0f; for (int i=0; i<numVals; i++) {sumOut += vals[i]; } } Somewhat contrived, but that would definitely benefit from making the sum a local variable.
  • Gratian Lup
    Gratian Lup almost 12 years
    It's actually not necessary to use subtraction when testing ranges, LLVM, GCC and my compiler at least do this automatically. Few people would probably understand what the code with subtraction does and even fewer why it actually works.
  • Seng Cheong
    Seng Cheong over 11 years
    @Novelocrat Yep - needless to say I was very surprised the first time I saw something from A.c get inlined into B.c.
  • Stephen Lin
    Stephen Lin about 11 years
    @dsimcha casting away const on a const reference or const pointer to a non-const object is well-defined. Modifying an actual const object (i.e. one declared const originally) is not.
  • Emmet
    Emmet almost 10 years
    Neither C nor C++ make any guarantees about, or even mention, passing in particular registers or stack locations. It's the ABI (e.g. Linux ELF) that determines the details of parameter passing.
  • Ponkadoodle
    Ponkadoodle over 9 years
    I think this information is slightly dated. In theory, the whole-program-optimization features built into many compilers now (e.g. "Link-time Optimization" in gcc) allow for the same benefits, but with a totally standard workflow (plus faster recompilation times than putting it all in one file!)
  • underscore_d
    underscore_d about 8 years
    @Wallacoloo For sure, this is faaar outta date. FWIW, I just used GCC's LTO for the first time today, and - all else being equal at -O3 - it blasted 22% of the original size off my program. (It's not CPU-bound, so I haven't got much to say about speed.)
  • underscore_d
    underscore_d about 8 years
    Yes, by now, LTO has made the first half of this post redundant and probably bad advice.
  • underscore_d
    underscore_d about 8 years
    This is interesting as an anecdote, but to be clear, it was a problem caused by the implementation of std::vector on that machine - not a piece of advice that is generally applicable, and it might be harmful (e.g. if some reader naively assumes they need to use init functions everywhere ;-)
  • EvilTeach
    EvilTeach about 8 years
    Nah. Pretty much all systems use virtual memory.
  • kriss
    kriss about 8 years
    @underscore_d: there are still some issues (mostly related to visibility of exported symbols), but from a mere performance point of view there is probably no need anymore.
  • underscore_d
    underscore_d about 8 years
    Really? Can you cite a rationale and examples for this? Not saying it's untrue, just it sounds unintuitive that location would matter.
  • Mark Ransom
    Mark Ransom about 8 years
    @underscore_d it can't inline something until the function definition is known. While modern compilers might make multiple passes so that the definition is known at code generation time, I don't assume it.
  • underscore_d
    underscore_d about 8 years
    I'd assumed compilers work off abstract call graphs rather than physical function order, meaning this wouldn't matter. Sure, I suppose it doesn't hurt to be extra careful - especially when, performance aside, IMO it just seems more logical to define functions that are called before those that call them. I'd have to test performance but would be surprised if it mattered, but until then, I'm open to being surprised!
  • Flamefire
    Flamefire about 6 years
    @R.. "ignorant coders use signed types": that is not true (anymore). It is recommended to use signed types everywhere you don't do bit operations. There are multiple talks at CppCon etc., and the CppCoreGuidelines (index type being signed). The reason is that the UB on over/underflow allows plenty of optimization and reduces bugs.
  • EvilTeach
    EvilTeach over 5 years
    in the example above, b() can't be called because if (x < 0) then a() will be called.
  • EvilTeach
    EvilTeach over 5 years
    My rule of thumb is if you have to dynamically allocate, get an array so you don't need to do it again. Preallocate them vectors.
  • nategoose
    nategoose over 5 years
    @EvilTeach No it won't. The comparison that results in the call to a() is !x
  • EvilTeach
    EvilTeach over 5 years
    @nategoose. if x is -3 then !x is true.
  • nategoose
    nategoose over 5 years
    @EvilTeach In C 0 is false and everything else is true, so -3 is true, so !-3 is false