Performance optimization strategies of last resort


Solution 1

OK, you're defining the problem to the point where it would seem there is not much room for improvement. That is fairly rare, in my experience. I tried to explain this in a Dr. Dobbs article in November 1993, by starting from a conventionally well-designed non-trivial program with no obvious waste and taking it through a series of optimizations until its wall-clock time was reduced from 48 seconds to 1.1 seconds, and the source code size was reduced by a factor of 4. My diagnostic tool was manual stack sampling (see the P.S. below). The sequence of changes was as follows:

  • The first problem found was use of list clusters (now called "iterators" and "container classes") accounting for over half the time. Those were replaced with fairly simple code, bringing the time down to 20 seconds.

  • Now the largest time-taker is more list-building. As a percentage, it was not so big before, but now it is because the bigger problem was removed. I find a way to speed it up, and the time drops to 17 seconds.

  • Now it is harder to find obvious culprits, but there are a few smaller ones that I can do something about, and the time drops to 13 seconds.

Now I seem to have hit a wall. The samples are telling me exactly what it is doing, but I can't seem to find anything that I can improve. Then I reflect on the basic design of the program, on its transaction-driven structure, and ask if all the list-searching that it is doing is actually mandated by the requirements of the problem.

Then I hit upon a re-design, where the program code is actually generated (via preprocessor macros) from a smaller set of source, and in which the program is not constantly figuring out things that the programmer knows are fairly predictable. In other words, don't "interpret" the sequence of things to do, "compile" it.
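
To make that concrete, here is a minimal sketch of the idea (not the article's actual code; the step names and functions are invented for illustration):

    /* Interpreted style: the program re-discovers the sequence of steps
     * for every transaction by walking a data structure. */
    #include <stddef.h>
    #include <stdio.h>

    enum step { STEP_VALIDATE, STEP_POST };

    static void validate(int trn) { printf("validate %d\n", trn); }
    static void post(int trn)     { printf("post %d\n", trn); }

    static void run_interpreted(int trn)
    {
        static const enum step steps[] = { STEP_VALIDATE, STEP_POST };
        for (size_t i = 0; i < sizeof steps / sizeof steps[0]; i++) {
            switch (steps[i]) {
            case STEP_VALIDATE: validate(trn); break;
            case STEP_POST:     post(trn);     break;
            }
        }
    }

    /* "Compiled" style: a macro expands the sequence the programmer
     * already knows into straight-line code at build time, so no
     * dispatch is left at run time. */
    #define RUN_TRANSACTION(trn) do { validate(trn); post(trn); } while (0)

    int main(void)
    {
        run_interpreted(1);
        RUN_TRANSACTION(2);
        return 0;
    }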

  • That redesign is done, shrinking the source code by a factor of 4, and the time is reduced to 10 seconds.

Now, because it's getting so quick, it's hard to sample, so I give it 10 times as much work to do, but the following times are based on the original workload.

  • More diagnosis reveals that it is spending time in queue-management. In-lining these reduces the time to 7 seconds.

  • Now a big time-taker is the diagnostic printing I had been doing. Flush that - 4 seconds.

  • Now the biggest time-takers are calls to malloc and free. Recycle objects - 2.6 seconds. (A minimal sketch of this recycling follows the total below.)

  • Continuing to sample, I still find operations that are not strictly necessary - 1.1 seconds.

Total speedup factor: 43.6
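
For flavor, the object-recycling step might look something like this minimal free-list sketch (hypothetical type and field names, not the original code):

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical op object; field names invented for illustration. */
    struct op {
        struct op *next_free;   /* link used only while on the free list */
        int payload;
    };

    static struct op *free_ops = NULL;   /* singly linked free list */

    static struct op *op_new(void)
    {
        struct op *p = free_ops;
        if (p != NULL)
            free_ops = p->next_free;     /* reuse a recycled object */
        else
            p = malloc(sizeof *p);       /* fall back to the allocator */
        return p;
    }

    static void op_delete(struct op *p)
    {
        p->next_free = free_ops;         /* push onto free list, no free() */
        free_ops = p;
    }

    int main(void)
    {
        struct op *a = op_new();
        op_delete(a);                    /* recycled... */
        struct op *b = op_new();         /* ...and reused: b == a here */
        printf("%s\n", a == b ? "recycled" : "fresh");
        return 0;
    }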

Now no two programs are alike, but in non-toy software I've always seen a progression like this. First you get the easy stuff, and then the more difficult, until you get to a point of diminishing returns. Then the insight you gain may well lead to a redesign, starting a new round of speedups, until you again hit diminishing returns. Now this is the point at which it might make sense to wonder whether ++i or i++ or for(;;) or while(1) are faster: the kinds of questions I see so often on Stack Overflow.

P.S. It may be wondered why I didn't use a profiler. The answer is that almost every one of these "problems" was a function call site, which stack samples pinpoint. Profilers, even today, are just barely coming around to the idea that statements and call instructions are more important to locate, and easier to fix, than whole functions.

I actually built a profiler to do this, but for a real down-and-dirty intimacy with what the code is doing, there's no substitute for getting your fingers right in it. It is not an issue that the number of samples is small, because none of the problems being found are so tiny that they are easily missed.

ADDED: jerryjvl requested some examples. Here is the first problem. It consists of a small number of separate lines of code, together taking over half the time:

    /* IF ALL TASKS DONE, SEND ITC_ACKOP, AND DELETE OP */
    if (ptop->current_task >= ILST_LENGTH(ptop->tasklist)) {
    . . .
    /* FOR EACH OPERATION REQUEST */
    for (ptop = ILST_FIRST(oplist); ptop != NULL; ptop = ILST_NEXT(oplist, ptop)) {
    . . .
    /* GET CURRENT TASK */
    ptask = ILST_NTH(ptop->tasklist, ptop->current_task);

These were using the list cluster ILST (similar to a list class). They are implemented in the usual way, with "information hiding" meaning that the users of the class were not supposed to have to care how they were implemented. When these lines were written (out of roughly 800 lines of code), no thought was given to the idea that they could be a "bottleneck" (I hate that word). They are simply the recommended way to do things. It is easy to say in hindsight that they should have been avoided, but in my experience all performance problems are like that. In general, it is good to try to avoid creating performance problems. It is even better to find and fix the ones that are created, even though they "should have been avoided" (in hindsight). I hope that gives a bit of the flavor.

Here is the second problem, in two separate lines:

    /* ADD TASK TO TASK LIST */
    ILST_APPEND(ptop->tasklist, ptask);
    . . .
    /* ADD TRANSACTION TO TRANSACTION QUEUE */
    ILST_APPEND(trnque, ptrn);

These are building lists by appending items to their ends. (The fix was to collect the items in arrays, and build the lists all at once.) The interesting thing is that these statements only cost (i.e. were on the call stack) 3/48 of the original time, so they were not in fact a big problem at the beginning. However, after removing the first problem, they cost 3/20 of the time and so were now a "bigger fish". In general, that's how it goes.
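
A rough illustration of that fix (hypothetical node type, not the original ILST code): collect the items in an array, then link the whole list in a single pass, instead of one append (with its bookkeeping) per item:

    #include <stddef.h>
    #include <stdio.h>

    struct task { struct task *next; int id; };   /* hypothetical node type */

    /* Link an array of already-collected items into a singly linked list
     * in one pass. */
    static struct task *build_list(struct task *items, size_t n)
    {
        if (n == 0)
            return NULL;
        for (size_t i = 0; i + 1 < n; i++)
            items[i].next = &items[i + 1];
        items[n - 1].next = NULL;
        return &items[0];
    }

    int main(void)
    {
        struct task pool[3] = { { 0, 1 }, { 0, 2 }, { 0, 3 } };
        for (struct task *t = build_list(pool, 3); t != NULL; t = t->next)
            printf("task %d\n", t->id);
        return 0;
    }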

I might add that this project was distilled from a real project I helped on. In that project, the performance problems were far more dramatic (as were the speedups), such as calling a database-access routine within an inner loop to see if a task was finished.

REFERENCE ADDED: The source code, both original and redesigned, can be found at www.ddj.com, under 1993, in file 9311.zip, as files slug.asc and slug.zip.

EDIT 2011/11/26: There is now a SourceForge project containing source code in Visual C++, along with a blow-by-blow description of how it was tuned. It only goes through the first half of the scenario described above, and it doesn't follow exactly the same sequence, but it still gets a speedup of 2-3 orders of magnitude.

Solution 2

Suggestions:

  • Pre-compute rather than re-calculate: for any loops or repeated calls that contain calculations with a relatively limited range of inputs, consider building a lookup table (array or dictionary) that contains the result of that calculation for all values in the valid range of inputs, then use a simple lookup inside the algorithm instead (a sketch follows this list).
    Down-sides: if few of the pre-computed values are actually used, this may make matters worse; the lookup table may also take significant memory.
  • Don't use library methods: most libraries need to be written to operate correctly under a broad range of scenarios, and they perform null checks on parameters, and so on. By re-implementing a method you may be able to strip out a lot of logic that does not apply in the exact circumstances where you are using it.
    Down-sides: writing additional code means more surface area for bugs.
  • Do use library methods: to contradict myself, language libraries get written by people who are a lot smarter than you or me; odds are they did it better and faster. Do not implement it yourself unless you can actually make it faster (i.e.: always measure!)
  • Cheat: in some cases, although an exact calculation may exist for your problem, you may not need 'exact'; sometimes an approximation may be 'good enough' and a lot faster into the bargain. Ask yourself: does it really matter if the answer is out by 1%? 5%? Even 10%?
    Down-sides: well... the answer won't be exact.
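
As a minimal sketch of the first suggestion (the table size and the sin() example are invented for illustration):

    #include <math.h>
    #include <stdio.h>

    #define N_INPUTS 256   /* assumed valid input range: 0..255 */
    static double sin_table[N_INPUTS];

    /* Fill the table once, up front. */
    static void init_table(void)
    {
        for (int i = 0; i < N_INPUTS; i++)
            sin_table[i] = sin(i * (2.0 * 3.14159265358979 / N_INPUTS));
    }

    /* Inside the hot loop: one array read instead of a libm call. */
    static double fast_sin(int i)
    {
        return sin_table[i & (N_INPUTS - 1)];   /* N_INPUTS is a power of two */
    }

    int main(void)
    {
        init_table();
        printf("%f\n", fast_sin(64));   /* ~sin(pi/2) = 1.0 */
        return 0;
    }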

Solution 3

When you can't improve the performance any more - see if you can improve the perceived performance instead.

You may not be able to make your fooCalc algorithm faster, but often there are ways to make your application seem more responsive to the user.

A few examples:

  • anticipating what the user is going to request and starting work on it before they ask
  • displaying results as they come in, instead of all at once at the end
  • showing an accurate progress meter

These won't make your program faster, but they might make your users happier with the speed you have.

Solution 4

I spend most of my life in just this place. The broad strokes are to run your profiler and get it to record:

  • Cache misses. Data cache is the #1 source of stalls in most programs. Improve cache hit rate by reorganizing offending data structures to have better locality; pack structures and numerical types down to eliminate wasted bytes (and therefore wasted cache fetches); prefetch data wherever possible to reduce stalls. (A struct-packing sketch follows this list.)
  • Load-hit-stores. Compiler assumptions about pointer aliasing, and cases where data is moved between disconnected register sets via memory, can trigger a pathological behavior in which the entire CPU pipeline clears on a load op. Find places where floats, vectors, and ints are being cast to one another and eliminate them. Use __restrict liberally to make promises to the compiler about aliasing.
  • Microcoded operations. Most processors have some operations that cannot be pipelined, but instead run a tiny subroutine stored in ROM. Examples on the PowerPC are integer multiply, divide, and shift-by-variable-amount. The problem is that the entire pipeline stops dead while this operation is executing. Try to eliminate use of these operations or at least break them down into their constituent pipelined ops so you can get the benefit of superscalar dispatch on whatever the rest of your program is doing.
  • Branch mispredicts. These too empty the pipeline. Find cases where the CPU is spending a lot of time refilling the pipe after a branch, and use branch hinting if available to get it to predict correctly more often. Or better yet, replace branches with conditional-moves wherever possible, especially after floating point operations because their pipe is usually deeper and reading the condition flags after fcmp can cause a stall.
  • Sequential floating-point ops. Make these SIMD.
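
As a sketch of the struct-packing point above (hypothetical structure; the byte counts assume a typical LP64 ABI and 64-byte cache lines):

    #include <stdint.h>
    #include <stdio.h>

    /* Before: 24 bytes per element, padding included. */
    struct particle_fat {
        double x;        /* 8 bytes */
        int    alive;    /* 4 bytes + 4 bytes of padding */
        long   flags;    /* 8 bytes, though only a few bits are used */
    };

    /* After: 8 bytes per element; a scan over an array of these touches
     * a third of the memory, so far fewer cache lines are fetched. */
    struct particle_packed {
        float   x;       /* 4 bytes: full double precision not needed here */
        uint8_t alive;   /* 1 byte */
        uint8_t flags;   /* 1 byte (+ 2 bytes of tail padding) */
    };

    int main(void)
    {
        printf("fat: %zu bytes, packed: %zu bytes\n",
               sizeof(struct particle_fat), sizeof(struct particle_packed));
        return 0;
    }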

And one more thing I like to do:

  • Set your compiler to output assembly listings and look at what it emits for the hotspot functions in your code. All those clever optimizations that "a good compiler should be able to do for you automatically"? Chances are your actual compiler doesn't do them. I've seen GCC emit truly WTF code.
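
With GCC, for example, something like this emits an annotated listing you can read alongside the source (hotspot.c is a placeholder name):

    gcc -O2 -S -fverbose-asm hotspot.c   # writes hotspot.s with comments tying asm back to variables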

Solution 5

Throw more hardware at it!


Comments

  • jerryjvl
    jerryjvl almost 2 years

    There are plenty of performance questions on this site already, but it occurs to me that almost all are very problem-specific and fairly narrow. And almost all repeat the advice to avoid premature optimization.

    Let's assume:

    • the code already is working correctly
    • the algorithms chosen are already optimal for the circumstances of the problem
    • the code has been measured, and the offending routines have been isolated
    • all attempts to optimize will also be measured to ensure they do not make matters worse

    What I am looking for here is strategies and tricks to squeeze out the last few percent in a critical algorithm when there is nothing else left to do but whatever it takes.

    Ideally, try to make answers language-agnostic, and indicate any down-sides of the suggested strategies where applicable.

    I'll add a reply with my own initial suggestions, and look forward to whatever else the Stack Overflow community can think of.

  • Stefan Thyberg
    Stefan Thyberg about 15 years
    My thoughts exactly. When you start talking about the "last few percent" when there is nothing left, the clearly cheapest way of making it run faster is faster hardware, rather than spending a ton of programmer time on squeezing those last few percent out of it.
  • jerryjvl
    jerryjvl about 15 years
    +1 an answer that especially managers seem to overlook too often ;)
  • jerryjvl
    jerryjvl about 15 years
    Although I want to encourage more software-oriented answers as well, because sometimes you may already have developers with spare time on their hands and no money to invest in hardware... or you may already have the fastest available hardware and still not enough performance. Especially considering that 'faster hardware' is looking less and less likely to be a never-ending path, and not everything parallelizes.
  • Adam Rosenfield
    Adam Rosenfield about 15 years
    Precomputation doesn't always help, and it can even hurt sometimes -- if your lookup table is too big, it can kill your cache performance.
  • Doug T.
    Doug T. about 15 years
    more hardware isn't always an option when you have software that is expected to run on hardware already out in the field.
  • Crashworks
    Crashworks about 15 years
    Not a very helpful answer to someone making consumer software: the customer isn't going to want to hear you say, "buy a faster computer." Especially if you're writing software to target something like a video game console.
  • timday
    timday about 15 years
    If you believe Moore's Law runs at x2/18 months, that's a compounded rate of ~0.13%/day (and it works weekends and doesn't take holidays :^). Busting a gut on last-few-percent optimization has to be weighed against what you can get for free by just waiting. If you take a long enough view, this applies just as much in the console/consumer/embedded space (the performance graphs are just more quantized), and you can still make a case for moving on to the next thing rather than investing time on something with small and diminishing returns.
  • jerryjvl
    jerryjvl about 15 years
    What profiler(s) do you use? And are any of them applicable to C# .NET development? I'd love to try some of these myself.
  • RBerteig
    RBerteig about 15 years
    @Crashworks, or for that matter, an embedded system. When the last feature is finally in and the first batch of boards are already spun is not the moment to discover that you should have used a faster CPU in the first place...
  • RBerteig
    RBerteig about 15 years
    Cheating can often be the win. I had a color correction process that at the core was a 3-vector dotted with a 3x3 matrix. The CPU had a matrix multiply in hardware that left out some of the cross terms and went real fast compared to all the other ways to do it, but only supported 4x4 matrices and 4-vectors of floats. Changing the code to carry around the extra empty slot and converting the calculation to floating point from fixed point allowed for a slightly less-accurate but much faster result.
  • jerryjvl
    jerryjvl about 15 years
    I'd love to read some of the details of the steps you outline above. Is it possible to include some fragments of the optimizations for flavour? (without making the post too long?)
  • Mike Dunlavey
    Mike Dunlavey about 15 years
    @jerryjvl: I'll see if I can post the sequence of sources (and maybe page images) somewhere. Bear with me.
  • Mike Dunlavey
    Mike Dunlavey about 15 years
    ... I'm not sure if I've got copyright issues with Dr. Dobbs.
  • Mike Dunlavey
    Mike Dunlavey about 15 years
    ... I also wrote a book that's now out of print, so it's going for a ridiculous price on Amazon - "Building Better Applications" ISBN 0442017405. Essentially the same material is in the first chapter.
  • Venedictos
    Venedictos about 15 years
    Should be "switch from GCC to LLVM" :)
  • Crashworks
    Crashworks about 15 years
    I mostly use Intel VTune and PIX. No idea if they can adapt to C#, but really once you've got that JIT abstraction layer most of these optimizations are beyond your reach, except for improving cache locality and maybe avoiding some branches.
  • jerryjvl
    jerryjvl about 15 years
    Even so, checking on the post-JIT output may help figure out if there are any constructs that just do not optimize well through the JIT stage... investigation can never hurt, even if it turns out to be a dead end.
  • Mike Dunlavey
    Mike Dunlavey almost 15 years
    + Yes, these are things that can be done when you have found a problem. The part I would emphasize is the "finding" skill. Most folks would say "All you gotta do is profile / measure", but to me, that is not enough. What's more, the repetition is key.
  • MSN
    MSN almost 15 years
    Technically, that's the performance strategy of first resort.
  • Ankit Roy
    Ankit Roy almost 14 years
    I once had to debug a program that had a huge memory leak -- its VM size grew by about 1 MB per hour. A colleague joked that all I needed to do was add memory at a constant rate. :)
  • Olof Forshell
    Olof Forshell about 13 years
    More hardware: ah yes the mediocre developer's lifeline. I don't know how many times I've heard "add another machine and double the capacity!"
  • matbrgz
    matbrgz about 13 years
  • Mike Dunlavey
    Mike Dunlavey about 13 years
    @Thorbjørn: Yeah. Pretty amazing huh? I've only got 2 copies myself (plus 3 copies in Chinese) that I'm hanging onto.
  • matbrgz
    matbrgz about 13 years
    @Mike Dunlavey, Considered allowing Google to scan it in?
  • Mike Dunlavey
    Mike Dunlavey about 13 years
    @Thorbjørn: I guess I don't know how to do that, also I don't know if it's allowed, especially since the publisher has been bought out a couple times over. I have scanned it myself and sent it to a couple interested people, but in jpg form it's a bit massive to email.
  • matbrgz
    matbrgz about 13 years
    @Mike Dunlavey, I would suggest telling Google you have it scanned in already. They probably already have an agreement with whoever bought your publisher.
  • Sylverdrag
    Sylverdrag about 13 years
    All depends on how much of a problem it is for the user, and whether or not you can create more value for the users by spending the same amount of time creating new features. Some programmers are willing to spend 6 months shaving a couple of seconds off a little-used feature, but refuse to spend 5 minutes to implement a shortcut that would save the users several minutes of work every day. All in the name of performance. And no, it's not a rhetorical example.
  • BlueRaja - Danny Pflughoeft
    BlueRaja - Danny Pflughoeft about 13 years
    I think many people, including myself, would be interested in this "wtf assembly" produced by gcc. Yours sounds like a very interesting job :)
  • justin
    justin about 13 years
    Can you make your algorithm run in parallel? -- the inverse also applies
  • Johan Kotlinski
    Johan Kotlinski about 13 years
    True that; reducing the number of threads can be an equally good optimization.
  • sehe
    sehe almost 13 years
    Though the question is language-agnostic, let me mention that with the advent of C++0x (including move semantics and extended const rvalue-reference lifetime extension) the compiler will (many times) be able to elide copies (NRVO, URVO), but only if the parameter was passed by value. End answer: profile and understand your hotspots.
  • Andrew Neely
    Andrew Neely almost 13 years
    That would be a valid approach after all other approaches failed, or if a specific OS or Framework feature was responsible for markedly decreased performance, but the level of expertise and control needed to pull that off may not be available to every project.
  • Mike Dunlavey
    Mike Dunlavey over 12 years
    @Thorbjørn: Just to follow up, I did hook up with GoogleBooks, filled out all forms, and sent them a hard copy. I got an email back asking if I really really owned the copyright. The publisher Van Nostrand Reinhold, which was bought by International Thompson, which was bought by Reuters, and when I try to call or email them it's like a black hole. So it's in limbo - I haven't yet had the energy to really chase it down.
  • matbrgz
    matbrgz over 12 years
    @RBerteig, out of curiosity - which CPU was this?
  • RBerteig
    RBerteig over 12 years
    @Thorbjørn, it was a Hitachi SH4 family member. I'd have to dig into a dusty archive box to identify the specific part number. At least it did have hardware floating point support, so switching from fixed point to floating point didn't introduce any other huge burdens.
  • Bryan Boettcher
    Bryan Boettcher over 12 years
    +1 for the flyswatter "smack" sound I heard while reading the last sentence.
  • leftaroundabout
    leftaroundabout over 12 years
    @RBerteig how was this cheating? in using floating point? That's not cheating, floating point is more accurate than integer for data that was originally analog (like the color correction algorithm probably was fed with). What's really cheating is to use integers to represent analog data!
  • RBerteig
    RBerteig over 12 years
    The cheating was in using a matrix multiply that left out some of the inner products, making it possible to implement it in microcode as a single CPU instruction that completed faster than even the equivalent sequence of individual instructions could. It's a cheat because it doesn't get the "correct" answer, just an answer that is "correct enough".
  • vsz
    vsz over 12 years
    Not at all. Not everyone develops web applications or overdesigned GUIs with memory-managed development systems which run on high-end systems with gigabytes of RAM, where you can allow a small integer to use kilobytes of memory. Even today, there is need for systems where the entire hardware cost for the embedded electronics must be under $1. For such firmware, every byte counts even today.
  • Marco van de Voort
    Marco van de Voort over 12 years
    I actually work in embedded (microchip, motor control), and even there you don't need that 99.9% of the time. And the few cases that do (like small bulk electronic equipment) are mostly based on older parts. Not much new development going on there. Give or take the odd new "wellness" category device.
  • vsz
    vsz over 12 years
    I work in embedded too, and I need to use it all the time. Maybe you work in an area where you have a big enough processor, a lot of space on your layout, and the number of units produced is very small compared to the task complexity, so the development costs are more important than the unit costs. That is not always the case.
  • ysap
    ysap over 12 years
    I'm totally with @vsz here. So many times have I run into memory problems even when programming a moderately large DSP. It really depends on the specific application being implemented. For example, trying to push modem software onto a BlackFin DSP, you run into a wall very early in the design process... which forces you to start thinking cleverly and out-of-the-box about solving the memory issues.
  • Bryan Boettcher
    Bryan Boettcher over 12 years
    Be careful doing this. Somebody wasn't when they wrote the firmware for my thermostat, and if I leave it in "auto" mode (to switch between hot and cool) it will blast the AC when it hits -1F outside because of a signed -> unsigned conversion. The thermostat thinks it's 255F out :(
  • Olof Forshell
    Olof Forshell about 12 years
    A state-of-the-art server park running a managed web server application often consumes huge amounts of power because the application is inefficient to begin with. Maybe power consumption plays second fiddle to having something up and running. Why shouldn't architecture goals also aim for a target environment requiring just a few servers? It isn't THAT difficult.
  • Emil Vikström
    Emil Vikström almost 12 years
    A progress bar speeding up at the end may be perceived as faster than an absolutely accurate one. In "Rethinking the Progress Bar" (2007), Harrison, Amento, Kuznetsov and Bell test multiple types of bars on a group of users, as well as discussing some ways to rearrange the operations so that the progress may be perceived as faster.
  • Billy ONeal
    Billy ONeal over 11 years
    Examples on the PowerPC ... <-- That is, some implementations of PowerPC. PowerPC is an ISA, not a CPU.
  • Crashworks
    Crashworks over 11 years
    @BillyONeal True, but until someone manufactures a PPC implementation that doesn't have a microcoded imul and idiv, the distinction is sort of academic.
  • Billy ONeal
    Billy ONeal over 11 years
    @Crashworks: A G5 or a full blown Cell/POWER6/POWER7 (the server version) require microcode for a simple multiply? I can see divide being microcoded but I find multiply unusual.
  • Crashworks
    Crashworks over 11 years
    @BillyONeal For integer multiply, yes, on all those cores. I think it's probably because of the 128-bit precision (64 bits times 64 bits needs a high and low word)... it might be more precise to say that it's nonpipelined rather than microcoded since I don't know the internal circuitry. Floating point math is pipelined on those implementations, although the pipe is quite deep so you can't actually read the result for many cycles without stalling. I spent over a year of my life fretting over the exact cycle count for Cell opcodes and manually scheduling around the pipeline.
  • Martin Thompson
    Martin Thompson over 11 years
    @RBerteig: just "correct enough" is an opportunity for optimisation that most people miss in my experience.
  • Billy ONeal
    Billy ONeal over 11 years
    @Crashworks: :sigh: wow, that's insane. I can't believe there are still CPUs in common use that require microcoded instructions for every array access.
  • Billy ONeal
    Billy ONeal over 11 years
    Note that "moving the IO to a parallel thread" should be done as asynchronous IO on many platforms (e.g. Windows NT).
  • Crashworks
    Crashworks over 11 years
    @BillyONeal Even on modern x86 hardware, imul can stall the pipeline; see "Intel® 64 and IA-32 Architectures Optimization Reference Manual" §13.3.2.3: "Integer multiply instruction takes several cycles to execute. They are pipelined such that an integer multiply instruction and another long-latency instruction can make forward progress in the execution phase. However, integer multiply instructions will block other single-cycle integer instructions from issuing due to requirement of program order." That's why it's usually better to use word-aligned array sizes and lea.
  • Billy ONeal
    Billy ONeal about 11 years
    @Crashworks: I could see it having higher latency than, say, addition. But 40-50 cycle and/or pipeline stalls are kind of extreme for that. (Particularly given that lea is a multiply instruction and is frequently abused for this purpose)
  • Emil Vikström
    Emil Vikström almost 11 years
    naxa, most progress bars are fake because predicting multiple widely differing steps of a flow as a single percentage is hard or sometimes impossible. Just look at all those bars that get stuck at 99% :-(
  • cmaster - reinstate monica
    cmaster - reinstate monica almost 11 years
    Add this: never assume your algorithm is optimal unless you can prove it. Frequently, it is possible to use cheating/application-specific knowledge to drop your algorithm into a lower complexity class. Would you have known, for instance, that there are cases where sorting can be done in O(n) instead of O(n log n)? Ponder this on an extended walk / traffic jam / sleepless night / whatever helps you to have bright ideas...
  • cmaster - reinstate monica
    cmaster - reinstate monica almost 11 years
    I/O is indeed a critical point, because it is slow and has huge latencies, and you can get faster with this advice, but it's still fundamentally flawed: the points are the latency (which has to be hidden) and the syscall overhead (which has to be reduced by reducing the number of I/O calls). Best advice: use mmap() for input, make appropriate madvise() calls, and use aio_write() to write large chunks of output (= a few MiB).
  • Andreas Reiff
    Andreas Reiff almost 11 years
    About looping backwards: yes, the comparison for the loop end will be faster. Typically you use the variable to index into memory, though, and accessing it reversed may be counterproductive due to frequent cache misses (no prefetch).
  • BobMcGee
    BobMcGee almost 11 years
    This last option is fairly easy to implement in Java, especially. It gave HUGE performance increases for applications I've written. Another important point (more than moving I/O upfront) is to make it SEQUENTIAL and large-block I/O. Lots of small reads are far more expensive than one big one, due to disk seek time.
  • varnie
    varnie over 10 years
    I would add to your first paragraph: do not forget to turn off all the debugging info in your compiler options.
  • Mike Dunlavey
    Mike Dunlavey over 10 years
  • Venemo
    Venemo over 10 years
    This answer actually made me laugh :D
  • matbrgz
    matbrgz almost 10 years
  • Mike Dunlavey
    Mike Dunlavey almost 10 years
    @Thorbjørn: Thanks for noticing that, and I hope you are doing well. It still only shows the first chunk of the book. I kinda gave up trying to figure out who owns the copyright and posting the whole thing. Sometimes I email it to people who ask.
  • Gaius
    Gaius almost 10 years
    This is why I like that in OCaml, casting between numeric types must be explicit.
  • v.oddou
    v.oddou almost 10 years
    You cannot always assume that everybody is more intelligent than you. In the end we are all professionals. You can assume, however, that a specific library you use exists and has reached your environment because of its quality, so its writing must have been very thorough. You can't do it as well only because you are not specialized in that field and don't invest the same kind of time in it. Not because you are less smart. Come on.
  • ideasman42
    ideasman42 almost 10 years
    @Gaius fair point - but in many cases changing languages isn't a realistic choice. Since C/C++ are so widely used, it's useful to be able to make them more strict, even if it's compiler-specific.
  • Peter Cordes
    Peter Cordes almost 9 years
    @Crashworks: That Intel ref manual must be old (or not updated). On current Intel designs, imul r64 is 2 uops, 3 cycle latency, and is fully pipelined (can issue 1 per cycle, on port 1). On P4, imul r32 is 16 cycle latency, reciprocal throughput one per 8 cycles. lea is 1 or 3 cycle latency (simple or complex/rip-relative addressing mode). lea is often really useful as a non-destructive shift+add+add_offset. Often a good move is to increment your loop counter by 8 or 16, instead of having to multiply or use a complex addressing mode.
  • Devolus
    Devolus over 8 years
    I just optimized some code that ran ~11 hours and couldn't be fully completed because 64 GB of RAM were not enough. The newer hardware (128 GB RAM and faster CPU/disks) took about 6.5 hours for the same data, which was obviously better. After my optimizations the OLD machine took 1:40 minutes and could complete the whole data set within the 64 GB of RAM, while the new machine is now down to 1:15 minutes total time. So yes, adding hardware can help, but proper optimization, which includes the whole production process, can drastically improve utilization. And such servers are not exactly cheap either.
  • underscore_d
    underscore_d over 8 years
    AFAIK, in most cases, any reasonable optimiser will do just fine with loops, without the programmer having to explicitly run in reverse. Either the optimiser will reverse the loop itself, or it has another way that's equally good. I've noted identical ASM output for (admittedly relatively simple) loops written both ascending vs max and descending vs 0. Sure, my Z80 days have me in the habit of reflexively writing backwards loops, but I suspect mentioning it to newbies is usually a red herring/premature optimisation, when readable code & learning more important practices should be priorities.
  • underscore_d
    underscore_d over 8 years
    not useful: C++11 aside (but very good point), in the languages that allow passing by value, avoiding it when appropriate is surely a basic mantra at the early levels of learning, not an "optimisation strategy of last resort" as asked in the question.
  • Peter Cordes
    Peter Cordes over 6 years
    Load-hit-store is also a PowerPC-specific problem. Or at least it doesn't affect x86 at all. FP store / integer reload is no slower than FP store / FP reload, and the other direction is fine too. (FP includes SSE/AVX). On typical Intel CPUs, it's about 6 cycle store-forwarding latency. (Of course, there are also ALU instructions to move data directly between vector and integer registers, with lower latency but potentially worse throughput for getting all the elements of a vector.)
  • Peter Cordes
    Peter Cordes over 6 years
    re: micro-optimizing: if you check the compiler's asm output, you can often tweak the source to hand-hold it into producing better asm. See Why is this C++ code faster than my hand-written assembly for testing the Collatz conjecture? for more about helping or beating the compiler on modern x86.
  • Crashworks
    Crashworks over 6 years
    @PeterCordes Moving data between SSE registers and integer registers used to be abysmally slow because of an LHS (hidden inside the architecture). Maybe that's improved since 2013, I haven't timed it recently.
  • Peter Cordes
    Peter Cordes over 6 years
    @Crashworks: Are you talking about AMD CPUs? integer<->xmm registers isn't great on AMD, but it's just a medium-latency ALU instruction at worst, not a mis-speculation that requires a rollback or anything. It's always been single-cycle on Intel for movd eax, xmm0 or movd xmm0, eax. On AMD Jaguar, those are 4 and 6 cycle latency respectively, with one per clock throughput, according to agner.org/optimize. They're worse on Bulldozer/Piledriver, like 8 and 10c latency, but still good throughput and doesn't stall the pipeline or cause mis-speculation. So it's not a "load hit store"
  • Peter Cordes
    Peter Cordes over 6 years
    Bulldozer has slow-ish store-forwarding normally, even integer store / integer reload is 8c round trip vs. 5c on Intel. (Or even 4c with simple addressing modes on Skylake.)
  • Jack G
    Jack G over 6 years
    A good example of "cheating" is to sometimes use faster methods for unintended purposes rather than slower methods for intended purposes. For example, if you are simulating a "drug trip" in a video game, instead of adding "rand()" to every vertex in the renderer, you could instead use en.wikipedia.org/wiki/Fast_inverse_square_root with single-precision floating point everywhere, resulting in unprecedentedly more performant blurriness.
  • Jack G
    Jack G over 6 years
    On the contrary, running a loop backwards will be slower in lower-level languages, because in a contest between a comparison to zero plus an additional subtraction and a single integer comparison, the single integer comparison is faster. Instead of decrementing, you can have a pointer to the start address in memory and a pointer to the end address in memory, then increment the start pointer until it is equal to the end pointer. This eliminates the extra memory-offset operation in the assembly code, and thus proves much more performant.
  • M.D.
    M.D. about 5 years
    At one point I cheated at avoiding I/O by just temporarily moving all the files to a RAM disk before the computation and moving them back afterwards. This is dirty, but might be useful in situations where you do not control the logic that makes the I/O calls.