Is it faster to count down than it is to count up?

22,821

Solution 1

Is it really true? And if so, does anyone know why?

In ancient days, when computers were still chipped out of fused silica by hand, when 8-bit microcontrollers roamed the Earth, and when your teacher was young (or your teacher's teacher was young), there was a common machine instruction called decrement and skip if zero (DSZ). Hotshot assembly programmers used this instruction to implement loops. Later machines got fancier instructions, but there were still quite a few processors on which it was cheaper to compare something with zero than to compare with anything else. (It's true even on some modern RISC machines, like PPC or SPARC, which reserve a whole register to be always zero.)

So, if you rig your loops to compare with zero instead of N, what might happen?

  • You might save a register
  • You might get a compare instruction with a smaller binary encoding
  • If a previous instruction happens to set a flag (likely only on x86 family machines), you might not even need an explicit compare instruction

Are these differences likely to result in any measurable improvement on real programs on a modern out-of-order processor? Highly unlikely. In fact, I'd be impressed if you could show a measurable improvement even on a microbenchmark.

Summary: I smack your teacher upside the head! You shouldn't be learning obsolete pseudo-facts about how to organize loops. You should be learning that the most important thing about loops is to be sure that they terminate, produce correct answers, and are easy to read. I wish your teacher would focus on the important stuff and not mythology.

Solution 2

Here's what might happen on some hardware, depending on what the compiler can deduce about the range of the numbers you're using: with the incrementing loop you have to test i < N each time round the loop. For the decrementing version, the carry flag (set as a side effect of the subtraction) may automatically tell you whether i >= 0. That saves a test each time round the loop.

In reality, on modern pipelined processor hardware, this stuff is almost certainly irrelevant, as there isn't a simple one-to-one mapping from instructions to clock cycles. (Though I could imagine it coming up if you were doing things like generating precisely timed video signals from a microcontroller. But then you'd write in assembly language anyway.)

Solution 3

In the Intel x86 instruction set, building a loop to count down to zero can usually be done with fewer instructions than a loop that counts up to a non-zero exit condition. Specifically, the ECX register is traditionally used as a loop counter in x86 asm, and the Intel instruction set has a special jcxz jump instruction that tests the ECX register for zero and jumps based on the result of the test.

However, the performance difference will be negligible unless your loop is already very sensitive to clock cycle counts. Counting down to zero might shave 4 or 5 clock cycles off each iteration of the loop compared to counting up, so it's really more of a novelty than a useful technique.

Also, a good optimizing compiler these days should be able to convert your count-up loop source code into count-down-to-zero machine code (depending on how you use the loop index variable), so there really isn't any reason to write your loops in strange ways just to squeeze a cycle or two here and there.

Solution 4

Yes..!!

Counting from N down to 0 can be slightly faster than counting from 0 up to N, because of how the hardware handles the comparison.

Note the comparison in each loop

i>=0
i<N

Most processors can compare with zero cheaply (often the flags are set as a side effect of the decrement itself), so the first condition can be translated to machine code as:

  1. Load i
  2. Compare with zero and branch back to the loop while i >= 0

But the second one also needs N available, in the worst case reloaded from memory each time:

  1. Load i
  2. Load N
  3. Subtract N from i
  4. Branch back to the loop while the result is negative (i.e. i < N)

So it is not because of counting down or up, but because of how your code is translated into machine code.

So counting from 10 to 100 is the same as counting from 100 to 10,
but counting from i=100 to 0 is faster than from i=0 to 100 in most cases,
and counting from i=N to 0 is faster than from i=0 to N.

  • Note that nowadays compilers may do this optimization for you (if they are smart enough)
  • Note also that pipelining can cause a Belady's-anomaly-like effect (you cannot be sure in advance which will be better)
  • At last: please note that the two for loops you presented are not equivalent; the first prints one more * ....

Related: Why does n++ execute faster than n=n+1?

Solution 5

In C to pseudo-assembly:

for (i = 0; i < 10; i++) {
    foo(i);
}

turns into

    clear i
top_of_loop:
    call foo
    increment i
    compare i, 10
    jump_less top_of_loop

while:

for (i = 10; i >= 0; i--) {
    foo(i);
}

turns into

    load i, 10
top_of_loop:
    call foo
    decrement i
    jump_not_neg top_of_loop

Note the lack of the compare in the second pseudo-assembly. On many architectures there are flags that are set by arithmetic operations (add, subtract, multiply, divide, increment, decrement) which you can use for jumps. These often give you what is essentially a comparison of the result of the operation with 0 for free. In fact, on many architectures

x = x - 0

is semantically the same as

compare x, 0

Also, the compare against 10 in my example could result in worse code. The 10 may have to live in a register, so if registers are in short supply that costs you, and may result in extra code to move things around or to reload the 10 every time through the loop.

Compilers can sometimes rearrange the code to take advantage of this, but it is often difficult because they are often unable to be sure that reversing the direction through the loop is semantically equivalent.

Author: Bob

Updated on November 29, 2020

Comments

  • Bob
    Bob over 3 years

    Our computer science teacher once said that for some reason it is more efficient to count down than to count up. For example, if you need to use a FOR loop and the loop index is not used anywhere (like printing a line of N * characters to the screen), I mean that code like this:

    for (i = N; i >= 0; i--)  
      putchar('*');  
    

    is better than:

    for (i = 0; i < N; i++)  
      putchar('*');  
    

    Is it really true? And if so, does anyone know why?

  • Tommy Jakobsen
    Tommy Jakobsen almost 14 years
    Never seen people complain about that before. But after reading the link it actually makes sense :) Thank you.
  • bobDevil
    bobDevil almost 14 years
    Um, why should he always use the prefix form? If there's no assignment going on, they are identical, and the article you linked to even says that postfix form is more common.
  • Ben Zotto
    Ben Zotto almost 14 years
    Why should one always use the prefix form? In this instance, it's semantically identical.
  • Nick Lewis
    Nick Lewis almost 14 years
    The postfix form can potentially create an unnecessary copy of the object, although if the value is never being used, the compiler will probably optimize it to the prefix form anyway.
  • Bob
    Bob almost 14 years
    noted that the loops are not the same, forgot to put the "=" sign in one of them
  • Bob
    Bob almost 14 years
    so what you're saying is it's not faster to count down, it's just faster to compare to zero than to any other value. Meaning counting from 10 to 100 and counting down from 100 to 10 would be the same?
  • Betamoo
    Betamoo almost 14 years
    Yes.. it is not a matter of "counting down or up".. but a matter of "comparing to what"..
  • Bob
    Bob almost 14 years
    wouldn't that be the zero flag and not the carry flag?
  • Rameez Rami
    Rameez Rami almost 14 years
    It's not "definitely faster". In many cases that size() call could be hoisted out of the loop when counting up, so it would still only get called once. Obviously this is language and compiler dependent (and code dependent; eg. in C++ it won't get hoisted if size() is virtual), but it's far from definite either way.
  • sigfpe
    sigfpe almost 14 years
    @Bob In this case you might want to reach zero, print a result, decrement further, and then find you've gone one below zero causing a carry (or borrow). But written slightly differently a decrementing loop might use the zero flag instead.
  • Mike Dunlavey
    Mike Dunlavey almost 14 years
    ++ And besides, the putchar takes many orders of magnitude longer than the loop overhead.
  • Mark Ransom
    Mark Ransom almost 14 years
    I've seen Microsoft's C++ compiler from a few years back make that optimization. It's able to see that the loop index isn't used, so it rearranges it to the fastest form.
  • Paul Nathan
    Paul Nathan almost 14 years
    Just to be perfectly pedantic, not all modern hardware is pipelined. Embedded processors will have much more relevance to this sort of microoptimization.
  • Paul Nathan
    Paul Nathan almost 14 years
    It's not strictly mythology: if he is doing some sort of uber-optimized real-time system, it would come in handy. But that sort of hacker would probably know all this already and certainly wouldn't be confusing entry-level CS students with arcana.
  • sigfpe
    sigfpe almost 14 years
    @Paul As I have some experience with Atmel AVRs I didn't forget to mention microcontrollers...
  • dthorpe
    dthorpe almost 14 years
    @Mark: The Delphi compiler as well, starting in 1996.
  • Jay Conrod
    Jay Conrod almost 14 years
    +1 Also, if this were profitable, it's likely the compiler will optimize it anyway, so the programmer is free to write the loop for readability.
  • Lawrence Dol
    Lawrence Dol almost 14 years
    To be clear and efficient you should be in the habit of at least for(int i=0, siz=myCollection.size(); i<siz; i++).
  • Lawrence Dol
    Lawrence Dol almost 14 years
    @Peter: Only if the compiler knows for certain that size() is idempotent across the loop. That's probably nearly always not the case, unless the loop is very simple.
  • Lawrence Dol
    Lawrence Dol almost 14 years
    Huh?!! Your failing example is really counter-intuitive, which is to say, a straw-man argument - no one would ever write this. One would write for (int i=0; i < 999; i++) {.
  • vpalmu
    vpalmu almost 14 years
    @Jay, sorry, it wouldn't. The optimization tends to change program behavior detectably and the compiler can't perform the proof of correctness that this optimization is safe.
  • nico
    nico almost 14 years
    so I guess the same holds for counting from -10 to 0 being "faster" (on some ancient system) than counting from 0 to -10, right?
  • Ivan_Bereziuk
    Ivan_Bereziuk almost 14 years
    While this is true at the assembler level, two things combine to make it untrue in reality: modern hardware, with long pipelines and speculative execution, will sneak in the "Sub i and N" without incurring an extra cycle, and even the crudest compiler will optimise the "Sub i and N" out of existence.
  • Gabriel Ščerbák
    Gabriel Ščerbák almost 14 years
    @Software Monkey imagine n being a result of some computation... e.g. you might want to iterate over some collection and its size is the boundary, but as some side effect, you add new elements to the collection in the loop body.
  • Conspicuous Compiler
    Conspicuous Compiler almost 14 years
    The reason it's slower is that the prefix operator doesn't need to store a temporary. Consider $foo = $i++; Three things happen: $i is stored to a temporary, $i is incremented, and then $foo is assigned that temporary's value. In the case of $i++; a smart compiler could realize the temporary is unnecessary. PHP just doesn't. C++ and Java compilers are smart enough to make this simple optimization.
  • psmears
    psmears almost 14 years
    @Joshua: In what way would this optimization be detectable? As the questioner said, the loop index isn't used in the loop itself, so provided the number of iterations is the same there is no change in behavior. In terms of a proof of correctness, making the variable substitution j=N-i shows that the two loops are equivalent.
  • ts.
    ts. almost 14 years
    and why $i-- is faster than $i++ ?
  • Eight-Bit Guru
    Eight-Bit Guru almost 14 years
    How many iterations of your benchmark did you run? Did you clip outliers and take an average for each result? Was your computer doing anything else during the benchmarks? That ~0.5 difference could just be the result of other CPU activity, or pipeline utilisation, or... or... well, you get the idea.
  • ts.
    ts. almost 14 years
    Yes, here I am giving averages. The benchmark was run on different machines, and the difference is accidental.
  • ts.
    ts. almost 14 years
    @Conspicuous Compiler => you know or you suppose?
  • vpalmu
    vpalmu almost 14 years
    @psmears most of the time where this optimization is useful it's detectable. You're right in this case it's not detectable.
  • Lawrence Dol
    Lawrence Dol almost 14 years
    If that's what you intended to communicate, then that's what your example should illustrate: for(int xa=0; xa<collection.size(); xa++) { collection.add(SomeObject); ... }
  • Gabriel Ščerbák
    Gabriel Ščerbák almost 14 years
    @Software Monkey I wanted to be more general than just talk particularly about collections, because what I am reasoning about has nothing to do with collections
  • Lawrence Dol
    Lawrence Dol almost 14 years
    Yes, but if you are going to reason by example, your examples need to be credible and illustrative of the point.
  • JeremyP
    JeremyP almost 14 years
    Out of force of habit, I always do --i and i++ because when I learned C computers usually had a register predecrement and postincrement, but not vice versa. Thus, *p++ and *--p were faster than *++p and *p-- because the former two could be done in one 68000 machine code instruction.
  • Donal Fellows
    Donal Fellows almost 14 years
    +1 for the Summary. Don't sweat it because on modern hardware it makes virtually no difference. It made virtually no difference 20 years ago either. If you think you have to care, time it both ways, see no clear difference, and go back to writing the code clearly and correctly.
  • Lambda Fairy
    Lambda Fairy over 12 years
    PHP has so much interpretation overhead around it that little things such as incrementing/decrementing shouldn't make any difference whatsoever. I doubt there's any good reason why $i-- should have been any different to $i++.
  • Lambda Fairy
    Lambda Fairy over 12 years
    I tend to read ++i as "increment i", which makes more sense than "i increment" as the postfix form implies. That's probably just me, though.
  • dthorpe
    dthorpe over 12 years
    @MarkRansom Actually, the compiler may be able to implement the loop using count down even if the loop index variable is used, depending on how it is used in the loop. If the loop index variable is used only to index into static arrays (arrays of known size at compile time), the array indexing can be done as ptr + array size - loop index var, which can still be a single instruction in x86. It's pretty wild to be debugging assembler and see the loop counting down but the array indices going up!
  • dthorpe
    dthorpe over 12 years
    @nico Doesn't have to be an ancient system. It just has to be an instruction set where there is a compare-to-zero operation which is in some way faster/better than the equivalent compare to register value. x86 has it in jcxz. x64 still has it. Not ancient. Also, RISC architectures often special-case zero. The DEC Alpha AXP chip, for example, had a "zero register" - read as zero, write does nothing. Comparing against the zero register instead of against a general register that contains a zero value reduces inter-instruction dependencies and helps out-of-order execution.
  • Jeroen Wiert Pluimers
    Jeroen Wiert Pluimers about 12 years
    @dthorpe yeah, that was great fun seeing that happen in the CPU view of the Delphi debugger :)
  • the swine
    the swine over 11 years
    Well, if you're using C++, it is always preferable to use ++i and --i, because in case i happens to be a non-integral type (an iterator), the code for pre-increment and pre-decrement is simpler, as it doesn't require making a copy of the object as in the case of post-increment or post-decrement. It should almost always optimize away, but still ...
  • vol7ron
    vol7ron about 11 years
    Segue... until a year or so ago, this was still important in JavaScript. Reverse while loops were significantly faster than incremental loops. Though JS engines have since been re-optimized, you never know when it might be undone in the future. So it's nice to know, but certainly not in a C class, nor at intro level.
  • Artur
    Artur about 11 years
    @Betamoo: I often wonder why better / more correct answers (like yours) are not appreciated with more votes, and have come to the conclusion that too often on stackoverflow votes are influenced by the reputation (in points) of the person answering (which is very, very bad) and not by the correctness of the answer.
  • Danubian Sailor
    Danubian Sailor almost 11 years
    Both constructs are exactly as easy to understand. Some people claim that if you have 3 or 4 repetitions, it's better to copy the instruction than to make a loop, because for them it's easier to understand.
  • Danubian Sailor
    Danubian Sailor almost 11 years
    I don't know if I should upvote for the body or downvote for the summary.
  • fuz
    fuz almost 11 years
    Actually today your compiler probably won't use the loop and jecxz instructions as they're slower than a dec / jnz pair.
  • dthorpe
    dthorpe almost 11 years
    @FUZxxl All the more reason not to write your loop in strange ways. Write human readable clear code and let the compiler do its job.
  • mabraham
    mabraham about 10 years
    I think this is poor style, because it depends on the reader knowing that the return value of i-- is the old value of i, for the possible value of saving a cycle. That'd only be significant if there were lots of loop iterations, and the cycle was a significant fraction of the length of the iteration, and actually showed up at run time. Next, someone will try for (i=5; --i;) because they've heard that in C++ you might want to avoid creating some temporary when i is a non-trivial type, and now you're in bug land having callously thrown away your opportunity to make wrong code look wrong.
  • Ben Voigt
    Ben Voigt about 9 years
    @Artur: Because this answer isn't better. The supposed "compare-to-zero" instruction either doesn't exist or, in the case of jcxz that dthorpe mentioned, is slower than the alternative. The advantage is real, not because of some supposed special "compare-to-zero" instruction, but because the compare-to-zero flags are generally set for free as a side effect of another instruction. But even worse than that is the absurd claim in this answer that "the second one needs to load N form [sic] Memory each time". A compiler that can't figure out how to enregister a constant is absolute garbage.
  • Cld
    Cld almost 9 years
    Well, it's not good to say it makes no difference; in certain uses it can be very important. I'm talking about low-power embedded solutions, like when your sensor runs on energy harvesting, or needs to work on a battery for 10 years and wakes up for a few milliseconds every hour... But it's clearly a bad thing to say to a student... Making your code clear and easy to read is a lot more important... And on a "normal" computer it changes nothing; in numerous cases it's even counterproductive, because compiler optimizations or the cache system don't work backwards.
  • cmaster - reinstate monica
    cmaster - reinstate monica about 7 years
    Note about the PPC processor: The Branch and Decrement if Not Zero instruction (bdnz) was the only loop instruction within the loop. Since it also executed on the branch unit with a single cycle latency, it was always executed parallel to some other instruction within the loop. So, basically, bdnz based loop control was for free. This was a nice micro-optimization compared to the fully data dependent increment-compare-branch upward counting which would have used the integer unit for two cycles, and wasted two registers.
  • Pacerier
    Pacerier over 6 years
    @dthorpe, Does today's Intel still loop faster backwards? Or are they equal now?
  • Pacerier
    Pacerier over 6 years
    Is it possible that there is a diff of 2 instructions instead of only 1?
  • Pacerier
    Pacerier over 6 years
    Also, why is it hard to be sure of that? As long as the variable i is not used within the loop, obviously you can flip it over, can't you?
  • Pacerier
    Pacerier over 6 years
    @LawrenceDol, The compiler will definitely know it unless you have dynamic code compilation using exec.
  • Oliver Tušla
    Oliver Tušla over 2 years
    Can you please recommend any literature/blog/video that talks about the direction of memory access and its performance?
  • Matthew K.
    Matthew K. over 2 years
    @OliverTušla It's called cache prefetching en.wikipedia.org/wiki/Cache_prefetching and you should be able to find more info by searching for this term. I don't remember where I first learned of it but my original source would probably be outdated by now anyway. Cache control instructions might also interest you. More generally, you can also try searching for "cache performance" or "memory access optimization", etc. Implementation of cache prefetching is of course hardware dependent.