Is < faster than <=?

Solution 1

No, it will not be faster on most architectures. You didn't specify, but on x86, all of the integral comparisons will typically be implemented in two machine instructions:

  • A test or cmp instruction, which sets EFLAGS
  • And a Jcc (jump) instruction, whose condition depends on the comparison type (and code layout):
      • jne - Jump if not equal --> ZF = 0
      • jz - Jump if zero (equal) --> ZF = 1
      • jg - Jump if greater --> ZF = 0 and SF = OF
      • (etc...)

Example (edited for brevity), compiled with $ gcc -m32 -S -masm=intel test.c:

    if (a < b) {
        // Do something 1
    }

Compiles to:

    mov     eax, DWORD PTR [esp+24]      ; a
    cmp     eax, DWORD PTR [esp+28]      ; b
    jge     .L2                          ; jump if a is >= b
    ; Do something 1
.L2:

And

    if (a <= b) {
        // Do something 2
    }

Compiles to:

    mov     eax, DWORD PTR [esp+24]      ; a
    cmp     eax, DWORD PTR [esp+28]      ; b
    jg      .L5                          ; jump if a is > b
    ; Do something 2
.L5:

So the only difference between the two is a jg versus a jge instruction. The two will take the same amount of time.


I'd like to address the comment that nothing indicates the different jump instructions take the same amount of time. This one is a little tricky to answer, but here's what I can give: in the Intel Instruction Set Reference, they are all grouped together under one common instruction, Jcc (Jump if condition is met). The same grouping appears in the Optimization Reference Manual, in Appendix C, "Latency and Throughput."

Latency — The number of clock cycles that are required for the execution core to complete the execution of all of the μops that form an instruction.

Throughput — The number of clock cycles required to wait before the issue ports are free to accept the same instruction again. For many instructions, the throughput of an instruction can be significantly less than its latency.

The values for Jcc are:

      Latency   Throughput
Jcc     N/A        0.5

with the following footnote on Jcc:

  1. Selection of conditional jump instructions should be based on the recommendation of Section 3.4.1, "Branch Prediction Optimization," to improve the predictability of branches. When branches are predicted successfully, the latency of Jcc is effectively zero.

So, nothing in the Intel docs ever treats one Jcc instruction any differently from the others.

If one thinks about the actual circuitry used to implement the instructions, one can assume that there are simple AND/OR gates on the different bits in EFLAGS to determine whether the conditions are met. There is, then, no reason that an instruction testing two bits should take any more or less time than one testing only one (ignoring gate propagation delay, which is much less than the clock period).
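The gate-level argument can be sketched as boolean logic in C (a model only, not any real circuit; the function and flag names here are my own):

```c
#include <stdint.h>

/* Model the flags that "cmp a, b" derives from a - b, then express the
 * jl and jle conditions as simple gate functions over those flag bits. */
static void cmp_flags(int32_t a, int32_t b, int *zf, int *sf, int *of)
{
    int64_t wide = (int64_t)a - (int64_t)b; /* exact difference */
    int32_t res  = (int32_t)wide;           /* truncated, as the ALU keeps it */
    *zf = (res == 0);                       /* zero flag */
    *sf = (res < 0);                        /* sign flag */
    *of = (wide != res);                    /* signed-overflow flag */
}

/* jl (signed <) is taken when SF != OF: a single XOR of two flag bits. */
static int jl_taken(int zf, int sf, int of)
{
    (void)zf;
    return sf != of;
}

/* jle (signed <=) is taken when ZF = 1 or SF != OF: one extra OR gate. */
static int jle_taken(int zf, int sf, int of)
{
    return zf || (sf != of);
}

static int lt_via_flags(int32_t a, int32_t b)
{
    int zf, sf, of;
    cmp_flags(a, b, &zf, &sf, &of);
    return jl_taken(zf, sf, of);
}

static int le_via_flags(int32_t a, int32_t b)
{
    int zf, sf, of;
    cmp_flags(a, b, &zf, &sf, &of);
    return jle_taken(zf, sf, of);
}
```

Both conditions reduce to a couple of gates over bits that cmp has already computed, which is why neither is inherently slower.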


Edit: Floating Point

This holds true for x87 floating point as well (pretty much the same code as above, but with double instead of int):

        fld     QWORD PTR [esp+32]
        fld     QWORD PTR [esp+40]
        fucomip st, st(1)              ; Compare ST(0) and ST(1), and set CF, PF, ZF in EFLAGS
        fstp    st(0)
        seta    al                     ; Set al if above (CF=0 and ZF=0).
        test    al, al
        je      .L2
        ; Do something 1
.L2:

        fld     QWORD PTR [esp+32]
        fld     QWORD PTR [esp+40]
        fucomip st, st(1)              ; (same thing as above)
        fstp    st(0)
        setae   al                     ; Set al if above or equal (CF=0).
        test    al, al
        je      .L5
        ; Do something 2
.L5:
        leave
        ret

Solution 2

Historically (we're talking the 1980s and early 1990s), there were some architectures in which this was true. The root issue is that integer comparison is inherently implemented via integer subtractions. This gives rise to the following cases.

Comparison     Subtraction
----------     -----------
A < B      --> A - B < 0
A = B      --> A - B = 0
A > B      --> A - B > 0

Now, when A < B the subtraction has to borrow from beyond the high bit for the result to be correct, just as you carry and borrow when adding and subtracting by hand. On these machines the borrow cleared a bit usually referred to as the carry bit, which was testable by a branch instruction. A second bit, called the zero bit, was set if the subtraction result was identically zero, which implied equality.

There were usually at least two conditional branch instructions, one to branch on the carry bit and one on the zero bit.

Now, to get at the heart of the matter, let's expand the previous table to include the carry and zero bit results.

Comparison     Subtraction  Carry Bit  Zero Bit
----------     -----------  ---------  --------
A < B      --> A - B < 0    0          0
A = B      --> A - B = 0    1          1
A > B      --> A - B > 0    1          0

So, implementing a branch for A < B can be done in one instruction, because the carry bit is clear in this case and only in this case. That is,

;; Implementation of "if (A < B) goto address;"
cmp  A, B          ;; compare A to B
bcz  address       ;; Branch if Carry is Zero to the new address

But, if we want to do a less-than-or-equal comparison, we need to do an additional check of the zero flag to catch the case of equality.

;; Implementation of "if (A <= B) goto address;"
cmp A, B           ;; compare A to B
bcz address        ;; branch if A < B
bzs address        ;; also, Branch if the Zero bit is Set

So, on some machines, using a "less than" comparison might save one machine instruction. This was relevant in the era of sub-megahertz processor speeds and 1:1 CPU-to-memory speed ratios, but it is almost totally irrelevant today.
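The two-flag convention described above can be sketched in C (a model of the behavior on machines like the 6502; the names here are my own):

```c
#include <stdint.h>

/* "cmp a, b" computes a - b, leaving the carry bit SET when no borrow
 * occurred (a >= b) and CLEAR when one did (a < b); the zero bit is
 * set when the operands were equal. */
typedef struct { int carry; int zero; } flags_t;

static flags_t cmp8(uint8_t a, uint8_t b)
{
    flags_t f;
    f.carry = (a >= b); /* no borrow needed */
    f.zero  = (a == b);
    return f;
}

/* "if (a < b) goto ..." needs a single test: branch if carry is zero. */
static int branch_lt(flags_t f)
{
    return f.carry == 0;
}

/* "if (a <= b) goto ..." needs two tests: carry zero, or zero bit set. */
static int branch_le(flags_t f)
{
    return f.carry == 0 || f.zero == 1;
}
```

The asymmetry is visible directly: branch_lt inspects one flag, while branch_le has to combine two, which on these machines meant a second branch instruction.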

Solution 3

Assuming we're talking about built-in integer types, there's no possible way one could be faster than the other. They're obviously semantically identical. They both ask the compiler to do precisely the same thing. Only a horribly broken compiler would generate inferior code for one of these.

If there were some platform where < was faster than <= for simple integer types, the compiler should always convert <= to < for constants. Any compiler that didn't would just be a bad compiler (for that platform).
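One caveat to that transformation (my own illustration, echoed in the comments below): rewriting a <= C as a < C+1 changes the constant, and that is only valid when the new constant doesn't overflow:

```c
#include <limits.h>

/* Sketch: "a <= C" -> "a < C + 1" is valid only when C + 1 does not
 * overflow. For C == INT_MAX the comparison is always true, so a
 * compiler folds it away rather than transforming it. */
static int le_direct(int a, int c)
{
    return a <= c;
}

static int le_rewritten(int a, int c)
{
    if (c == INT_MAX)   /* a <= INT_MAX is always true for int a */
        return 1;
    return a < c + 1;
}
```

For every in-range constant the two agree, so the compiler is free to pick whichever form encodes or executes better on the target.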

Solution 4

Neither is faster. The compiler generates the same machine code for each condition, just with a different constant.

if (a < 901)
cmpl  $900, -4(%rbp)
jg .L2

if (a <= 901)
cmpl  $901, -4(%rbp)
jg .L3

My example is from GCC on the x86_64 platform on Linux.

Compiler writers are pretty smart people, and they think of these things and many others most of us take for granted.

I also noticed that when comparing against a variable rather than a constant, the compare instruction is identical in both cases; only the jump condition differs (jge versus jg), and those take the same time.

int b;
if (a < b)
cmpl  -4(%rbp), %eax
jge   .L2

if (a <= b)
cmpl  -4(%rbp), %eax
jg .L3
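For reference, here is a small reconstruction (my own, not necessarily the poster's exact source) that reproduces all four cases; compile it with gcc -S and compare the jump instructions emitted for each function:

```c
/* Each function compiles to a cmpl followed by a conditional whose
 * condition differs only in whether equality is included. */
int lt_const(int a) { return a < 901; }
int le_const(int a) { return a <= 901; }

int lt_var(int a, int b) { return a < b; }
int le_var(int a, int b) { return a <= b; }
```

Inspecting the generated .s file shows the constant cases collapse to the same comparison with an adjusted immediate, while the variable cases differ only in the jump/set condition.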

Solution 5

For floating point code, the <= comparison may indeed be slower (by one instruction) even on modern architectures. Here's the first function:

int compare_strict(double a, double b) { return a < b; }

On PowerPC, first this performs a floating point comparison (which updates cr, the condition register), then moves the condition register to a GPR, shifts the "compared less than" bit into place, and then returns. It takes four instructions.

Now consider this function instead:

int compare_loose(double a, double b) { return a <= b; }

This requires the same work as compare_strict above, but now there are two bits of interest: "was less than" and "was equal to." This requires an extra instruction (cror - condition register bitwise OR) to combine these two bits into one. So compare_loose requires five instructions, while compare_strict requires four.

You might think that the compiler could optimize the second function like so:

int compare_loose(double a, double b) { return ! (a > b); }

However, this handles NaNs incorrectly: NaN1 <= NaN2 and NaN1 > NaN2 must both evaluate to false.
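This can be checked directly (a minimal demonstration; the function names are mine):

```c
#include <math.h>

/* The two forms disagree exactly when a NaN is involved: every ordered
 * comparison against a NaN is false, so a <= b is false while !(a > b)
 * is true. */
static int le(double a, double b)
{
    return a <= b;
}

static int not_gt(double a, double b)
{
    return !(a > b);
}
```

Because the results differ for NaN inputs, a conforming compiler cannot rewrite one as the other, and the extra cror stays.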

Author: Vinícius

Updated on May 12, 2022

Comments

  • Vinícius
    Vinícius about 2 years

    Is if (a < 901) faster than if (a <= 900)?

    Not exactly as in this simple example, but there can be slight performance differences in complex loop code. I suppose this has something to do with the generated machine code, if it's even true.

    • Jason C
      Jason C about 10 years
      I see no reason why this question should be closed (and especially not deleted, as the votes are currently showing) given its historical significance, the quality of the answer, and the fact that the other top questions in performance remain open. At most it should be locked. Also, even if the question itself is misinformed/naive, the fact that it appeared in a book means that the original misinformation exists out there in "credible" sources somewhere, and this question is therefore constructive in that it helps to clear that up.
    • Seng Cheong
      Seng Cheong almost 10 years
      You never did tell us which book you're referring to.
    • Deqing
      Deqing about 8 years
      Typing < is two times faster than typing <=.
    • Rick Henderson
      Rick Henderson almost 8 years
      This is an excellent question and I would be interested in how it works in an interpreted language such as Python. I would consider posting a new question such as "Is > faster than >= in Python?" but that could be considered a duplicate question. Guidance welcome.
    • vpalmu
      vpalmu over 7 years
      It was true on the 8086.
    • m93a
      m93a about 6 years
      The number of upvotes clearly shows that there are hundreds of people who heavily overoptimize.
    • jcazor
      jcazor almost 5 years
      after reading the answers, I would say '<' is never slower than '<='
    • Vinícius
      Vinícius over 4 years
      @JonathonReinhart, there has never been a book. That is a lie I told when I was younger
    • Mikhail T.
      Mikhail T. over 3 years
      All of the answers seem to be about C, whereas the question is tagged C++ -- where there is simply no telling, what < and <= might mean for any given a...
    • Peter Cordes
      Peter Cordes over 2 years
      @Joshua: 8086 has jg and jge (and the equivalent unsigned branch conditions). So no, it's not faster on x86 and never has been, if we're talking about normal compare-and-branch. Assembly - JG/JNLE/JL/JNGE after CMP - those instructions have all existed since 8086. I'd suggest removing your earlier comment that has incorrect information, or clarify what special case you meant. In asm, < allows smaller immediate constants that can save a byte, so GCC transforms for you.
    • vpalmu
      vpalmu over 2 years
      @PeterCordes: For some reason the compiler I learned on would always generate jl +2 jmp ... instead of jge when compiling for x86, but had no problem generating jg ... directly. As you have pointed out, this is pretty dumb and I'm reasonably certain modern compilers don't do this. But you put up with the compilers you had.
    • Peter Cordes
      Peter Cordes over 2 years
      @Joshua: Ok, so it was true for some specific dumb C(?) compiler for 8086. That's believable, but it's important to say that, not that it was a fact of the ISA.
  • Michael Petrotta
    Michael Petrotta over 11 years
    Note that this is specific to x86.
  • Adrian Cornish
    Adrian Cornish over 11 years
    Indeed - I should have said that - but any compiler could be smart enough to generate this code
  • Joseph White
    Joseph White over 11 years
    I think you should use that if(a <=900) to demonstrate that it generates exactly the same asm :)
  • Adrian Cornish
    Adrian Cornish over 11 years
    @Lipis Sorry - I do not understand your comment - could you clarify - I showed the asm generated from both if statements
  • Joseph White
    Joseph White over 11 years
    @AdrianCornish Sorry.. I edited it.. it's more or less the same.. but if you change the second if to <=900 then the asm code will be exactly the same :) It's pretty much the same now.. but you know.. for the OCD :)
  • Adrian Cornish
    Adrian Cornish over 11 years
    Ah I get it - sorry I missed the different value in the OP's original question - my point was that the compiler edited the value in the generated ASM.
  • Qsario
    Qsario over 11 years
    @AdrianCornish Your two statements aren't the same two statements as in the question. One of his has 900, not 901.
  • Adrian Cornish
    Adrian Cornish over 11 years
    @Qsario quite right - I missed that - the point still stands though that the compiler is editing the values
  • Boann
    Boann over 11 years
    What about if (a <= INT_MAX)?
  • Qsario
    Qsario over 11 years
    You're correct, but it would be good to edit in the other original statement as well for completeness :)
  • Joseph White
    Joseph White over 11 years
    @AdrianCornish Yes you're totally right.. and we are on the same page :) I edited it.. hope you don't mind..
  • Qsario
    Qsario over 11 years
    @Boann That might get reduced to if (true) and eliminated completely.
  • Adrian Cornish
    Adrian Cornish over 11 years
    @Qsario I think that muddies the point because in that case both asm statements become cmpl $900, -4(%rbp) so it is a little harder to see the difference. Since I am showing the asm from my code and not the OP's it is not wrong - but highlights the error in the book
  • Vinícius
    Vinícius over 11 years
    please consider the following: typedef int a, typedef int b, a c = 1; b d = 2; if( c < d ) & if( c <= d ) as c and d are different types
  • Vinícius
    Vinícius over 11 years
    I wanted to see the ASM generated code for it. To be honest there are a lot more of examples I would like to see ASM generated code, specially about chars
  • Adrian Cornish
    Adrian Cornish over 11 years
    @ViniyoShouta Try it for yourself - g++ --save-temps myfile.cc will give you the a .s file so you can read the asm for yourself :-)
  • Adrian Cornish
    Adrian Cornish over 11 years
    @Lipis A fair edit but I am glad you changed it back as I think it highlights the difference better. I get the OCD - that's why we are programmers :-)
  • Seng Cheong
    Seng Cheong over 11 years
    No one has pointed out that this optimization only applies to constant comparisons. I can guarantee it will not be done like this for comparing two variables.
  • Adrian Cornish
    Adrian Cornish over 11 years
    @JonathonReinhart Totally agree - but the OP's question was with constants. But I see the asm generated is the same - except that the LHS is moved to a register cmpl -4(%rbp), %eax
  • Seng Cheong
    Seng Cheong over 11 years
    @Dyppl actually jg and jnle are the same instruction, 7F :-)
  • Seng Cheong
    Seng Cheong over 11 years
    @AdrianCornish you're not showing the whole picture. That's just the compare, which sets the flags, which is always the same. You still will have a different Jcc instruction depending on the conditional. See my example.
  • Martin York
    Martin York over 11 years
    If there were a difference in the cycles: 1) it would not be detectable. 2) Any compiler worth its salt would already be making the transformation from the slow form to the faster form without changing the meaning of the code. So the resulting instruction planted would be identical.
  • maksimov
    maksimov over 11 years
    @JonathonReinhart are you sure your example is not the other way around? I.e. isn't < compiled to jg and <= to jge?
  • Ben
    Ben over 11 years
    @maksimov it is probably correct, the asm code for (a < b) ... says: jump if a >= b which is equivalent to do something if a < b.
  • jcolebrand
    jcolebrand over 11 years
    What's the bad practice? Incrementing or decrementing a counter? How do you store index notation then?
  • greyfade
    greyfade over 11 years
    Additionally, architectures like x86 implement instructions like jge, which test both the zero and sign/carry flags.
  • Ethan Reesor
    Ethan Reesor over 11 years
    It should be noted that in the computation loop of a process/function/program, an additional instruction could make a difference. More relevant than the speed is the fact, as @greyfade mentioned, that most modern CISC processors have jump/branch instructions that check both the carry and the zero flags, thus still only using a single instruction.
  • Seng Cheong
    Seng Cheong over 11 years
    Luckily it doesn't work like this on x86 (x87). fucomip sets ZF and CF.
  • JustinDanielson
    JustinDanielson over 11 years
    He means if you're doing comparison of 2 variable types. Of course it's trivial if you're setting the value for a loop or something. But if you have x <= y, and y is unknown, it would be slower to 'optimize' it to x < y + 1
  • Jon Hanna
    Jon Hanna over 11 years
    Even if it is true for a given architecture. What are the odds that none of the compiler writers ever noticed, and added an optimisation to replace the slower with the faster?
  • Admin
    Admin over 11 years
    This is true on the 8080. It has instructions to jump on zero and jump on minus, but none that can test both simultaneously.
  • Lucas
    Lucas over 11 years
    This is also the case on the 6502 and 65816 processor family, which extends to the Motorola 68HC11/12, too.
  • Seng Cheong
    Seng Cheong over 11 years
    @JustinDanielson agreed. Not to mention ugly, confusing, etc.
  • Adrian Cornish
    Adrian Cornish over 11 years
    @JonathonReinhart Good point. Edited to include the jump statements.
  • Lucas
    Lucas over 11 years
    @JonHanna: This is the optimized version. For a loop, the branch-if-equal instruction is only encountered on the last iteration of the loop, so its impact is amortized down to some fraction of a cycle. Inverting the test would require placing an additional instruction into the inner loop which would impact every loop iteration. Also, it may not be possible to reverse the order of comparison, because these were typically accumulator architectures, and spilling the accumulator to memory would have been vastly more expensive than just adding the extra conditional branch instruction.
  • Telgin
    Telgin over 11 years
    Agreed completely, it would be a pretty trivial and silly difference in any case. Certainly nothing to mention in a book that should be platform agnostic.
  • jfs
    jfs over 11 years
    Lucas: @Jon might mean A < (B + 1) optimization if B is a constant.
  • Dietrich Epp
    Dietrich Epp over 11 years
    @JonathonReinhart: I think you're misunderstanding what the PowerPC is doing -- the condition register cr is the equivalent to flags like ZF and CF on the x86. (Although the CR is more flexible.) What the poster is talking about is moving the result to a GPR: which takes two instructions on PowerPC, but x86 has a conditional move instruction.
  • Elazar Leibovich
    Elazar Leibovich over 11 years
    Not to mention that the optimizer can modify the code if indeed one option is faster than the other.
  • Seng Cheong
    Seng Cheong over 11 years
    @DietrichEpp What I meant to add after my statement was: Which you can then immediately jump based upon the value of EFLAGS. Sorry for not being clear.
  • Dietrich Epp
    Dietrich Epp over 11 years
    @JonathonReinhart: Yes, and you can also immediately jump based on the value of the CR. The answer is not talking about jumping, which is where the extra instructions come from.
  • Admin
    Admin over 11 years
    I like this answer because it remind me of the fun I had with a 6502 and how much I missed having to think about the flags once I moved to C. It also demonstrates that the question is deeper and more interesting than most people have given it credit.
  • Martin York
    Martin York over 11 years
    @lttlrck: I get it. Took me a while (silly me). No, they are not detectable because there are so many other things happening that make their measurement impossible. Processor stalls/cache misses/signals/process swapping. Thus in a normal OS situation, things on the single-cycle level cannot be physically measured. If you can eliminate all that interference from your measurements (run it on a chip with on-board memory and no OS) then you still have the granularity of your timers to worry about, but theoretically if you run it long enough you could see something.
  • Gunther Piez
    Gunther Piez over 11 years
    Even on the 8080 a <= test can be implemented in one instruction with swapping the operands and testing for not < (equivalent to >=) This is the desired <= with swapped operands: cmp B,A; bcs addr. That's the reasoning this test was omitted by Intel, they considered it redundant and you couldn't afford redundant instructions at those times :-)
  • hippietrail
    hippietrail over 11 years
    I'm pretty sure some of these architectures are still in embedded use so even if they were born in the '80s they didn't necessarily die there.
  • Tyler Crompton
    Tyler Crompton over 11 years
    I somewhat disagree. In competitive programming, scripting languages often offer the quickest solution to a problem, but correct techniques (read: optimization) must be applied to get a correct solution.
  • jontejj
    jontejj almost 11 years
    just because something results in the same amount of instructions doesn't necessarily mean that the sum total time of executing all those instructions will be the same. Actually more instructions could be executed faster. Instructions per cycle is not a fixed number, it varies depending on the instructions.
  • Seng Cheong
    Seng Cheong almost 11 years
    @jontejj I'm very much aware of that. Did you even read my answer? I didn't state anything about the same number of instructions, I stated that they are compiled to essentially the exact same instructions, except one jump instruction is looking at one flag, and the other jump instruction is looking at two flags. I believe I've given more than adequate evidence to show that they are semantically identical.
  • jontejj
    jontejj almost 11 years
    Yeah, saw that now. I still think your first sentence leads someone to draw that conclusion for the wrong reasons. "You didn't specify, but on x86, all of the integral comparisons will be typically implemented in two machine instructions" this is actually not the main point that you should be making yet it's the first one you make. One would have to read your edited part furthest down to understand why. Otherwise your answer is top-notch!
  • jontejj
    jontejj almost 11 years
    "If one thinks about the actual circuitry used to implement the instructions, one can assume that there would be simple AND/OR gates on the different bits in EFLAGS, to determine whether the conditions are met. There is then, no reason that an instruction testing two bits should take any more or less time than one testing only one (Ignoring gate propagation delay, which is much less than the clock period.)" I think this should be your main point.
  • Seng Cheong
    Seng Cheong almost 11 years
    @jontejj You make a very good point. For as much visibility as this answer gets, I should probably give it a little bit of a cleanup. Thanks for the feedback.
  • Seng Cheong
    Seng Cheong almost 11 years
    @hirschhornsalz You are absolutely correct. I'm not convinced that there is any architecture and scenario where this double test would be required.
  • Lucas
    Lucas almost 11 years
    @JonathonReinhart you're basically right. Even in the 80's a peephole optimizer would invert the comparison or reorder the if/else code branches to eliminate the additional test. But a naive compiler or inexperienced assembly language programmer might still produce such output.
  • autistic
    autistic almost 11 years
    +1 I agree. Neither < nor <= have speed until the compiler decides which speed they'll have. This is a very simple optimisation for compilers when you consider that they generally already perform dead code optimisation, tail call optimisation, loop hoisting (and unrolling, on occasions), automatic parallelisation of various loops, etc... Why waste time pondering premature optimisations? Get a prototype running, profile it to determine where the most significant optimisations lie, perform those optimisations in order of significance and profile again along the way to measure progress...
  • Konrad Borowski
    Konrad Borowski over 10 years
    @Lucas: Actually, that's not true for 6502 (and 65816). 6502 has two branch comparison instructions of interest in this case: BCC and BCS. BCC works like >=, and BCS works like <. For example, LDA $01 : CMP $02 : BCS label implements <. If you need <=, you can simply swap arguments - LDA $02 : CMP $01 : BCC label
  • Lucas
    Lucas over 10 years
    @GlitchMr: Yes, the 6502 has a negated form of the carry flag test, but the point I was trying to make is the need for two separate instruction to test the carry flag (BCC/BCS) and the zero flag (BEQ/BNE) since the 6502 has no instruction for testing multiple values of the P register simultaneously. Having the BCC/BCS pair just makes it trivial to invert the comparison without having to change the value in the accumulator.
  • Brett Hale
    Brett Hale over 9 years
    I'd just add that cmp sets the FLAGS register "in the same manner as the sub instruction". In fact, "The comparison is performed by subtracting the second operand from the first operand" - so carry / borrow propagation is involved. i.e., it is not a simple bit-wise operation in terms of hardware 'parallelism'.
  • Seng Cheong
    Seng Cheong over 9 years
    @Brett indeed, but the Jcc instruction tests bits which are already set. Your points are valid, but I don't see how it really applies to the question at hand.
  • Abhishek Singh
    Abhishek Singh about 9 years
    Why is !(a>b) an optimized version of a <= b? Isn't !(a>b) two operations in one?
  • Pavel Gatnar
    Pavel Gatnar about 9 years
    @AbhishekSingh The NOT is just done with a different jump instruction (je vs. jne)
  • alexia
    alexia almost 9 years
    "This holds true for x87 floating point as well" Is that some new architecture I've never heard of? ;)
  • Peter Cordes
    Peter Cordes almost 9 years
    @JonathonReinhart: In x86, some instructions set some flags, but leave others unchanged (e.g. inc/dec). Current out-of-order-execution CPUs rename flag bits separately, so inc doesn't have an input dependency on the previous value of the flags. A jcc that depends on multiple flags set by more than one instruction requires an extra uop to merge the flags (or in earlier Intel designs, causes a partial-flags stall.) So every jcc is the same internally, but their different dependencies can be an issue. Things used to be worse before flag-renaming improved.
  • Peter Cordes
    Peter Cordes almost 9 years
    @JonathonReinhart: also, see agner.org/optimize for more detailed info than you get from Intel's own manuals.
  • BeeOnRope
    BeeOnRope almost 8 years
    There are still some edge cases where a comparison having one constant value could be slower under <=, e.g., when the transformation from (a < C) to (a <= C-1) (for some constant C) causes C to be more difficult to encode in the instruction set. For example, an instruction set may be able to represent signed constants from -127 to 128 in a compact form in comparisons, but constants outside that range have to loaded using either a longer, slower encoding, or another instruction entirely. So a comparison like (a < -127) may not have a straightforward transformation.
  • BeeOnRope
    BeeOnRope almost 8 years
    Sure, in a "debug" build, but it would take a bad compiler for (a >= 1) to run slower than (a > 0), since the former can be trivially transformed to the latter by the optimizer..
  • David Schwartz
    David Schwartz almost 8 years
    @BeeOnRope The issue was not whether performing operations that differed due to having different constants in them could affect performance but whether expressing the same operation using different constants could affect performance. So we're not comparing a > 127 to a > 128 because you have no choice there, you use the one you need. We're comparing a > 127 to a >= 128, which can't require different encoding or different instructions because they have the same truth table. Any encoding of one is equally an encoding of the other.
  • glglgl
    glglgl almost 8 years
    @BeeOnRope Sometimes I am surprised what complicated things an optimizer can optimize and on what easy things it fails to do so.
  • BeeOnRope
    BeeOnRope almost 8 years
    Indeed, and it's always worth checking the asm output for the very few functions where it would matter. That said, the above transformation is very basic and has been performed even in simple compilers for decades.
  • BeeOnRope
    BeeOnRope almost 8 years
    I was responding in a general way to your statement that "If there was some platform where [<= was slower] the compiler should always convert <= to < for constants". As far as I know, that transformation involves changing the constant. E.g., a <= 42 is compiled as a < 43 because < is faster. In some edge cases, such a transformation wouldn't be fruitful because the new constant may require more or slower instructions. Of course a > 127 and a >= 128 are equivalent and a compiler should encode both forms in the (same) fastest way, but that's not inconsistent with what I said.
  • Peter Cordes
    Peter Cordes over 7 years
    Forgot to mention this last time, but not every JCC is the same. Some can macro-fuse with an immediately-preceding CMP or TEST instruction on Core2 and Nehalem. (And on Intel Sandybridge-family, with many different ALU instructions.) AMD CPUs that can macro-fuse at all (Bulldozer-family) can do it for any JCC, even the weird ones like JP that Intel never macro-fuses with anything.
  • Seng Cheong
    Seng Cheong about 7 years
    @PeterCordes Since I wrote this answer, I've taken a graduate level computer architecture class and gained a lot more understanding of the intricacies of pipelining and register renaming, etc. I'm still quite convinced that my answer (essentially just "No.") is correct, but I'm not quite sure what to add to my answer to make it correct from the standpoint of a modern out-of-order superscalar CPU. Perhaps the simple answer is "regardless of the underlying machinery, the hardware is capable of looking at multiple condition flags simultaneously". Any thoughts?
  • Peter Cordes
    Peter Cordes about 7 years
    Yeah, it's not the actual testing of multiple bits in EFLAGS that's ever the problem on x86. It's partial-flags renaming, since not all instructions write every flag, but CPUs try to avoid false dependencies by renaming different parts of EFLAGS separately. (This isn't an issue for < vs. <=). On Intel pre-Haswell, reading a flag that was left unmodified by the previous flag-writing instruction is slow. (Much slower on pre-Sandybridge, like you can see in this question: stackoverflow.com/questions/32084204/…)
  • Peter Cordes
    Peter Cordes about 7 years
    Anyway, my comments were only trying to correct the over-generalization that all JCCs are equal. They aren't, because some can macro-fuse and some can't, even when used after an instruction like CMP that writes all flags (avoiding any partial-flag renaming stalls or slowdowns).
  • supercat
    supercat over 6 years
    @hirschhornsalz: Inverting the operands is a standard technique on the 8080, but various factors may make it better to evaluate a particular operand first. For example, given static unsigned char x;, the expression x < 20 could be evaluated as ld a,(x) / cmp 20 / jnc nope but reversing the operands of x > 20 would require something like ld a,20 / ld hl,x / cmp (hl) / jnc nope. Better would be to keep the order but substitute x <= 21: ld a,(x) / cmp 21 / jc nope.
  • Peter Cordes
    Peter Cordes over 5 years
    Like supercat said, smart compilers can and do compile C++ comparisons into efficient asm using various tricks. If either operand is a compile-time constant, it can make asm that checks x < 21 instead of x <= 20. Or on x86, maybe compilers choose to make constants smaller magnitude, so they'll fit in a signed 8-bit immediate instead of a 32-bit immediate. e.g. x <= 127 instead of x < 128. But if both are runtime variables, for( ... ; i < size ;) is guaranteed not to be an infinite loop, but i <= size might be (for unsigned)! This can defeat optimizations.
  • Peter Cordes
    Peter Cordes over 5 years
    BTW, gcc reduces the magnitude of immediates when it can, because for example on x86 an immediate from -128 .. 127 only needs 1 byte instead of 4. (There's no harm in just always applying the transformation for compile-time constants, except maybe on ARM where having all the set bits closer together is more likely to make it encodeable as an immediate... Would be interesting to try there with x < 0x00f000 and see if it turned into x <= 0x00efff)
  • rustyx
    rustyx over 5 years
    Nice contrived example. What about your other comment about a potential effect on out of order execution due to EFLAGS use? Is it purely theoretical or can it actually happen that a JB leads to a better pipeline than a JBE?
  • Peter Cordes
    Peter Cordes over 5 years
    @rustyx: did I comment that somewhere under another answer? Compilers aren't going to emit code that causes partial-flag stalls, and certainly not for a C < or <=. But sure, test ecx,ecx / bt eax, 3 / jbe will jump if ZF is set (ecx==0), or if CF is set (bit 3 of EAX==1), causing a partial flag stall on most CPUs because the flags it reads don't all come from the last instruction to write any flags. On Sandybridge-family, it doesn't actually stall, just needs to insert a merging uop. cmp/test write all flags, but bt leaves ZF unmodified. felixcloutier.com/x86/bt
  • Sam
    Sam almost 4 years
    I came here looking for a simple answer and found something that changed my life haha.
  • Don Hatch
    Don Hatch over 3 years
    In IEEE floating point, a<=b does not mean the same thing as !(a>b).
  • Nate Eldredge
    Nate Eldredge over 3 years
    As I understand it, the available immediates for cmp on AArch64 are simpler than your answer makes it sound: it takes a 12-bit immediate optionally shifted by 12 bits, so you can have 0xYYY or 0xYYY000, and you can also effectively negate the immediate by using cmn instead. This still supports your point, as cmp w0, #0xf000 is encodeable and cmp w0, #0xefff is not. But the "rotate into any position" phrasing sounds more like a description of the "bitmask" immediates, which AFAIK are only available for bitwise logical instructions: and, or, eor, etc.
  • Peter Cordes
    Peter Cordes over 3 years
    @NateEldredge: I think my description fits perfectly for ARM mode, where it's an 8-bit field rotated by a multiple of 2. (so 0x1fe isn't encodeable but 0xff0 is.) When I wrote this I hadn't understood the differences between AArch64 and ARM immediates, or that only bitwise boolean insns could use the bit-range / repeated bit-pattern encoding. (And mov; or with the zero reg is one way to take advantage of those encodings.)
  • Nate Eldredge
    Nate Eldredge over 3 years
    It's not that simple, though. For instance, if a is in a register and b is a compile-time constant, then x86 can compute a-b in one instruction (sub rax, 12345 or cmp) but not b-a. There is an instruction for reg - imm but not the other way around. Many other machines have a similar situation.
  • Peter Cordes
    Peter Cordes over 2 years
    Further detail on partial FLAGS stuff, and the mechanism on Skylake involving never merging them, just having the two parts (CF and the SPAZO group) as separate inputs to instructions that need both. BeeOnRope explains in more detail in What is a Partial Flag Stall? This is fine for JCC, but a problem for cmovbe / cmova which are 2 uops, unlike other cmov instructions that are single uop.
  • benrg
    benrg about 2 years
    This whole answer is wrong. For example, INT_MAX > INT_MIN, but INT_MIN - INT_MAX is 1 (at the machine-code level, not in C), whose msb is 0. Doing these comparisons correctly is harder than you think.
  • Puddle
    Puddle about 2 years
    @benrg Obviously it wouldn't be as simple to implement in code as the actual circuit with access to the carry-out/overflow. This answer was to explain the logic of the different comparisons. (it's all about the difference. it's sign) It'd obviously be handled in the ALU. (for different types like floats also) See this. So the answer isn't wrong. What you're bringing up is the limitations of integers. If we put INT_MIN in a long it'd store ffffffff80000000. And doing the subtraction would give ffffffff00000001. (a negative)