Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly?

c++ performance assembly optimization x86

164,567

Solution 1

If you think a 64-bit DIV instruction is a good way to divide by two, then no wonder the compiler's asm output beat your hand-written code, even with -O0 (compile fast, no extra optimization, and store/reload to memory after/before every C statement so a debugger can modify variables).

See Agner Fog's Optimizing Assembly guide to learn how to write efficient asm. He also has instruction tables and a microarch guide for specific details for specific CPUs. See also the x86 tag wiki for more perf links.

See also this more general question about beating the compiler with hand-written asm: Is inline assembly language slower than native C++ code?. TL:DR: yes if you do it wrong (like this question).

Usually you're fine letting the compiler do its thing, especially if you try to write C++ that can compile efficiently. Also see is assembly faster than compiled languages?. One of the answers links to these neat slides showing how various C compilers optimize some really simple functions with cool tricks. Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” is in a similar vein.

even:
    mov rbx, 2
    xor rdx, rdx
    div rbx

On Intel Haswell, div r64 is 36 uops, with a latency of 32-96 cycles, and a throughput of one per 21-74 cycles. (Plus the 2 uops to set up RBX and zero RDX, but out-of-order execution can run those early). High-uop-count instructions like DIV are microcoded, which can also cause front-end bottlenecks. In this case, latency is the most relevant factor because it's part of a loop-carried dependency chain.

shr rax, 1 does the same unsigned division: It's 1 uop, with 1c latency, and can run 2 per clock cycle.

For comparison, 32-bit division is faster, but still horrible vs. shifts. idiv r32 is 9 uops, 22-29c latency, and one per 8-11c throughput on Haswell.

As you can see from looking at gcc's -O0 asm output (Godbolt compiler explorer), it only uses shifts instructions. clang -O0 does compile naively like you thought, even using 64-bit IDIV twice. (When optimizing, compilers do use both outputs of IDIV when the source does a division and modulus with the same operands, if they use IDIV at all)

GCC doesn't have a totally-naive mode; it always transforms through GIMPLE, which means some "optimizations" can't be disabled. This includes recognizing division-by-constant and using shifts (power of 2) or a fixed-point multiplicative inverse (non power of 2) to avoid IDIV (see div_by_13 in the above godbolt link).

gcc -Os (optimize for size) does use IDIV for non-power-of-2 division, unfortunately even in cases where the multiplicative inverse code is only slightly larger but much faster.

Helping the compiler

(summary for this case: use uint64_t n)

First of all, it's only interesting to look at optimized compiler output. (-O3).
-O0 speed is basically meaningless.

Look at your asm output (on Godbolt, or see How to remove "noise" from GCC/clang assembly output?). When the compiler doesn't make optimal code in the first place: Writing your C/C++ source in a way that guides the compiler into making better code is usually the best approach. You have to know asm, and know what's efficient, but you apply this knowledge indirectly. Compilers are also a good source of ideas: sometimes clang will do something cool, and you can hand-hold gcc into doing the same thing: see this answer and what I did with the non-unrolled loop in @Veedrac's code below.)

This approach is portable, and in 20 years some future compiler can compile it to whatever is efficient on future hardware (x86 or not), maybe using new ISA extension or auto-vectorizing. Hand-written x86-64 asm from 15 years ago would usually not be optimally tuned for Skylake. e.g. compare&branch macro-fusion didn't exist back then. What's optimal now for hand-crafted asm for one microarchitecture might not be optimal for other current and future CPUs. Comments on @johnfound's answer discuss major differences between AMD Bulldozer and Intel Haswell, which have a big effect on this code. But in theory, g++ -O3 -march=bdver3 and g++ -O3 -march=skylake will do the right thing. (Or -march=native.) Or -mtune=... to just tune, without using instructions that other CPUs might not support.

My feeling is that guiding the compiler to asm that's good for a current CPU you care about shouldn't be a problem for future compilers. They're hopefully better than current compilers at finding ways to transform code, and can find a way that works for future CPUs. Regardless, future x86 probably won't be terrible at anything that's good on current x86, and the future compiler will avoid any asm-specific pitfalls while implementing something like the data movement from your C source, if it doesn't see something better.

Hand-written asm is a black-box for the optimizer, so constant-propagation doesn't work when inlining makes an input a compile-time constant. Other optimizations are also affected. Read https://gcc.gnu.org/wiki/DontUseInlineAsm before using asm. (And avoid MSVC-style inline asm: inputs/outputs have to go through memory which adds overhead.)

In this case: your n has a signed type, and gcc uses the SAR/SHR/ADD sequence that gives the correct rounding. (IDIV and arithmetic-shift "round" differently for negative inputs, see the SAR insn set ref manual entry). (IDK if gcc tried and failed to prove that n can't be negative, or what. Signed-overflow is undefined behaviour, so it should have been able to.)

You should have used uint64_t n, so it can just SHR. And so it's portable to systems where long is only 32-bit (e.g. x86-64 Windows).

BTW, gcc's optimized asm output looks pretty good (using unsigned long n): the inner loop it inlines into main() does this:

 # from gcc5.4 -O3  plus my comments

 # edx= count=1
 # rax= uint64_t n

.L9:                   # do{
    lea    rcx, [rax+1+rax*2]   # rcx = 3*n + 1
    mov    rdi, rax
    shr    rdi         # rdi = n>>1;
    test   al, 1       # set flags based on n%2 (aka n&1)
    mov    rax, rcx
    cmove  rax, rdi    # n= (n%2) ? 3*n+1 : n/2;
    add    edx, 1      # ++count;
    cmp    rax, 1
    jne   .L9          #}while(n!=1)

  cmp/branch to update max and maxi, and then do the next n

The inner loop is branchless, and the critical path of the loop-carried dependency chain is:

3-component LEA (3 cycles)
cmov (2 cycles on Haswell, 1c on Broadwell or later).

Total: 5 cycle per iteration, latency bottleneck. Out-of-order execution takes care of everything else in parallel with this (in theory: I haven't tested with perf counters to see if it really runs at 5c/iter).

The FLAGS input of cmov (produced by TEST) is faster to produce than the RAX input (from LEA->MOV), so it's not on the critical path.

Similarly, the MOV->SHR that produces CMOV's RDI input is off the critical path, because it's also faster than the LEA. MOV on IvyBridge and later has zero latency (handled at register-rename time). (It still takes a uop, and a slot in the pipeline, so it's not free, just zero latency). The extra MOV in the LEA dep chain is part of the bottleneck on other CPUs.

The cmp/jne is also not part of the critical path: it's not loop-carried, because control dependencies are handled with branch prediction + speculative execution, unlike data dependencies on the critical path.

Beating the compiler

GCC did a pretty good job here. It could save one code byte by using inc edx instead of add edx, 1, because nobody cares about P4 and its false-dependencies for partial-flag-modifying instructions.

It could also save all the MOV instructions, and the TEST: SHR sets CF= the bit shifted out, so we can use cmovc instead of test / cmovz.

 ### Hand-optimized version of what gcc does
.L9:                       #do{
    lea     rcx, [rax+1+rax*2] # rcx = 3*n + 1
    shr     rax, 1         # n>>=1;    CF = n&1 = n%2
    cmovc   rax, rcx       # n= (n&1) ? 3*n+1 : n/2;
    inc     edx            # ++count;
    cmp     rax, 1
    jne     .L9            #}while(n!=1)

See @johnfound's answer for another clever trick: remove the CMP by branching on SHR's flag result as well as using it for CMOV: zero only if n was 1 (or 0) to start with. (Fun fact: SHR with count != 1 on Nehalem or earlier causes a stall if you read the flag results. That's how they made it single-uop. The shift-by-1 special encoding is fine, though.)

Avoiding MOV doesn't help with the latency at all on Haswell (Can x86's MOV really be "free"? Why can't I reproduce this at all?). It does help significantly on CPUs like Intel pre-IvB, and AMD Bulldozer-family, where MOV is not zero-latency (and Ice Lake with updated microcode). The compiler's wasted MOV instructions do affect the critical path. BD's complex-LEA and CMOV are both lower latency (2c and 1c respectively), so it's a bigger fraction of the latency. Also, throughput bottlenecks become an issue, because it only has two integer ALU pipes. See @johnfound's answer, where he has timing results from an AMD CPU.

Even on Haswell, this version may help a bit by avoiding some occasional delays where a non-critical uop steals an execution port from one on the critical path, delaying execution by 1 cycle. (This is called a resource conflict). It also saves a register, which may help when doing multiple n values in parallel in an interleaved loop (see below).

LEA's latency depends on the addressing mode, on Intel SnB-family CPUs. 3c for 3 components ([base+idx+const], which takes two separate adds), but only 1c with 2 or fewer components (one add). Some CPUs (like Core2) do even a 3-component LEA in a single cycle, but SnB-family doesn't. Worse, Intel SnB-family standardizes latencies so there are no 2c uops, otherwise 3-component LEA would be only 2c like Bulldozer. (3-component LEA is slower on AMD as well, just not by as much).

So lea rcx, [rax + rax*2] / inc rcx is only 2c latency, faster than lea rcx, [rax + rax*2 + 1], on Intel SnB-family CPUs like Haswell. Break-even on BD, and worse on Core2. It does cost an extra uop, which normally isn't worth it to save 1c latency, but latency is the major bottleneck here and Haswell has a wide enough pipeline to handle the extra uop throughput.

Neither gcc, icc, nor clang (on godbolt) used SHR's CF output, always using an AND or TEST. Silly compilers. :P They're great pieces of complex machinery, but a clever human can often beat them on small-scale problems. (Given thousands to millions of times longer to think about it, of course! Compilers don't use exhaustive algorithms to search for every possible way to do things, because that would take too long when optimizing a lot of inlined code, which is what they do best. They also don't model the pipeline in the target microarchitecture, at least not in the same detail as IACA or other static-analysis tools; they just use some heuristics.)

Simple loop unrolling won't help; this loop bottlenecks on the latency of a loop-carried dependency chain, not on loop overhead / throughput. This means it would do well with hyperthreading (or any other kind of SMT), since the CPU has lots of time to interleave instructions from two threads. This would mean parallelizing the loop in main, but that's fine because each thread can just check a range of n values and produce a pair of integers as a result.

Interleaving by hand within a single thread might be viable, too. Maybe compute the sequence for a pair of numbers in parallel, since each one only takes a couple registers, and they can all update the same max / maxi. This creates more instruction-level parallelism.

The trick is deciding whether to wait until all the n values have reached 1 before getting another pair of starting n values, or whether to break out and get a new start point for just one that reached the end condition, without touching the registers for the other sequence. Probably it's best to keep each chain working on useful data, otherwise you'd have to conditionally increment its counter.

You could maybe even do this with SSE packed-compare stuff to conditionally increment the counter for vector elements where n hadn't reached 1 yet. And then to hide the even longer latency of a SIMD conditional-increment implementation, you'd need to keep more vectors of n values up in the air. Maybe only worth with 256b vector (4x uint64_t).

I think the best strategy to make detection of a 1 "sticky" is to mask the vector of all-ones that you add to increment the counter. So after you've seen a 1 in an element, the increment-vector will have a zero, and +=0 is a no-op.

Untested idea for manual vectorization

# starting with YMM0 = [ n_d, n_c, n_b, n_a ]  (64-bit elements)
# ymm4 = _mm256_set1_epi64x(1):  increment vector
# ymm5 = all-zeros:  count vector

.inner_loop:
    vpaddq    ymm1, ymm0, xmm0
    vpaddq    ymm1, ymm1, xmm0
    vpaddq    ymm1, ymm1, set1_epi64(1)     # ymm1= 3*n + 1.  Maybe could do this more efficiently?

    vpsllq    ymm3, ymm0, 63                # shift bit 1 to the sign bit

    vpsrlq    ymm0, ymm0, 1                 # n /= 2

    # FP blend between integer insns may cost extra bypass latency, but integer blends don't have 1 bit controlling a whole qword.
    vpblendvpd ymm0, ymm0, ymm1, ymm3       # variable blend controlled by the sign bit of each 64-bit element.  I might have the source operands backwards, I always have to look this up.

    # ymm0 = updated n  in each element.

    vpcmpeqq ymm1, ymm0, set1_epi64(1)
    vpandn   ymm4, ymm1, ymm4         # zero out elements of ymm4 where the compare was true

    vpaddq   ymm5, ymm5, ymm4         # count++ in elements where n has never been == 1

    vptest   ymm4, ymm4
    jnz  .inner_loop
    # Fall through when all the n values have reached 1 at some point, and our increment vector is all-zero

    vextracti128 ymm0, ymm5, 1
    vpmaxq .... crap this doesn't exist
    # Actually just delay doing a horizontal max until the very very end.  But you need some way to record max and maxi.

You can and should implement this with intrinsics instead of hand-written asm.

Algorithmic / implementation improvement:

Besides just implementing the same logic with more efficient asm, look for ways to simplify the logic, or avoid redundant work. e.g. memoize to detect common endings to sequences. Or even better, look at 8 trailing bits at once (gnasher's answer)

@EOF points out that tzcnt (or bsf) could be used to do multiple n/=2 iterations in one step. That's probably better than SIMD vectorizing; no SSE or AVX instruction can do that. It's still compatible with doing multiple scalar ns in parallel in different integer registers, though.

So the loop might look like this:

goto loop_entry;  // C++ structured like the asm, for illustration only
do {
   n = n*3 + 1;
  loop_entry:
   shift = _tzcnt_u64(n);
   n >>= shift;
   count += shift;
} while(n != 1);

This may do significantly fewer iterations, but variable-count shifts are slow on Intel SnB-family CPUs without BMI2. 3 uops, 2c latency. (They have an input dependency on the FLAGS because count=0 means the flags are unmodified. They handle this as a data dependency, and take multiple uops because a uop can only have 2 inputs (pre-HSW/BDW anyway)). This is the kind that people complaining about x86's crazy-CISC design are referring to. It makes x86 CPUs slower than they would be if the ISA was designed from scratch today, even in a mostly-similar way. (i.e. this is part of the "x86 tax" that costs speed / power.) SHRX/SHLX/SARX (BMI2) are a big win (1 uop / 1c latency).

It also puts tzcnt (3c on Haswell and later) on the critical path, so it significantly lengthens the total latency of the loop-carried dependency chain. It does remove any need for a CMOV, or for preparing a register holding n>>1, though. @Veedrac's answer overcomes all this by deferring the tzcnt/shift for multiple iterations, which is highly effective (see below).

We can safely use BSF or TZCNT interchangeably, because n can never be zero at that point. TZCNT's machine-code decodes as BSF on CPUs that don't support BMI1. (Meaningless prefixes are ignored, so REP BSF runs as BSF).

TZCNT performs much better than BSF on AMD CPUs that support it, so it can be a good idea to use REP BSF, even if you don't care about setting ZF if the input is zero rather than the output. Some compilers do this when you use __builtin_ctzll even with -mno-bmi.

They perform the same on Intel CPUs, so just save the byte if that's all that matters. TZCNT on Intel (pre-Skylake) still has a false-dependency on the supposedly write-only output operand, just like BSF, to support the undocumented behaviour that BSF with input = 0 leaves its destination unmodified. So you need to work around that unless optimizing only for Skylake, so there's nothing to gain from the extra REP byte. (Intel often goes above and beyond what the x86 ISA manual requires, to avoid breaking widely-used code that depends on something it shouldn't, or that is retroactively disallowed. e.g. Windows 9x's assumes no speculative prefetching of TLB entries, which was safe when the code was written, before Intel updated the TLB management rules.)

Anyway, LZCNT/TZCNT on Haswell have the same false dep as POPCNT: see this Q&A. This is why in gcc's asm output for @Veedrac's code, you see it breaking the dep chain with xor-zeroing on the register it's about to use as TZCNT's destination when it doesn't use dst=src. Since TZCNT/LZCNT/POPCNT never leave their destination undefined or unmodified, this false dependency on the output on Intel CPUs is a performance bug / limitation. Presumably it's worth some transistors / power to have them behave like other uops that go to the same execution unit. The only perf upside is interaction with another uarch limitation: they can micro-fuse a memory operand with an indexed addressing mode on Haswell, but on Skylake where Intel removed the false dep for LZCNT/TZCNT they "un-laminate" indexed addressing modes while POPCNT can still micro-fuse any addr mode.

Improvements to ideas / code from other answers:

@hidefromkgb's answer has a nice observation that you're guaranteed to be able to do one right shift after a 3n+1. You can compute this more even more efficiently than just leaving out the checks between steps. The asm implementation in that answer is broken, though (it depends on OF, which is undefined after SHRD with a count > 1), and slow: ROR rdi,2 is faster than SHRD rdi,rdi,2, and using two CMOV instructions on the critical path is slower than an extra TEST that can run in parallel.

I put tidied / improved C (which guides the compiler to produce better asm), and tested+working faster asm (in comments below the C) up on Godbolt: see the link in @hidefromkgb's answer. (This answer hit the 30k char limit from the large Godbolt URLs, but shortlinks can rot and were too long for goo.gl anyway.)

Also improved the output-printing to convert to a string and make one write() instead of writing one char at a time. This minimizes impact on timing the whole program with perf stat ./collatz (to record performance counters), and I de-obfuscated some of the non-critical asm.

@Veedrac's code

I got a minor speedup from right-shifting as much as we know needs doing, and checking to continue the loop. From 7.5s for limit=1e8 down to 7.275s, on Core2Duo (Merom), with an unroll factor of 16.

code + comments on Godbolt. Don't use this version with clang; it does something silly with the defer-loop. Using a tmp counter k and then adding it to count later changes what clang does, but that slightly hurts gcc.

See discussion in comments: Veedrac's code is excellent on CPUs with BMI1 (i.e. not Celeron/Pentium)

Solution 2

Claiming that the C++ compiler can produce more optimal code than a competent assembly language programmer is a very bad mistake. And especially in this case. The human always can make the code better than the compiler can, and this particular situation is a good illustration of this claim.

The timing difference you're seeing is because the assembly code in the question is very far from optimal in the inner loops.

(The below code is 32-bit, but can be easily converted to 64-bit)

For example, the sequence function can be optimized to only 5 instructions:

    .seq:
        inc     esi                 ; counter
        lea     edx, [3*eax+1]      ; edx = 3*n+1
        shr     eax, 1              ; eax = n/2
        cmovc   eax, edx            ; if CF eax = edx
        jnz     .seq                ; jmp if n<>1

The whole code looks like:

include "%lib%/freshlib.inc"
@BinaryType console, compact
options.DebugMode = 1
include "%lib%/freshlib.asm"

start:
        InitializeAll
        mov ecx, 999999
        xor edi, edi        ; max
        xor ebx, ebx        ; max i

    .main_loop:

        xor     esi, esi
        mov     eax, ecx

    .seq:
        inc     esi                 ; counter
        lea     edx, [3*eax+1]      ; edx = 3*n+1
        shr     eax, 1              ; eax = n/2
        cmovc   eax, edx            ; if CF eax = edx
        jnz     .seq                ; jmp if n<>1

        cmp     edi, esi
        cmovb   edi, esi
        cmovb   ebx, ecx

        dec     ecx
        jnz     .main_loop

        OutputValue "Max sequence: ", edi, 10, -1
        OutputValue "Max index: ", ebx, 10, -1

        FinalizeAll
        stdcall TerminateAll, 0

In order to compile this code, FreshLib is needed.

In my tests, (1 GHz AMD A4-1200 processor), the above code is approximately four times faster than the C++ code from the question (when compiled with -O0: 430 ms vs. 1900 ms), and more than two times faster (430 ms vs. 830 ms) when the C++ code is compiled with -O3.

The output of both programs is the same: max sequence = 525 on i = 837799.

Solution 3

For more performance: A simple change is observing that after n = 3n+1, n will be even, so you can divide by 2 immediately. And n won't be 1, so you don't need to test for it. So you could save a few if statements and write:

while (n % 2 == 0) n /= 2;
if (n > 1) for (;;) {
    n = (3*n + 1) / 2;
    if (n % 2 == 0) {
        do n /= 2; while (n % 2 == 0);
        if (n == 1) break;
    }
}

Here's a big win: If you look at the lowest 8 bits of n, all the steps until you divided by 2 eight times are completely determined by those eight bits. For example, if the last eight bits are 0x01, that is in binary your number is ???? 0000 0001 then the next steps are:

3n+1 -> ???? 0000 0100
/ 2  -> ???? ?000 0010
/ 2  -> ???? ??00 0001
3n+1 -> ???? ??00 0100
/ 2  -> ???? ???0 0010
/ 2  -> ???? ???? 0001
3n+1 -> ???? ???? 0100
/ 2  -> ???? ???? ?010
/ 2  -> ???? ???? ??01
3n+1 -> ???? ???? ??00
/ 2  -> ???? ???? ???0
/ 2  -> ???? ???? ????

So all these steps can be predicted, and 256k + 1 is replaced with 81k + 1. Something similar will happen for all combinations. So you can make a loop with a big switch statement:

k = n / 256;
m = n % 256;

switch (m) {
    case 0: n = 1 * k + 0; break;
    case 1: n = 81 * k + 1; break; 
    case 2: n = 81 * k + 1; break; 
    ...
    case 155: n = 729 * k + 425; break;
    ...
}

Run the loop until n ≤ 128, because at that point n could become 1 with fewer than eight divisions by 2, and doing eight or more steps at a time would make you miss the point where you reach 1 for the first time. Then continue the "normal" loop - or have a table prepared that tells you how many more steps are need to reach 1.

PS. I strongly suspect Peter Cordes' suggestion would make it even faster. There will be no conditional branches at all except one, and that one will be predicted correctly except when the loop actually ends. So the code would be something like

static const unsigned int multipliers [256] = { ... }
static const unsigned int adders [256] = { ... }

while (n > 128) {
    size_t lastBits = n % 256;
    n = (n >> 8) * multipliers [lastBits] + adders [lastBits];
}

In practice, you would measure whether processing the last 9, 10, 11, 12 bits of n at a time would be faster. For each bit, the number of entries in the table would double, and I excect a slowdown when the tables don't fit into L1 cache anymore.

PPS. If you need the number of operations: In each iteration we do exactly eight divisions by two, and a variable number of (3n + 1) operations, so an obvious method to count the operations would be another array. But we can actually calculate the number of steps (based on number of iterations of the loop).

We could redefine the problem slightly: Replace n with (3n + 1) / 2 if odd, and replace n with n / 2 if even. Then every iteration will do exactly 8 steps, but you could consider that cheating :-) So assume there were r operations n <- 3n+1 and s operations n <- n/2. The result will be quite exactly n' = n * 3^r / 2^s, because n <- 3n+1 means n <- 3n * (1 + 1/3n). Taking the logarithm we find r = (s + log2 (n' / n)) / log2 (3).

If we do the loop until n ≤ 1,000,000 and have a precomputed table how many iterations are needed from any start point n ≤ 1,000,000 then calculating r as above, rounded to the nearest integer, will give the right result unless s is truly large.

Solution 4

On a rather unrelated note: more performance hacks!

[the first «conjecture» has been finally debunked by @ShreevatsaR; removed]
When traversing the sequence, we can only get 3 possible cases in the 2-neighborhood of the current element N (shown first):
1. [even] [odd]
2. [odd] [even]
3. [even] [even]
To leap past these 2 elements means to compute (N >> 1) + N + 1, ((N << 1) + N + 1) >> 1 and N >> 2, respectively.

Let`s prove that for both cases (1) and (2) it is possible to use the first formula, (N >> 1) + N + 1.

Case (1) is obvious. Case (2) implies (N & 1) == 1, so if we assume (without loss of generality) that N is 2-bit long and its bits are ba from most- to least-significant, then a = 1, and the following holds:
```
(N << 1) + N + 1:     (N >> 1) + N + 1:

        b10                    b1
         b1                     b
       +  1                   + 1
       ----                   ---
       bBb0                   bBb
```
where B = !b. Right-shifting the first result gives us exactly what we want.

Q.E.D.: (N & 1) == 1 ⇒ (N >> 1) + N + 1 == ((N << 1) + N + 1) >> 1.

As proven, we can traverse the sequence 2 elements at a time, using a single ternary operation. Another 2× time reduction.

The resulting algorithm looks like this:

uint64_t sequence(uint64_t size, uint64_t *path) {
    uint64_t n, i, c, maxi = 0, maxc = 0;

    for (n = i = (size - 1) | 1; i > 2; n = i -= 2) {
        c = 2;
        while ((n = ((n & 3)? (n >> 1) + n + 1 : (n >> 2))) > 2)
            c += 2;
        if (n == 2)
            c++;
        if (c > maxc) {
            maxi = i;
            maxc = c;
        }
    }
    *path = maxc;
    return maxi;
}

int main() {
    uint64_t maxi, maxc;

    maxi = sequence(1000000, &maxc);
    printf("%llu, %llu\n", maxi, maxc);
    return 0;
}

Here we compare n > 2 because the process may stop at 2 instead of 1 if the total length of the sequence is odd.

[EDIT:]

Let`s translate this into assembly!

MOV RCX, 1000000;



DEC RCX;
AND RCX, -2;
XOR RAX, RAX;
MOV RBX, RAX;

@main:
  XOR RSI, RSI;
  LEA RDI, [RCX + 1];

  @loop:
    ADD RSI, 2;
    LEA RDX, [RDI + RDI*2 + 2];
    SHR RDX, 1;
    SHRD RDI, RDI, 2;    ror rdi,2   would do the same thing
    CMOVL RDI, RDX;      Note that SHRD leaves OF = undefined with count>1, and this doesn't work on all CPUs.
    CMOVS RDI, RDX;
    CMP RDI, 2;
  JA @loop;

  LEA RDX, [RSI + 1];
  CMOVE RSI, RDX;

  CMP RAX, RSI;
  CMOVB RAX, RSI;
  CMOVB RBX, RCX;

  SUB RCX, 2;
JA @main;



MOV RDI, RCX;
ADD RCX, 10;
PUSH RDI;
PUSH RCX;

@itoa:
  XOR RDX, RDX;
  DIV RCX;
  ADD RDX, '0';
  PUSH RDX;
  TEST RAX, RAX;
JNE @itoa;

  PUSH RCX;
  LEA RAX, [RBX + 1];
  TEST RBX, RBX;
  MOV RBX, RDI;
JNE @itoa;

POP RCX;
INC RDI;
MOV RDX, RDI;

@outp:
  MOV RSI, RSP;
  MOV RAX, RDI;
  SYSCALL;
  POP RAX;
  TEST RAX, RAX;
JNE @outp;

LEA RAX, [RDI + 59];
DEC RDI;
SYSCALL;

Use these commands to compile:

nasm -f elf64 file.asm
ld -o file file.o

See the C and an improved/bugfixed version of the asm by Peter Cordes on Godbolt. (editor's note: Sorry for putting my stuff in your answer, but my answer hit the 30k char limit from Godbolt links + text!)

Solution 5

C++ programs are translated to assembly programs during the generation of machine code from the source code. It would be virtually wrong to say assembly is slower than C++. Moreover, the binary code generated differs from compiler to compiler. So a smart C++ compiler may produce binary code more optimal and efficient than a dumb assembler's code.

However I believe your profiling methodology has certain flaws. The following are general guidelines for profiling:

Make sure your system is in its normal/idle state. Stop all running processes (applications) that you started or that use CPU intensively (or poll over the network).
Your datasize must be greater in size.
Your test must run for something more than 5-10 seconds.
Do not rely on just one sample. Perform your test N times. Collect results and calculate the mean or median of the result.

View more solutions

164,567

Author by

rosghub

Updated on August 01, 2022

Comments

rosghub over 1 year
I wrote these two solutions for Project Euler Q14, in assembly and in C++. They implement identical brute force approach for testing the Collatz conjecture. The assembly solution was assembled with:
```
nasm -felf64 p14.asm && gcc p14.o -o p14
```
The C++ was compiled with:
```
g++ p14.cpp -o p14
```
Assembly, p14.asm:
```
section .data
    fmt db "%d", 10, 0

global main
extern printf

section .text

main:
    mov rcx, 1000000
    xor rdi, rdi        ; max i
    xor rsi, rsi        ; i

l1:
    dec rcx
    xor r10, r10        ; count
    mov rax, rcx

l2:
    test rax, 1
    jpe even

    mov rbx, 3
    mul rbx
    inc rax
    jmp c1

even:
    mov rbx, 2
    xor rdx, rdx
    div rbx

c1:
    inc r10
    cmp rax, 1
    jne l2

    cmp rdi, r10
    cmovl rdi, r10
    cmovl rsi, rcx

    cmp rcx, 2
    jne l1

    mov rdi, fmt
    xor rax, rax
    call printf
    ret
```
C++, p14.cpp:
```
#include <iostream>

int sequence(long n) {
    int count = 1;
    while (n != 1) {
        if (n % 2 == 0)
            n /= 2;
        else
            n = 3*n + 1;
        ++count;
    }
    return count;
}

int main() {
    int max = 0, maxi;
    for (int i = 999999; i > 0; --i) {
        int s = sequence(i);
        if (s > max) {
            max = s;
            maxi = i;
        }
    }
    std::cout << maxi << std::endl;
}
```
I know about the compiler optimizations to improve speed and everything, but I don’t see many ways to further optimize my assembly solution (speaking programmatically, not mathematically).

The C++ code uses modulus every term and division every other term, while the assembly code only uses a single division every other term.

But the assembly is taking on average 1 second longer than the C++ solution. Why is this? I am asking mainly out of curiosity.

Execution times

My system: 64-bit Linux on 1.4 GHz Intel Celeron 2955U (Haswell microarchitecture).
- g++ (unoptimized): avg 1272 ms.
- g++ -O3: avg 578 ms.
- asm (div) (original): avg 2650 ms.
- asm (shr): avg 679 ms.
- @johnfound asm (assembled with NASM): avg 501 ms.
- @hidefromkgb asm: avg 200 ms.
- @hidefromkgb asm, optimized by @Peter Cordes: avg 145 ms.
- @Veedrac C++: avg 81 ms with -O3, 305 ms with -O0.
rosghub over 7 years

Yes I haven't done any formal profiling but I have ran them both a few times and am capable of telling 2 seconds from 3 seconds. Anyway thanks for answering. I already picked up a good deal of info here
Peter Cordes over 7 years

It's probably not just a measurement error, the hand-written asm code is using a 64-bit DIV instruction instead of a right-shift. See my answer. But yes, measuring correctly is important, too.
Peter Cordes over 7 years

@jefferson: just noticed that gcc's -O3 loop wasn't optimal after all. I probably didn't save any latency except through avoiding occasional resource conflicts, but saved several instructions.
Peter Cordes over 7 years

Huh, that's clever. SHR sets ZF only if EAX was 1 (or 0). I missed that when optimizing gcc's -O3 output, but I did spot all other optimizations you made to the inner loop. (But why do you use LEA for the counter increment instead of INC? It's ok to clobber flags at that point, and lead to a slowdown on anything except maybe P4 (false dependency on old flags for both INC and SHR). LEA can't run on as many ports, and could lead to resource conflicts delaying the critical path more often.)
johnfound over 7 years

Well, not particular reason for lea in the first instruction. You are right that inc will be at least more readable. No difference in speed though.
Peter Cordes over 7 years

Unfortunately that optimization doesn't really help with the latency of the loop-carried dependency chain. The only way to get big speedups over the compiler's -O3 output will be to interleave the calculations for multiple n values in parallel (see my answer). But this is a great first step showing how to optimize the properly. Are your timings comparing against gcc -O0 output? Or MSVC debug build, or something?
johnfound over 7 years

Well, parallel computations, of course will make the things much faster, but we are talking about compiler vs hand assembly here. That is why I implemented the same non-parallel algorithm.
Peter Cordes over 7 years

I didn't mean multi-threaded parallel, I meant doing two iterations of the outer loop at once, in a separate set of registers. A compiler would be allowed to do that after inlining the function into main (although probably no current compilers will do that for you).
johnfound over 7 years

@PeterCordes I don't know C/C++ so, compiling with the first command found in google: g++ -Wall q14.c -o q14 :) Including "-O3" makes the difference smaller, but still more than twice slower: 830ms vs 430ms
Peter Cordes over 7 years

You wouldn't see a speed diff for INC vs. LEA, since your AMD CPU is a Bulldozer-family uarch, right? They run LEA on either of the EX01 pipes, same as INC. On Intel SnB-family uarches, simple LEA runs on only two ports of the 3 or 4 ALU ports, and one of those ports is the one needed by complex-LEA (the [3*eax + 1]), according to Agner Fog's testing.
Peter Cordes over 7 years

Interesting, thanks for timing with -O3. Yes, that's the right command.
EOF over 7 years

I've tried out the vectorized approach a while ago, it didn't help (because you can do much better in scalar code with tzcnt and you're locked to the longest-running sequence among your vector-elements in the vectorized case).
Peter Cordes over 7 years

The compiler output from the OP's source code still sucks because the OP used a signed type (and the C semantics for signed division by 2 can't be implemented with just an SHR). That's probably the major reason why compiler output is still losing. Try changing long to unsigned long, like I did in the godbolt link in my answer. That gets a pretty reasonable loop that should have about the same critical path length as our hand-optimized asm loops.
johnfound over 7 years

@PeterCordes - I am not very good in speed optimizations and rarely optimize my code to the last extend for speed. So, you are probably right, so I replaced the LEA with INC.
Peter Cordes over 7 years

Oh, actually Bulldozer might bottleneck on throughput with the compiler output. It has lower latency CMOV and 3-component LEA than Haswell (which I was considering), so the loop-carried dep chain is only 3 cycles in your code. It also doesn't have zero-latency MOV instructions for integer registers, so g++'s wasted MOV instructions actually increase the latency of the critical path, and are a big deal for Bulldozer. So yeah, hand-optimization really does beat the compiler in a significant way for CPUs that aren't ultra-modern enough to chew through the useless instructions.
Peter Cordes over 7 years

@EOF: tzcnt? Oh, right, to do all the n/2 iterations in one step! I wasn't even looking for algorithmic shortcuts; great point. Yeah, the separate-termination problem for stuff like this or Mandelbrot is a problem. You could detect when a single element is done, and replace it, but that's a lot of extra logic, and extra breaking out of the loop. Still, it might be worth it when the average sequence length is high enough. (Hmm, even AVX512CD only has a vectorize lzcnt, not tzcnt, though, right? So prob. not useful compared to a scalar tzcnt version.)
EOF over 7 years

@PeterCordes: I don't think masking out parts of the vector will help, since the longest element still needs to complete (unless you presciently knew which it was and used the faster scalar method on it). The point here is that vectorizing doesn't give you a speedup of your vector-width, but only a fraction of it, and the algorithm is worse, particularly once you use memoization and Haswell's scatter/gather sucks.
Peter Cordes over 7 years

@EOF: no, I meant breaking out of the inner loop when any one of the vector elements hits 1, instead of when they all have (easily detectable with PCMPEQ/PMOVMSK). Then you use PINSRQ and stuff to fiddle with the one element that terminated (and its counters), and jump back into the loop. That can easily turn into a loss, when you're breaking out of the inner loop too often, but it does mean you're always getting 2 or 4 elements of useful work done every iteration of the inner loop. Good point about memoization, though.
EOF over 7 years

@PeterCordes Ah, I hadn't considered that. It sounds pretty nasty to implement though. Also, for testing whether all/any of the elements are done, I'd try ptest before pmovmsk, it does both all and any in one instruction if you're careful. (it's also one clock less latency on Haswell).
Peter Cordes over 7 years

Bullet points are more appropriate formatting than a code block. Please stop putting your text into a code block, because it's not code and doesn't benefit from a monospaced font.
Peter Cordes over 7 years

@EOF: Hmm, even using @johnfound's idea of testing for n==1 by looking for n>>1 == 0, I don't think you can avoid doing a PCMPEQQ. Besides, the scalar code needs the mask to know which element hit the termination condition. Rule of thumb: PTEST is not the best way to branch on a PCMP mask. It's 2 + 1 uops including the jcc. (2+1c latency, too). MOVMSKPD + fused TEST+JCC is 1+1 uop, and 2+1c latency. (MOVMSKPD is worth possible bypass-delay from an integer compare, otherwise we need PEXT or something since we want 64-bit elements not PMOVMSKB 8-bit elements).
Cody Gray over 7 years

I don't really see how this answers the question. This isn't a vague question about whether assembly code or C++ code might be faster---it is a very specific question about actual code, which he's helpfully provided in the question itself. Your answer doesn't even mention any of that code, or do any type of comparison. Sure, your tips on how to benchmark are basically correct, but not enough to make an actual answer.
Cody Gray over 7 years

Great answer, as usual. I saw the title a few hours ago in the "hot questions", and came to write a canonical answer. Your already having done so saved me a lot of time. :-) I'm a bit puzzled though by your statement that you could shorten the dependency chain by splitting up the LEA into LEA+INC. Given my understanding of dependency chains, the fact that the destination of LEA is the source of INC would mean that this would have no effect on its length. Do I misunderstand something here, or did you mean that this would shorten the dependency chain as long as these insns weren't back to back?
Cody Gray over 7 years

Also, I think this is a rather interesting snippet of code, as far as analyzing compiler output goes. It is one of the rare cases where GCC seems to produce far and away the best code, compared to Clang, which has recently (at least from what I've seen) been beating the pants off of GCC in terms of optimizations. (Its code is often more optimal, and virtually never inferior.) Obviously ICC emits quality code, but that is expected. MSVC's output is also interesting, mainly because it is terrible. The inner loop is full of jumps. I didn't benchmark it, but my nose tells me not to expect much.
luk32 over 7 years

"Claiming the C++ compiler better is very bad mistake. And especially in this case. The human always can make the code better that the and this particular problem is good illustration of this claim." You can reverse it and it would be just as valid. "Claiming a human is better is very bad mistake. And especially in this case. The human always can make the code worse that the and this particular question is good illustration of this claim." So I don't think you have a point here, such generalizations are wrong.
Peter Cordes over 7 years

@CodyGray: With unsigned long n, both gcc6.2 and clang3.9 make pretty similar asm with -O3. clang uses one fewer MOV. Are you talking about the OP's original code, where clang -O3 uses a conditional branch to skip the signed /= 2 which it does with shr/add/sar?
Peter Cordes over 7 years

@CodyGray: Different LEA instructions have different latency. LEA with 2 components is 1c latency. LEA with 3 components, requiring two adds ( [base + idx*2 + disp]) is 3c latency (not just 2c, because of SnB-family's standardization of latency). It also only runs on p1.
johnfound over 7 years

@luk32 - But the author of the question can not be any argument at all, because his knowledge of assembly language is close to zero. Every arguments about human vs compiler, implicitly assume human with at least some middle level of asm knowledge. More: The theorem "The human written code will always be better or the same as the compiler generated code" is very easy to be formally proved.
Peter Cordes over 7 years

@luk32: A skilled human can (and usually should) start with compiler output. So as long as you benchmark your attempts to make sure they're actually faster (on the target hardware you're tuning for), you can't do worse than the compiler. But yeah, I have to agree it's a bit of a strong statement. Compilers usually do much better than novice asm coders. But it's usually possible to save an instruction or two compared to what compilers come up with. (Not always on the critical path, though, depending on uarch) . They're highly useful pieces of complex machinery, but they're not "smart".
KevinZ over 7 years

Your optimized solution doesn't work for even numbers in addition to 1. You should add a tzcnt shift before the loop and change the loop from a do while to a while.
Cody Gray over 7 years

Oh, silly me. I just used the Godbolt link from the top of your answer, without bothering to check whether the unsigned fix had been made. Sure enough, once you do that, GCC and Clang indeed emit very similar code. I was looking at the original code, expecting it to be unsigned, and wondering what all that extra business was that Clang was doing! Unfortunately, it doesn't fix MSVC's problem. Although it basically uses the same series of instructions, half of them are followed by a branch. No idea why. It seems to be failing to reorder the instructions to avoid having to skip over them.
Peter Cordes over 7 years

@KevinZ: good point, updated the C to fix that while still matching the layout of the asm. (That's why I used a do{}while())
Peter Cordes over 7 years

@CodyGray: I thought about adding another whole godbolt link, but decided to just add a comment in the -O0 one. And put the unsigned long n inside the bolded part before that code block. I had already noticed that I wasn't communicating that unsigned long n was important, but hadn't gotten around to fixing it.
KevinZ over 7 years

That fixes it too, but now you have also handled the n==1 case for free, so you can lose the comment about that caveat. Sorry to nitpick so much.
Reinstate Monica over 7 years

@jefferson While you can't give gold, you can give a bounty.
Yves Daoust over 7 years

"the overflown number will very likely converge toward 1 without another overflow": the code never stops. (That's a conjecture as I cannot wait until the end of times to be sure...)
Ped7g over 7 years

@YvesDaoust oh, but it does?... for example the 27 series with 8b truncation looks like this: 82 41 124 62 31 94 47 142 71 214 107 66 (truncated) 33 100 50 25 76 38 19 58 29 88 44 22 11 34 17 52‌ 26 13 40 20 10 5 16‌ 8 4 2 1 (rest of it works without truncation). I don't get you, sorry. It would never stop if the truncated value would be equal to some of previously reached in currently ongoing series, and I can't find any such value vs k-bit truncation (but I either can't figure out the Math theory behind, why this holds up for 8/16/32/64 bits truncation, just intuitively I think it works).
Ped7g over 7 years

I should have checked the original problem description sooner: "Although it has not been proved yet (Collatz Problem), it is thought that all starting numbers finish at 1." ... ok, no wonder I can't get grasp of it with my limited hazy Math knowledge... :D And from my sheet experiments I can assure you it does converge for every 2-255 number, either without truncation (to 1), or with 8 bit truncation (to either expected 1 or to 0 for three numbers).
Yves Daoust over 7 years

Hem, when I say that it never stops, I mean... that it does not stop. The given code runs forever if you prefer.
rosghub over 7 years

@johnfound "knowledge of assembly language is close to zero" - can confirm. This question was never meant to start generalized debate over human vs compiler. Really, I was hoping to hear from an experience asm dev as to what the compiler is doing to improve brute force over my baseline asm code. Thank you and Peter Cordes for shedding some light on that
rosghub over 7 years

Just ran your code, yeah much faster than c solution now. By replacing the division with a shift in my code, it's already at least 3 times faster than c. Your use of the carry flag improves it even more. very clever.
johnfound over 7 years

@jefferson Please, write some details about your CPU model/speed and the execution times. It can be interesting to compare the performance on different hardware.
rosghub over 7 years

ok I'll update the OP with details. I'm running on an Intel Celeron 2955U 1.4 GHz chromebook fyi, so latency is very noticable
rosghub over 7 years

@johnfound added CPU info and execution times to OP. I'm getting avg 70-80 ms faster with your asm solution vs c++ -o3. Converted to nasm syntax though so possibly your assembler is better?
johnfound over 7 years

@jefferson There is no "better" assembler. What is good in assembly language is that what you write is what you get. All code good or bad is all because of you.
Peter Cordes over 7 years

Don't feel like you need to award a bounty or anything, though, @jefferson. Writing a clear question that was easy to answer and gets lots of people looking at my answer already helped me plenty. (The question even has some depth if we get into how you could actually optimize.) I already got a gold badge out of this :)
rosghub over 7 years

@Peter Cordes good you totally deserve it. I learned way more from reading your answer than I did in days of reading assembly tutorials. Also have read a bit of those PDFs on optimizing asm you linked. Thanks again for taking the time truly
Peter Cordes over 7 years

@jefferson: your Celeron 2955U is an Intel Haswell, so the perf analysis in my answer applies exactly to your CPU. John's asm should run the loop at ~5 cycles per iteration, same for my optimized asm, and the -O3 compiler output with unsigned long n. Interesting that the compiler output from your original source with signed long is only a bit slower; Haswell is damn impressive at chewing through gcc's extra insns.
johnfound over 7 years

@PeterCordes The strange here is that regardless of the faster CPU and the better architecture, my asm version still running slower than on AMD - 430ms vs 501ms;
Peter Cordes over 7 years

@johnfound: That's not strange, just interesting. I already pointed out that Haswell has higher latency CMOV and complex-LEA than Bulldozer, so yes, in this purely CPU-bound specific loop, Bulldozer runs it significantly faster than Haswell, enough to overcome a 1.4:1 clock-speed difference (modulo measurement error with power-saving). The Haswell is generally faster on average, but that doesn't mean it's faster at everything. (BTW, Broadwell and Skylake would run this about 20% faster per clock than Haswell, since they handle CMOV as 1uop with 1c latency.)
Peter Cordes over 7 years

AMD CPUs often do well on latency-bound integer workloads like this. They also do well on extended precision ADC / SBB loops, since (like CMOV), that's another 3-input instruction (2 integer regs + FLAGS) which Intel pre-Broadwell handled as two uops. Intel CPUs also have slower variable-count shift instructions (shl eax, cl: 3 uops, 2c latency vs. 1M-op 1c latency). However, gmplib.org/gmpbench.html shows that Intel CPUs do still have a sizeable lead in score per GHz with current GMP. Maybe they use SSE/AVX now, where AMD has worse latency?
Peter Cordes over 7 years

Upvoted for analysis of what happens on overflow. The CMP-based loop could use cmp rax,1 / jna (i.e. do{}while(n>1)) to also terminate on zero. I thought about making an instrumented version of the loop that records the max n seen, to give an idea of how close we get to overflow.
Peter Cordes over 7 years

Integer division is not actually faster than FP division on modern x86 CPUs. I think this is due to Intel/AMD spending more transistors on their FP dividers, because it's a more important operation. (Integer division by constants can be optimized to a multiply by a modular inverse). Check Agner Fog's insn tables, and compare DIVSD (double-precision float) with DIV r32 (32-bit unsigned integer) or DIV r64 (much slower 64-bit unsigned integer). Especially for throughput, FP division is much faster (single uop instead of micro-coded, and partially pipelined), but latency is better too.
Peter Cordes over 7 years

e.g. on the OP's Haswell CPU: DIVSD is 1 uop, 10-20 cycles latency, one per 8-14c throughput. div r64 is 36 uops, 32-96c latency, and one per 21-74c throughput. Skylake has even faster FP division throughput (pipelined at one per 4c with not much better latency), but not much faster integer div. Things are similar on AMD Bulldozer-family: DIVSD is 1M-op, 9-27c latency, one per 4.5-11c throughput. div r64 is 16M-ops, 16-75c latency, one per 16-75c throughput.
Peter Cordes over 7 years

I think your for (n = i = (size - 1) | 1; i > 2; n = i -= 2) { loop has enough going on that it would be more readable to just pull the n=i out onto the next line, instead of repeating that logic in the initializer and in the i-=2 part.
rosghub over 7 years

Impressive algorithm there. I'll benchmark it and add results to OP when I get a chance. I knew there was a better way to work out the sequence as I saw others reporting sub 200 ms runtimes over on the PE thread.
MSalters over 7 years

Isn't FP division basically the same as integer-subtract exponents, integer-divide mantissa's, detect denormals? And those 3 steps can be done in parallel.
Peter Cordes over 7 years

Or make data lookup tables for the multiply and add constants, instead of a switch. Indexing two 256-entry tables is faster than a jump table, and compilers probably aren't looking for that transformation.
Peter Cordes over 7 years

Hmm, I thought for a minute this observation might prove the Collatz conjecture, but no, of course not. For every possible trailing 8 bits, there's a finite number of steps until they're all gone. But some of those trailing 8-bit patterns will lengthen the rest of the bitstring by more than 8, so this can't rule out unbounded growth or a repeating cycle.
Veedrac over 7 years

It's not important, but FWIW it's pretty easy to move the _mm_tzcnt out of the loop by just iterating bit = n & -n; n = (n * 3 + bit) >> 1 (n & -n compiles to blsi). The rest can be fixed up by occasionally leaving the loop to shift values down.
Veedrac over 7 years

There is no integral Q such that 12 = 3Q + 1. Your first point is not right, methinks.
hidefromkgb over 7 years

@Veedrac: damn, you`re right. The proof of the first point is incorrect. However I still feel that the very point is true, but I don`t have enough time today to come up with something more suitable to prove it with.
Peter Cordes over 7 years

@Veedrac: Yeah that's probably good. tzcnt + shift on the critical path is probably more latency than it's worth, since long runs of zeros are rare. BMI2 makes variable-count shift only 1 cycle (instead of 3), with SHRX. TZCNT/shift does remove any need for a CMOV, which is nice. I've been playing around with hidefromkgb's idea, and Intel pre-Broadwell has 2c latency CMOV.
Peter Cordes over 7 years

@Veedrac: Been playing around with this: It can be implemented with better asm than the implementation in this answer, using ROR / TEST and only one CMOV. This asm code infinite-loops on my CPU, since it apparently relies on OF, which is undefined after SHRD or ROR with count > 1. It also goes to great lengths to try to avoid mov reg, imm32, apparently to save bytes, but then it uses the 64-bit version of register everywhere, even for xor rax, rax, so it has lots of unnecessary REX prefixes. We obviously only need REX on the regs holding n in the inner loop to avoid overflow.
Peter Cordes over 7 years

I put a version with de-obfuscated C on Godbolt, but its short-link generation is broken right now? I stuck a link to it at the end of my answer for now. It's a bit too messy to edit into someone else's answer ATM. Anyway, it actually compiles to better-looking asm than the original, but my hand-tuned asm based on this idea and hide's asm is even faster.
Peter Cordes over 7 years

@Veedrac and hidefrom: ok, updated my answer with working code based on this.
Veedrac over 7 years

@PeterCordes I'm not sure you fully grokked my suggestion; it's a lot faster than a CMOV. Try this: godbolt.org/g/8qggOm.
Peter Cordes over 7 years

Timing results (from a Core2Duo E6600: Merom 2.4GHz. Complex-LEA=1c latency, CMOV=2c). The best single-step asm inner-loop implementation (from Johnfound): 111ms per run of this @main loop. Compiler output from my de-obfuscated version of this C (with some tmp vars): clang3.8 -O3 -march=core2: 96ms. gcc5.2: 108ms. From my improved version of clang's asm inner loop: 92ms (should see a much bigger improvement on SnB-family, where complex LEA is 3c not 1c). From my improved + working version of this asm loop (using ROR+TEST, not SHRD): 87ms. Measured with 5 reps before printing
Peter Cordes over 7 years

@Veedrac: Yeah, I was still in the middle of thinking about the two-steps-at-once update I just posted. Thanks for the godbolt link, that makes it clear. I see you came up with a way to actually decide when it's safe to skip the right shifts. Yeah, that should help a ton. (There's no way to compute k from bit or something, is there? gcc's .L11: mov esi, 5; jmp .L2 blocks for every possible k it could break on seem like a waste, but I guess it's actually better than incrementing a counter in the fully-unrolled loop.) Cool way to take advantage of wide registers.
Peter Cordes over 7 years

@MSalters: yeah, that sounds right, but with a normalization step at the end ot shift bits between exponent and mantiss. double has a 53-bit mantissa, but it's still significantly slower than div r32 on Haswell. So it's definitely just a matter of how much hardware Intel/AMD throw at the problem, because they don't use the same transistors for both integer and fp dividers. The integer one is scalar (there's no integer-SIMD divide), and the vector one handles 128b vectors (not 256b like other vector ALUs). The big thing is that integer div is many uops, big impact on surrounding code.
Veedrac over 7 years

@PeterCordes The jumps do look wasteful but they don't seem to cost that much. You can just remove them to check how much they cost - you'll get the wrong result but from the right code path. I find the difference is a couple of percent; the branch predictor seems to be doing its job.
rosghub over 7 years

added @hidefromkgb and your optimized version times on my system to OP. runs in 145ms. I'm in total envy of your asm skills. I need to go back to CS1 and learn about bits again lol.
Veedrac over 7 years

@jefferson If you're adding timings to the question, why not add this version too?
rosghub over 7 years

@Veedrac sure; how did you compile this? gcc -std=c++11 p14.cpp -o p14 isn't resolving call _tzcnt_u64()
Veedrac over 7 years

@jefferson Add -march=native to use intrinsics, assuming your CPU supports tzcnt. (Remember to change limit back to its original value before timing!)
rosghub over 7 years

@Veedrac yeah I tried -march=native and its still not compiling. I guess this means my CPU doesn't support it?
Veedrac over 7 years

@jefferson Hang on, I'll make a version that doesn't need it. Should still be faster.
Veedrac over 7 years

@jefferson Best I managed is godbolt.org/g/1N70Ib. I was hoping I could do something smarter, but it seems not.
Peter Cordes over 7 years

To update count, you need a third array, right? adders[] doesn't tell you how many right-shifts were done.
Peter Cordes over 7 years

For larger tables, it would be worth using narrower types to increase cache density. On most architectures, a zero-extending load from a uint16_t is very cheap. On x86, it's just as cheap as zero-extending from 32-bit unsigned int to uint64_t. (MOVZX from memory on Intel CPUs only needs a load-port uop, but AMD CPUs do need the ALU as well.) Oh BTW, why are you using size_t for lastBits? It's a 32-bit type with -m32, and even -mx32 (long mode with 32-bit pointers). It's definitely the wrong type for n. Just use unsigned.
Peter Cordes over 7 years

@Veedrac: Celeron/Pentium (even Skylake Celeron) doesn't support any VEX-encoded instructions (no AVX, no BMI1/2). Since it doesn't support all of BMI1, they may have left out even the BMI1 instructions that don't require a VEX prefix (like TZCNT). Even if it does support TZCNT, -march=native definitely won't enable it because the CPU doesn't report BMI1 support. Your best alternative is BSF, which does have intrinsics. It's identical to TZCNT when the input is non-zero. (except it sets flags differently, and is slower on AMD.) There's __builtin_ctz, and an Intel intrinsic for BSF.
Peter Cordes over 7 years

@Jefferson: Compiling with -mbmi1 would tell the compiler it could use TZCNT. Even if your CPU doesn't support it, it would run as rep bsf which gives the same answer when the input is non-zero. However, that would also tell the compiler it could use BLSI, which your CPU doesn't support. Probably the easiest thing to test is actually LZCNT, since that gives different results from BSR even in the non-zero case. (But it still decodes as REP BSR on CPUs that don't recognize it). BSF and BSR have dest=undefined on input=0, with the actual behaviour being that dest=unmodified on Intel.
Peter Cordes over 7 years

@jefferson: Times with -O0 are not interesting or relevant to anything. Don't bother timing with -O0 (unless you're actually working on compiler internals or something).
Veedrac over 7 years

@jefferson PeterCordes makes a good point about BSF. Here's a version using __builtin_ctzll. I'd imagine your times are hurt most by the lack of a blsi instruction, though.
Peter Cordes over 7 years

@Veedrac: BLSI on Haswell is 1c latency, but the MOV+NEG+AND sequence is 2c latency (with zero-latency mov). The rest of the critical path in that loop is simple_LEA(1c) + ADD(1c), so yeah, it raises the critical-path latency from 3 to 4 cycles :( This dual-strategy approach makes it trickier to implement a version that interleaves calculations for two sequences in parallel, which would hide that latency. Due to the latency bottleneck, it should be pretty Hyperthreading-friendly if you parallelize the loop that calls sequence, and prob. get near-linear speedups from 2 threads on one core.
Peter Cordes over 7 years

@Veedrac: Further ideas: break out of the deferred-shift loop when carry is detected. (It's possible to write code in C that gets gcc to just check CF as set by the ADD instruction that computes n * 3 + bit. IIRC, by checking if the unsigned result is less than one of the inputs.) Then we just need to tzcnt & shift, and get back into the delayed-shift loop. That requires an unpredictable branch, though, instead of a fully-unrolled 16 iterations.
Peter Cordes over 7 years

Another way to stay in the delayed-shift loop longer, or get back in sooner: do the right-shifts that are always needed, like right shift by 4 every 4 iterations. Tiny impact on critical path, and no variable branching (except for bit == n exit condition, which BTW could be more efficient if it skipped the while (n >> maximum_bit_size) check and jumped to dedicated return-path code, for faster returns and because then the branch to decide whether to run the non-deferred loop could jump right back into the deferred loop instead of jumping to a n==1 check. Short sequences matter.
Peter Cordes over 7 years

@Veedrac: We can also apply a variant of hidefromkgb's 2-step approach for the non-deferred loop: count += 1 + (n&1); n = (n&1) ? (n>>1) + n + 1 : n>>1;. Since AND sets flags, gcc should sort that out into good code. count isn't on the critical latency path anyway. Also, gcc compiles while(n >> maximum_bit_size) into a shr rdi, 36 / jne, instead of a TEST/JNE with a mask in another register. Macro-fusion should reduce latency for detecting branch mispredictions. Also could maybe unroll a few iterations of that so the branch is taken more.
rosghub over 7 years

@Veedrac I ran your __builtin_ctzll version and got nearly identical times as the previous one. I managed to get the _tzcnt_ version compiled with gcc -mbmi -std=c++11 -O3 p14.cpp but it crashes and prints Illegal instruction. Gcc isn't recognizing -mbmi1 as an option
rosghub over 7 years

@Vreedrac actually just got it working with no optimizations (-O0). its running around 280-290ms
Peter Cordes over 7 years

@jefferson: Yeah, it's -mbmi, not bmi1, my mistake. That tells gcc it can use any BMI1 instruction, including BLSI, which your CPU doesn't support. (With -O0 it doesn't do that peephole optimization). The only question is whether it runs TZCNT and LZCNT as their own instructions, or whether it runs them as BSF and BSR (ignoring the REP prefix). They're part of BMI1, but they don't use the VEX encoding for their machine code. If you're curious, you can try a test by putting them in a .asm and single-stepping in a debugger. (unlike LZCNT, TZCNT vs. BSF is very subtle, see the docs.).
Veedrac over 7 years

@PeterCordes Breaking out on overflow adds a branch in the main loop, which is still the most expensive part. Shifting by 4 every now and again makes sense but it doesn't make the loop that much longer, so doesn't really pay for itself. The extra n >> maximum_bit_size check on return doesn't bother me much because it's easy to predict and it only hurts once per call anyway. I've not found optimizing the innards of the non-deferred loop further to be beneficial either, mostly because it's a largely cold branch.
Peter Cordes over 7 years

@Veedrac: the branch doesn't lengthen the critical path latency of the loop-carried dependency chain, though. I wouldn't be surprised if it doesn't help, but if it hurts, it should just be from mispredicts. Haswell can macro-fuse ADD/JC, (and make 2 fusions in one block) so the uop count doesn't even change. (If you manage to write the C++ in a way that the compiler likes! Not always a given). This should be running out of the uop cache 100%, but the extra machine-code size would reduce the uop-cache density, and maybe lead to lines/ways that have fewer uops, reducing frontend throughput.
Peter Cordes over 7 years

@Veed Current version of my tweaks to your code (godbolt.org/g/IJt9eE), with the #ifs enabled. clang actually makes some nasty code when fully-unrolling, so an if(k%4 == 3) n>>=4; actually improves it. gcc is ok either way. Assuming the loop-branch predicts the 4/1 taken/not-taken pattern perfectly, it should be near-zero overhead. I messed around with the n>>max_size loop, too, and got some interesting compiler behaviour. I had to hand-hold gcc and clang to the asm I wanted. Good point that it's cold, though. I hadn't checked whether it ever ran. Would be useful with big n.
T.E.D. over 7 years

@Jefferson - In a world where computers clobber chess grandmasters, 99.9% of developers have no business whatsoever writing their own assembly to try to "beat the compiler". If you didn't know enough about binary representation and CPU architecture to know that a shift would be way faster than a divide by 2, you are definitely in that 99.9%. Go do stuff the computer can't do for you (write a nice queue or something), and let the compiler do the stuff its better at.
Veedrac over 7 years

@PeterCordes Your tweaked version isn't correct; you need to account for the shifts down by increasing the counter. This can be done for free if the loop is unrolled by replacing count += k + shift with count += j + shift where j is also incremented in the branch; this gets pushed into the double jump and the compiled result is otherwise the same. // Clang's unrolled version is definitely worse than GCC's, but I still time it as 10% faster than the partially-unrolled variant. I don't know why. Any thoughts?
Peter Cordes over 7 years

Oh BTW, I miscalculated the latency with/without BLSI from BMI1: it's 2c per iteration with, 3c per iteration without, because BLSI and LEA can run in parallel, with an ADD at the end. Since this code bottlenecks on latency, BMI1 could make a 50% difference in run-time! Also, I found a version of the count+=one-or-two non-deferred loop that compiles to 4c latency for n's dep chain (3c on SKL). That's the same latency as for the best hand-tuned version that always does 2 at a time. @Veedrac: I'll look at 10% later. updated my answer with what I was working on. I noticed the same count bug.
Peter Cordes over 7 years

And now my answer is 30k chars, thanks to godbolt links (which are too long for goo.gl to shorten, so I can't even post them in comments :/). I think that means I'm done, unless I want to start writing more answers :P Anyway, I got a small speedup from doing a shift by the unroll count inside the loop (for Merom; I don't have a Haswell or SKL to test on. :(. To get gcc and clang to both not make a mess, I had to use a CPP macro to replicate the block :/
rosghub over 7 years

@T.E.D. I didn't write the assembly to try and beat the compiler, I wrote for the fun of it. And most of the PE's I solved in asm that didn't involve division were already faster the my C solutions; I asked about this one because of its simplistic nature and long (brute force) runtime
Peter Cordes over 7 years

@jefferson: Interesting to hear what your motivation was for writing this. Writing asm that actually computes something is usually more interesting than when the main problem is making the right system calls (like you see in a lot of SO questions from total beginners that get stuck trying to write their whole program in asm, including the reading input and printing results). However, for speed, were you comparing against un-optimized compiler output again? If so, you might find it interesting to see how much -O3 speeds up the C++ for different problems. It's not a constant factor...
rosghub over 7 years

yeah like I said I was never really concerned about speed, but it becomes interesting. Typically my asm times were somewhere between the -O0 and -O3. I usually take -O0 as a baseline and if my asm code is way slower then I start to wonder why. However with my novice asm knowledge I never expected to beat compiler -O3. I'm quite fascinated by those who can though. I'm curious do you work on embedded systems or another job related to coding in asm?
Veedrac over 7 years

@PeterCordes Nice work on the new code in your edit; that inner loop is pretty extreme. We're averaging better than 1 iteration per clock cycle on my laptop, which is pretty darn good :).
Peter Cordes over 7 years

@Veedrac: The unroll-by-16 is almost exactly what gcc does with your code. I only had to manually replicate the loop body (with a macro) to get more stuff to happen every 16th iteration without the compiler making a mess.
Peter Cordes over 7 years

@jefferson: No, I don't have a job doing this stuff. I just think it's cool. (I'd be interested in doing it professionally, though. I currently spend my time writing SO answers and working on open-source software). BTW, when I was learning asm, I mostly looked at compiler output to see how it did things, and using that as a starting point to see if I could see a different way to do it. That's definitely a way to learn good idioms (and occasionally some really neat transformations of the problem you're working on, like I described in comments on the single-iteration loop in my last update)
sidgeon smythe over 7 years

Not just the proof but your claim of "We may safely assume that the element we are looking for is odd" is incorrect: for example 18 and 54 are even numbers which produce longer chains than any smaller number. (Not counting 2 and 6 because for some reason you specified E ≥ 10.)
sidgeon smythe over 7 years

Here are the first 66 record-setters (A006877 on OEIS); I've marked the even ones in bold: 2, 3, 6, 7, 9, 18, 25, 27, 54, 73, 97, 129, 171, 231, 313, 327, 649, 703, 871, 1161, 2223, 2463, 2919, 3711, 6171, 10971, 13255, 17647, 23529, 26623, 34239, 35655, 52527, 77031, 106239, 142587, 156159, 216367, 230631, 410011, 511935, 626331, 837799, 1117065, 1501353, 1723519, 2298025, 3064033, 3542887, 3732423, 5649499, 6649279, 8400511, 11200681, 14934241, 15733191, 31466382, 36791535, 63728127, 127456254, 169941673, 226588897, 268549803, 537099606, 670617279, 1341234558
sidgeon smythe over 7 years

@hidefromkgb Great! And I appreciate your other point better too now: 4k+2 → 2k+1 → 6k+4 = (4k+2) + (2k+1) + 1, and 2k+1 → 6k+4 → 3k+2 = (2k+1) + (k) + 1. Nice observation!
Peter Cordes over 7 years

BTW, just posted a big update that tidied up the early sections, and added a section about guiding/hand-holding the compiler to the ASM output you want, instead of actually including hand-written asm in production code.
camden_kid over 7 years

The thing that amazes me about incredible answers such as this is the knowledge shown to such detail. I will never know a language or system to that level and I wouldn't know how. Well done sir.
Peter Cordes over 7 years

Totally agree with this. gcc -O3 made code that was within 20% of optimal on Haswell, for that exact algorithm. (Getting those speedups were the main focus of my answer only because that's what the question asked, and has an interesting answer, not because it's the right approach.) Much bigger speedups were obtained from transformations that the compiler would be extremely unlikely to look for, like deferring right shifts, or doing 2 steps at a time. Far bigger speedups than that can be had from memoization / lookup-tables. Still exhaustive testing, but not pure brute force.
Peter Cordes over 7 years

Still, having a simple implementation that's obviously correct is extremely useful for testing other implementations. What I'd do is probably just look at the asm output to see if gcc did it branchlessly like I expected (mostly out of curiousity), and then move on to algorithmic improvements.
Peter Cordes over 7 years

gnasher's answer shows that you can do much more than just cache the tails: high bits don't affect what happens next, and add / mul only propagate carry to the left, so high bits don't affect what happens to the low bits. i.e. you can use LUT lookups to go 8 (or any number) of bits at a time, with multiply and add constants to apply to the rest of the bits. memoizing the tails is certainly helpful in a lot of problems like this, and for this problem when you haven't thought of the better approach yet, or haven't proved it correct.
Emanuel Landeholm over 7 years

If I understand gnasher's idea above correctly, I think tail memoization is an orthogonal optimization. So you could conceivably do both. It would be interesting to investigate how much you could gain from adding memoization to gnasher's algorithm.
Peter Cordes over 7 years

Oh, right you could have another LUT for once you get below n=128. Good point, if a lot of sequences get down to small numbers quickly, the tails could end up taking a lot of time once you speed up the process of getting there by a huge factor.
Peter Cordes over 7 years

I doubt it's worth the CPU time to compute hash functions and manage a huge hash table to memoize the whole thing, since cache misses take hundreds of cycles. The recurrence with gnasher's method takes ~10 cycles on Haswell per iteration (L1 load-use latency(5) + multiply(3) + ADD + AND latency), doing 8 bits every time. I guess worth trying, though. IIRC, Skylake L1 load-use latency is 1c faster for integer loads.
Emanuel Landeholm over 7 years

You could tune memoization by only caching tails of length in a certain range. (I have working code that does just that). I wouldn't fret about hashing. Hashing is fast these days. Hashing is sufficiently close to O(1) that you needn't worry. You only need gnasher's code when there is a cache miss. The way I understand gnasher's code is that it is basically compressing a fixed number of iterations into one iteration, which means that collatz sequence generation is still O(log N) or whatever. O(1) beats O(log N) for N sufficiently large.
Peter Cordes over 7 years

The speed of a hash table on real hardware also worsens with size, as it grows larger than L1 cache, then larger than L2 cache, then L3, then TLB misses. (Or even larger than RAM, requiring a disk access). You have a good point, but the constant factors matter a lot. Gnasher's LUT is still actually O(seq_len): it does fixed numbers of iterations at once. The multiply and add can grow N more than knocking off the last 8 bits shrinks it. (Doing any better than O(seq_len) with O(1) memory would probably require a proof of the Collatz conjecture or something, or be a proof.)
Emanuel Landeholm over 7 years

Still, hashing is O(1). The main problem with memoization is memory usage. For unbounded input you really need tunable memoization. And you need to prove empirically that caching really helps. It really depends on the usage. If all you need is collatz(range(1, N)), caching probably (provably?) wins. For other use cases, naive caching may well lose.
Peter Cordes over 7 years

Ok, you're talking about unbounded input, but I'm talking about ranges of N small enough to check all of them, with only simple tricks like Veedrac's observation that all 0..N/2 have a double that's one longer. If most values in a range have short sequences, short enough that a cache miss takes longer than running the whole sequence with cleverly-applied brute force, then memoization loses. (Out-of-order execution can give us multiple oustanding cache-misses in parallel, though, if there are few enough instructions for the ROB to not fill up before the hash function computes an addr for N+1)
Peter Cordes over 7 years

We can maybe make memoization cheaper by only storing the dense part of the results. Set an upper limit on N, and above that, don't even check memory. Below that, use hash(N) -> N as the hash function, so key = position in the array, and doesn't need to be stored. An entry of 0 means not present yet. We can further optimize by only storing odd N in the table, so the hash function is n>>1, discarding the 1. Write the step code to always end with a n>>tzcnt(n) or something to make sure it's odd.
Peter Cordes over 7 years

That's based on my (untested) idea that very large N values in the middle of a sequence are less likely to be common to multiple sequences, so we don't miss out on too much from not memoizing them. Also that a reasonably-sized N will be part of many long sequences, even ones that start with very large N. (This may be wishful thinking; if it's wrong then only caching a dense range of consecutive N may lose out vs. a hash table that can store arbitrary keys.) Have you done any kind of hit-rate testing to see if nearby starting N tend to have any similarity in their sequence values?
Peter Cordes over 7 years

There is some pattern to the branching; e.g. an odd number is always followed by an even number. But sometimes 3n+1 leaves multiple trailing zero bits, and that's when this will mispredict. I started writing about division in my answer, and didn't address this other big red flag in the OP's code. (Note also that using a parity condition is really weird, compared to just JZ or CMOVZ. It's also worse for the CPU, because Intel CPUs can macro-fuse TEST/JZ, but not TEST/JPE. Agner Fog says AMD can fuse any TEST/CMP with any JCC, so in that case it's only worse for human readers)
Peter Cordes over 7 years

In my first comment under this answer, I left out a word: inc doesn't "lead to a slowdown on anything except maybe P4". Unfortunately there's no way to fix it now, unless a moderator could do that for me. I think deleting the explanation of why branching on ZF works there would be worse overall.
Emanuel Landeholm over 7 years

Well, for bounded input you could always pre-compute all of the sequences :) I haven't done any hard and fast hit-rate testing. It's nontrivial to simulate "real" usage agnostically. But I have found that you really can tune memory/performance by only caching certain tails.
gnasher729 over 7 years

You can just store pre-computed results for all n < N, for some large N. So you don't need the overhead of a hash table. The data in that table will be used eventually for every starting value. If you just want to confirm that the Collatz sequence always ends in (1, 4, 2, 1, 4, 2, ...): This can be proven to be equivalent to proving that for n > 1, the sequence will eventually be less than the original n. And for that, caching tails will not help.
Darren Ringer over 7 years

"@luk32 - But the author of the question can not be any argument at all, because his knowledge of assembly language is close to zero. " -- Hmm, an excellent example of the "No True Scotsman" fallacy. All the more apt, because as everyone knows, all true scotsmen are world-class asm programmers.
johnfound over 7 years

@DarrenRinger - Yes you seems to be very smart, knowing about the NTS fallacy. But in this case it is not applicable at all. Read it carefully: "Any comparison man vs compiler implicitly assumes that the man knows assembly language at least at some middle level". Now disprove it if you can.
Peter Cordes over 7 years

Err, not shift bits between mantissa and exponent, but normalize the mantissa with a shift, and add the shift amount to the exponent.
Joe over 7 years

I don't understand C++, asm, or anything you talked about but it's easy to spot an answer that's this high-quality. Have an upvote :)
Peter Cordes over 7 years

1): adding 3 times would be dumb compared to LEA. Also mul rbx on the OP's Haswell CPU is 2 uops with 3c latency (and 1 per clock throughput). imul rcx, rbx, 3 is only 1 uop, with the same 3c latency. Two ADD instructions would be 2 uops with 2c latency.
Peter Cordes over 7 years

2) ADD 1 is probably faster than INC here. Nope, the OP is not using a Pentium4. Your point 3) is the only correct part of this answer.
Darren Ringer over 7 years

@johnfound of course, I cannot, I was mainly just trying to lighten the discussion. But the whole point of a compiler is to save the programmer from being constantly tested on the cleverness of his ASM, which in most programmers is frequently inferior.
Peter Cordes over 7 years

4) sounds like total nonsense. 64-bit code can be slower with pointer-heavy data structures, because larger pointers means bigger cache footprint. But this code is working only in registers, and code alignment issues are the same in 32 and 64 bit mode. (So are data alignment issues, no clue what you're talking about with alignment being a bigger issue for x86-64). Anyway, the code doesn't even touch memory inside the loop.
Tyler Durden over 7 years

The commenter has no idea what is talking about. Do a MOV+MUL on a 64-bit CPU will be roughly three times slower than adding a register to itself twice. His other remarks are equally incorrect.
Peter Cordes over 7 years

Well MOV+MUL is definitely dumb, but MOV+ADD+ADD is still silly (actually doing ADD RBX, RBX twice would multiply by 4, not 3). By far the best way is lea rax, [rbx + rbx*2]. Or, at the cost of making it a 3-component LEA, do the +1 as well with lea rax, [rbx + rbx*2 + 1] (3c latency on HSW instead of 1, as I explained in my answer) My point was that 64-bit multiply is not very expensive on recent Intel CPUs, because they have insanely fast integer multiply units (even compared to AMD, where the same MUL r64 is 6c latency, with one per 4c throughput: not even fully pipelined.
Peter Cordes over 7 years

Another editing error in my prev comment: DIVPD is somewhat faster than div r32 on Haswell. DIVPS is 1 uop, 10-20c latency, one per 8-14c throughput. idiv r32 is 9 uops, 22-29c latency, and one per 8-11c throughput. (div r32 is 10 uops, same latency, and one per 9-11c throughput.)
Peter Cordes over 7 years

For comparison, single-precision 24-bit mantissa DIVPS is 1 uop, 10-13c latency, one per 7c throughput. FDIV (80-bit) is 1uop, 10-24c lat, one per 8-18c tput. So float precision makes a big difference in performance, but FP division is always a single uop, not microcoded. Fun fact, Skylake pipelines the FP dividers much more aggressively, with one per 4c DIVPD, and one per 3c DIVPS. (Still about half throughput for 256b vectors, but same latency, so I guess it's not so much extra pipelining as another partially-pipelined 128b divider that YMM vectors can use in parallel).
Peter Cordes over 7 years

I just tested on Merom. The FP and integer dividers do compete with each other. e.g. a loop which bottlenecks only on the throughput of DIVSD and div r32 runs at one per ~67 = 36+31 cycles, for operands that give a throughput of one per ~36c for div r32 and one per ~31c for DIVSD when tested separately. So they can't overlap with each other at all. (That's the upper end of Agner Fog's listed latencies, since I used inputs that would give a large quotient.) I guess integer division does use the same hardware as FP division, even DIVSS. Perhaps that's why it takes multiple uops.
Peter Cordes over 7 years

(Sorry for the spam, Dmitry. I should find somewhere else to put these results / comments).
Peter Cordes over 7 years

re: your edit. It's not plausible that any CPU design ever would have integer and / or FP division in hardware, but not efficient shifts. Some architectures (like 8-bit RISC AVR) only have a shift-by-1 instruction, but those architectures definitely don't have hardware dividers. (8086 didn't originally have immediate-count shifts, but you could mov cl, count / shr reg/m16, cl to shift by a compile-time-constant count greater than 1.) And of course this only needs a shift by 1, which every ISA has.
Dmitry Rubanovich over 7 years

@Peter Cordes, but I am not arguing that shr may not have been present in the architecture where the test was ran. In fact, I am saying that compiler most likely decides to use it. The original question was "why?" as in "why is this happening". I simply answered "probably because..." And the edit effectively retracted one of the possibilities I proposed. I guess, the "however" should be a "so." I'll change to clarify... now about that whole Collatz conjecture... There might be enough in it for a paper. :)
d33tah over 6 years

How do you know how fast instructions run and what are their latencies?
Z boson over 6 years

This question is trending at number 3 right now on ycombinator news.ycombinator.com/item?id=15070244
warl0ck over 6 years

@PeterCordes best detailed answer I have ever come across, excellently explained sir.
Peter Cordes over 6 years

Did this get linked somewhere yesterday? Over 50 upvotes in the last day, vs. the usual trickle of a couple per week on this answer.
csch over 6 years

@PeterCordes I got here via a tweet twitter.com/wycats/status/929392342041182208
Peter Cordes over 6 years

@csch: thanks. I'm glad so many people got something out of what I wrote. I'm pretty proud of it, and think it does a good job of explaining some optimization basics and specific details relevant for this problem.
Jacko over 6 years

I stand in awe of this answer's depth and clarity
rosghub over 6 years

@PeterCordes It certainly helped and inspired me. At the time I wrote this question I was messing around learning assembly re-doing old project eulers. Barely knew anything. Not long after I was able to write faster running programs for all of the first 100 problems than my C solutions, and even some of the newer 500 level problems. I particularly found problem 571 very fun to optimize in asm. I wish I chose that one to open the question for lol. Although the simplicity of problem 14 certainly helped this question grow.
Peter Cordes over 6 years

@jefferson: Collatz is really easy to grok, and is great for x86's LEA instruction, and tradeoff of latency vs. throughput (two simple-LEA having lower latency than one 3-component LEA on SnB-family). It makes a very good example because of that and the variations in algorithm (deferred right-shifting, baking in one right-shift, etc. etc.) Feel free to ask a new question about optimizing n-super-pandigital number finding, even though it sounds like a pan-dimensional gargleblaster.
supercat almost 6 years

@PeterCordes: If one is running an algorithm on values much larger than a machine word, given an algorithm to produce a scaling power and residue given the bottom N bits, one could use a recursive algorithm to produce those things from the bottom 2N bits while requiring only one multiplication and addition of the bits beyond that.
phuclv almost 6 years

XOR RSI, RSI is not a good idea. Use xor esi, esi instead
paulm almost 5 years

Have compilers got good enough to match your hand written one yet? ;)
Peter Cordes almost 5 years

@paulm: nope, gcc.godbolt.org/z/wxtWWC shows ICC19, and gcc and clang latest nightly builds, still don't use the CF output of shr, even with -march=skylake or -march=znver1. GCC 7 and later decides to actually branch instead of using CMOV. Interesting choice; wonder how well it predicts. I haven't tested. I also haven't looked at what we'd get with profile-guided optimization.
Evgenii Klepilin about 4 years

Dang, this answer deserves applause and scientific paper publishing!
gnasher729 over 3 years

Instead of just unrolling the loop: Let x = 256a + b, 0 ≤ b < 256. Then the value of b determines all the decisions up to the point where you have performed eight divisions by 2. For example 256a+3 -> 3*256a+10->3*128a+5->9*128a+16->...->9*8a+1 -> 27*8a+4-> -> 27*2a+1 -> 81*2a+4 -> 81a + 2. Build a table that would contain for example the values 81 and 2 for b = 3, and you can do in this case 12 steps with one shift, multiply and add.
nvmd over 2 years

@PeterCordes I barely understood 10% of the answer. How can I learn what you know?
Peter Cordes over 2 years

@nvmd: Read Agner Fog's manuals, especially the microarchitecture guide for Sandybridge, and David Kanter's realworldtech.com/sandy-bridge deep dive. Also en.wikichip.org/wiki/amd/microarchitectures/zen_2#Architectu‌re has some good info on modern ISAs like Zen 2. Anyway the key thing is to get a mental model of how an OoO exec pipeline processes instructions, and how front-end throughput vs. latency vs. back-end bottlenecks can limit loops. Play around with optimizing asm by hand for some loop you find interesting, you'll learn a lot from that.
Peter Cordes over 2 years

@nvmd: Also, spend ~5 years following [assembly] [x86*] [sse*][avx*] and related tags on Stack Overflow, and researching + answering performance questions that come up, and reading other people's answers on them. (When I wrote this answer back in 2016, I'd been getting into serious asm optimization and doing that for close to 2 years, although long before that I'd been interested in how low-level stuff in general like Linux system calls worked, and in how CPUs worked and read Anandtech and other similar sites reviewing new CPU microarchitectures when they came out.)
nvmd over 2 years

@PeterCordes Thanks for the tips, but I meant a "where to start" sort of thing. My understanding of Computer Architecture in general is pretty basic and high level, and my understanding of CPU architecture can be summed up by this 2 part series by Intel youtube.com/watch?v=vgPFzblBh7w I know about Fog's manuals, and they're too advanced to me. Only thing I got from the architecture software manuals is that they're huge. I want to learn the basics/common core because I lose interest if I start something and realize I missing the prerequisites. Any book recommendations?
Peter Cordes over 2 years

@nvmd: Yeah, Intel's SDM manuals are too huge, I haven't really read them, just occasional parts while researching one thing. As for where to start, IDK, I knew some asm from playing around with things on my own and looking at compiler output, and was able to understand Agner Fog's microarch PDF when I read it. Over multiple readings to look up stuff when it was relevant to different asm loops I was optimzing, I got a deeper understanding of how the pieces fit together. But if you can't even get a foothold with Agner Fog's general optimization manual, IDK, learn more asm.
Peter Cordes over 2 years

@nvmd: Another good place to start is looking at compiler output; see Matt Godbolt's talk linked from How to remove "noise" from GCC/clang assembly output?, and the rest of the Q&A.
Jim over 2 years

What a wonderful answer!