Speed comparison with Project Euler: C vs Python vs Erlang vs Haskell

Solution 1

Using GHC 7.0.3, gcc 4.4.6, Linux 2.6.29 on an x86_64 Core2 Duo (2.5GHz) machine, compiling using ghc -O2 -fllvm -fforce-recomp for Haskell and gcc -O3 -lm for C.

  • Your C routine runs in 8.4 seconds (probably faster than your run because of -O3)
  • The Haskell solution runs in 36 seconds (thanks to the -O2 flag)
  • Your factorCount' code isn't explicitly typed, so it defaults to Integer (thanks to Daniel for correcting my misdiagnosis here!). Giving an explicit type signature (which is standard practice anyway) using Int changes the time to 11.1 seconds
  • In factorCount' you have needlessly called fromIntegral. A fix results in no change though (the compiler is smart, lucky for you).
  • You used mod where rem is faster and sufficient. This changes the time to 8.5 seconds.
  • factorCount' is constantly applying two extra arguments that never change (number, sqrt). A worker/wrapper transformation gives us:
 $ time ./so
 842161320  

 real    0m7.954s  
 user    0m7.944s  
 sys     0m0.004s  

That's right, 7.95 seconds. Consistently half a second faster than the C solution. Without the -fllvm flag I'm still getting 8.182 seconds, so the NCG backend is doing well in this case too.

Conclusion: Haskell is awesome.

Resulting Code

-- Subtract 1 when number is a perfect square, because the loop in
-- factorCount' counts the square root twice (once per divisor pair).
factorCount number = factorCount' number isquare 1 0 - (fromEnum $ square == fromIntegral isquare)
    where square = sqrt $ fromIntegral number
          isquare = floor square

factorCount' :: Int -> Int -> Int -> Int -> Int
-- Worker/wrapper: the inner loop 'go' carries only the arguments that
-- change, while number and sqrt stay fixed in its closure.
factorCount' number sqrt candidate0 count0 = go candidate0 count0
  where
  go candidate count
    | candidate > sqrt = count
    | number `rem` candidate == 0 = go (candidate + 1) (count + 2)
    | otherwise = go (candidate + 1) count

-- Triangle numbers: T(index + 1) = T(index) + index + 1
nextTriangle index triangle
    | factorCount triangle > 1000 = triangle
    | otherwise = nextTriangle (index + 1) (triangle + index + 1)

main = print $ nextTriangle 1 1

EDIT: So now that we've explored that, let's address the questions:

Question 1: Do Erlang, Python and Haskell lose speed due to using arbitrary length integers or don't they as long as the values are less than MAXINT?

In Haskell, using Integer is slower than Int, but how much slower depends on the computations performed. Luckily (for 64-bit machines) Int is sufficient. For portability's sake you should probably rewrite my code to use Int64 or Word64 (C isn't the only language with a long).
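
A minimal sketch of that portable variant (only the types change from the code above; I renamed the second parameter to sqrt' simply to avoid shadowing Prelude's sqrt):

import Data.Int (Int64)

-- Same worker/wrapper loop, pinned to an explicitly 64-bit type so the
-- width no longer depends on the platform's native Int.
factorCount' :: Int64 -> Int64 -> Int64 -> Int64 -> Int64
factorCount' number sqrt' candidate0 count0 = go candidate0 count0
  where
  go candidate count
    | candidate > sqrt' = count
    | number `rem` candidate == 0 = go (candidate + 1) (count + 2)
    | otherwise = go (candidate + 1) count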

Question 2: Why is Haskell so slow? Is there a compiler flag that turns off the brakes or is it my implementation? (The latter is quite probable as Haskell is a book with seven seals to me.)

Question 3: Can you offer me some hints how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.

That was what I answered above. The answer was

  • 0) Use optimization via -O2
  • 1) Use fast (notably: unbox-able) types when possible
  • 2) rem not mod (a frequently forgotten optimization; see the sketch after this list) and
  • 3) worker/wrapper transformation (perhaps the most common optimization).
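
On point 2: rem maps directly to the machine remainder, while mod first checks the operands' signs and adjusts the result. The two agree when both operands are positive, which is the case throughout this program, so the cheaper rem is safe here. A minimal GHCi sketch of the difference:

Prelude> 7 `mod` 3
1
Prelude> 7 `rem` 3
1
Prelude> (-7) `mod` 3    -- mod takes the sign of the divisor
2
Prelude> (-7) `rem` 3    -- rem takes the sign of the dividend
-1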

Question 4: Do my functional implementations permit LCO and hence avoid adding unnecessary frames onto the call stack?

Yes, that wasn't the issue. Good work and glad you considered this.

Solution 2

There are some problems with the Erlang implementation. As a baseline for the following, my measured execution time for your unmodified Erlang program was 47.6 seconds, compared to 12.7 seconds for the C code.

The first thing you should do if you want to run computationally intensive Erlang code is to use native code. Compiling with erlc +native euler12.erl got the time down to 41.3 seconds. This is however a much lower speedup (just 15%) than expected from native compilation on this kind of code, and the problem is your use of -compile(export_all). This is useful for experimentation, but the fact that all functions are potentially reachable from the outside causes the native compiler to be very conservative. (The normal BEAM emulator is not that much affected.) Replacing this declaration with -export([solve/0]). gives a much better speedup: 31.5 seconds (almost 35% from the baseline).

But the code itself has a problem: for each iteration in the factorCount loop, you perform this test:

factorCount (_, Sqrt, Candidate, Count) when Candidate == Sqrt -> Count + 1;

The C code doesn't do this. In general, it can be tricky to make a fair comparison between different implementations of the same code, and in particular if the algorithm is numerical, because you need to be sure that they are actually doing the same thing. A slight rounding error in one implementation due to some typecast somewhere may cause it to do many more iterations than the other even though both eventually reach the same result.

To eliminate this possible error source (and get rid of the extra test in each iteration), I rewrote the factorCount function as follows, closely modelled on the C code:

factorCount (N) ->
    Sqrt = math:sqrt (N),
    ISqrt = trunc(Sqrt),
    %% Start Count at -1 for perfect squares: the loop below would
    %% otherwise count the square root twice.
    if ISqrt == Sqrt -> factorCount (N, ISqrt, 1, -1);
       true          -> factorCount (N, ISqrt, 1, 0)
    end.

factorCount (_N, ISqrt, Candidate, Count) when Candidate > ISqrt -> Count;
factorCount ( N, ISqrt, Candidate, Count) ->
    case N rem Candidate of
        0 -> factorCount (N, ISqrt, Candidate + 1, Count + 2);
        _ -> factorCount (N, ISqrt, Candidate + 1, Count)
    end.

This rewrite, no export_all, and native compilation gave me the following run time:

$ erlc +native euler12.erl
$ time erl -noshell -s euler12 solve
842161320

real    0m19.468s
user    0m19.450s
sys 0m0.010s

which is not too bad compared to the C code:

$ time ./a.out 
842161320

real    0m12.755s
user    0m12.730s
sys 0m0.020s

Considering that Erlang is not at all geared towards writing numerical code, being only 50% slower than C on a program like this is pretty good.

Finally, regarding your questions:

Question 1: Do Erlang, Python and Haskell lose speed due to using arbitrary length integers or don't they as long as the values are less than MAXINT?

Yes, somewhat. In Erlang, there is no way of saying "use 32/64-bit arithmetic with wrap-around", so unless the compiler can prove some bounds on your integers (and it usually can't), it must check all computations to see if they can fit in a single tagged word or if it has to turn them into heap-allocated bignums. Even if no bignums are ever used in practice at runtime, these checks will have to be performed. On the other hand, that means you know that the algorithm will never fail because of an unexpected integer wraparound if you suddenly give it larger inputs than before.
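
Erlang has no fixed-width integer type to demonstrate the flip side with, but Haskell's Int versus Integer shows the same tradeoff in miniature (a GHCi sketch, GHC on a 64-bit machine):

Prelude> (maxBound :: Int) + 1    -- fixed-width Int wraps around silently
-9223372036854775808
Prelude> (9223372036854775807 :: Integer) + 1    -- Integer promotes to a bignum instead
9223372036854775808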

Question 4: Do my functional implementations permit LCO and hence avoid adding unnecessary frames onto the call stack?

Yes, your Erlang code is correct with respect to last call optimization.

Solution 3

Regarding Python optimization: in addition to using PyPy (for pretty impressive speed-ups with zero change to your code), you could use PyPy's translation toolchain to compile an RPython-compliant version, or use Cython to build an extension module. Both were faster than the C version in my testing, with the Cython module nearly twice as fast. For reference, I include C and PyPy benchmark results as well:

C (compiled with gcc -O3 -lm)

% time ./euler12-c 
842161320

./euler12-c  11.95s user 0.00s system 99% cpu 11.959 total

PyPy 1.5

% time pypy euler12.py
842161320
pypy euler12.py  16.44s user 0.01s system 99% cpu 16.449 total

RPython (using latest PyPy revision, c2f583445aee)

% time ./euler12-rpython-c
842161320
./euler12-rpy-c  10.54s user 0.00s system 99% cpu 10.540 total

Cython 0.15

% time python euler12-cython.py
842161320
python euler12-cython.py  6.27s user 0.00s system 99% cpu 6.274 total

The RPython version has a couple of key changes. To translate into a standalone program you need to define your target, which in this case is the main function. It's expected to accept sys.argv as its only argument, and is required to return an int. You can translate it using translate.py (% translate.py euler12-rpython.py), which translates it to C and compiles it for you.

# euler12-rpython.py

import math, sys

def factorCount(n):
    square = math.sqrt(n)
    isquare = int(square)
    count = -1 if isquare == square else 0
    for candidate in xrange(1, isquare + 1):
        if not n % candidate: count += 2
    return count

def main(argv):
    triangle = 1
    index = 1
    while factorCount(triangle) < 1001:
        index += 1
        triangle += index
    print triangle
    return 0

if __name__ == '__main__':
    main(sys.argv)

def target(*args):
    # entry point declaration used by the RPython translation toolchain
    return main, None

The Cython version was rewritten as an extension module _euler12.pyx, which I import and call from a normal Python file. The _euler12.pyx is essentially the same as your version, with some additional static type declarations. The setup.py has the normal boilerplate to build the extension, using python setup.py build_ext --inplace.

# _euler12.pyx
from libc.math cimport sqrt

cdef int factorCount(int n):
    cdef int candidate, isquare, count
    cdef double square
    square = sqrt(n)
    isquare = int(square)
    count = -1 if isquare == square else 0
    for candidate in range(1, isquare + 1):
        if not n % candidate: count += 2
    return count

cpdef main():
    cdef int triangle = 1, index = 1
    while factorCount(triangle) < 1001:
        index += 1
        triangle += index
    print triangle

# euler12-cython.py
import _euler12
_euler12.main()

# setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules = [Extension("_euler12", ["_euler12.pyx"])]

setup(
  name = 'Euler12-Cython',
  cmdclass = {'build_ext': build_ext},
  ext_modules = ext_modules
)

I honestly have very little experience with either RPython or Cython, and was pleasantly surprised at the results. If you are using CPython, writing your CPU-intensive bits of code in a Cython extension module seems like a really easy way to optimize your program.

Solution 4

Question 3: Can you offer me some hints how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.

The C implementation is suboptimal (as hinted at by Thomas M. DuBuisson): it uses 64-bit integers (i.e. the long datatype). I'll investigate the assembly listing later, but as an educated guess, either there are some memory accesses going on in the compiled code that make 64-bit integers significantly slower, or it's the generated code itself (whether because fewer 64-bit ints fit in an SSE register or because rounding a double to a 64-bit integer is slower).

Here is the modified code (simply replacing long with int; I also explicitly inlined factorCount, although I do not think that is necessary with gcc -O3):

#include <stdio.h>
#include <math.h>

static inline int factorCount(int n)
{
    double square = sqrt (n);
    int isquare = (int)square;
    int count = isquare == square ? -1 : 0;
    int candidate;
    for (candidate = 1; candidate <= isquare; candidate ++)
        if (0 == n % candidate) count += 2;
    return count;
}

int main ()
{
    int triangle = 1;
    int index = 1;
    while (factorCount (triangle) < 1001)
    {
        index++;
        triangle += index;
    }
    printf ("%d\n", triangle);
}

Running + timing it gives:

$ gcc -O3 -lm -o euler12 euler12.c; time ./euler12
842161320
./euler12  2.95s user 0.00s system 99% cpu 2.956 total

For reference, the Haskell implementation by Thomas in the earlier answer gives:

$ ghc -O2 -fllvm -fforce-recomp euler12.hs; time ./euler12
[1 of 1] Compiling Main             ( euler12.hs, euler12.o )
Linking euler12 ...
842161320
./euler12  9.43s user 0.13s system 99% cpu 9.602 total

Conclusion: Taking nothing away from ghc, it's a great compiler, but gcc normally generates faster code.

Solution 5

Take a look at this blog. Over the past year or so he's done a few of the Project Euler problems in Haskell and Python, and he's generally found Haskell to be much faster. I think that between those languages it has more to do with your fluency and coding style.

When it comes to Python speed, you're using the wrong implementation! Try PyPy, and for things like this you'll find it to be much, much faster.


Comments

  • Hyperboreus
    Hyperboreus over 2 years

    I have taken Problem #12 from Project Euler as a programming exercise and to compare my (surely not optimal) implementations in C, Python, Erlang and Haskell. In order to get some higher execution times, I search for the first triangle number with more than 1000 divisors instead of 500 as stated in the original problem.

    The result is the following:

    C:

    lorenzo@enzo:~/erlang$ gcc -lm -o euler12.bin euler12.c
    lorenzo@enzo:~/erlang$ time ./euler12.bin
    842161320
    
    real    0m11.074s
    user    0m11.070s
    sys 0m0.000s
    

    Python:

    lorenzo@enzo:~/erlang$ time ./euler12.py 
    842161320
    
    real    1m16.632s
    user    1m16.370s
    sys 0m0.250s
    

    Python with PyPy:

    lorenzo@enzo:~/Downloads/pypy-c-jit-43780-b590cf6de419-linux64/bin$ time ./pypy /home/lorenzo/erlang/euler12.py 
    842161320
    
    real    0m13.082s
    user    0m13.050s
    sys 0m0.020s
    

    Erlang:

    lorenzo@enzo:~/erlang$ erlc euler12.erl 
    lorenzo@enzo:~/erlang$ time erl -s euler12 solve
    Erlang R13B03 (erts-5.7.4) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]
    
    Eshell V5.7.4  (abort with ^G)
    1> 842161320
    
    real    0m48.259s
    user    0m48.070s
    sys 0m0.020s
    

    Haskell:

    lorenzo@enzo:~/erlang$ ghc euler12.hs -o euler12.hsx
    [1 of 1] Compiling Main             ( euler12.hs, euler12.o )
    Linking euler12.hsx ...
    lorenzo@enzo:~/erlang$ time ./euler12.hsx 
    842161320
    
    real    2m37.326s
    user    2m37.240s
    sys 0m0.080s
    

    Summary:

    • C: 100%
    • Python: 692% (118% with PyPy)
    • Erlang: 436% (135% thanks to RichardC)
    • Haskell: 1421%

    I suppose that C has a big advantage as it uses long for the calculations and not arbitrary-length integers like the other three. Also, it doesn't need to load a runtime first (do the others?).

    Question 1: Do Erlang, Python and Haskell lose speed due to using arbitrary length integers or don't they as long as the values are less than MAXINT?

    Question 2: Why is Haskell so slow? Is there a compiler flag that turns off the brakes or is it my implementation? (The latter is quite probable as Haskell is a book with seven seals to me.)

    Question 3: Can you offer me some hints how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.

    EDIT:

    Question 4: Do my functional implementations permit LCO (last call optimization, a.k.a tail recursion elimination) and hence avoid adding unnecessary frames onto the call stack?

    I really tried to implement the same algorithm as similarly as possible in the four languages, although I have to admit that my Haskell and Erlang knowledge is very limited.


    Source codes used:

    #include <stdio.h>
    #include <math.h>
    
    int factorCount (long n)
    {
        double square = sqrt (n);
        int isquare = (int) square;
        int count = isquare == square ? -1 : 0;
        long candidate;
        for (candidate = 1; candidate <= isquare; candidate ++)
            if (0 == n % candidate) count += 2;
        return count;
    }
    
    int main ()
    {
        long triangle = 1;
        int index = 1;
        while (factorCount (triangle) < 1001)
        {
            index ++;
            triangle += index;
        }
        printf ("%ld\n", triangle);
    }
    

    #! /usr/bin/env python3.2
    
    import math
    
    def factorCount (n):
        square = math.sqrt (n)
        isquare = int (square)
        count = -1 if isquare == square else 0
        for candidate in range (1, isquare + 1):
            if not n % candidate: count += 2
        return count
    
    triangle = 1
    index = 1
    while factorCount (triangle) < 1001:
        index += 1
        triangle += index
    
    print (triangle)
    

    -module (euler12).
    -compile (export_all).
    
    factorCount (Number) -> factorCount (Number, math:sqrt (Number), 1, 0).
    
    factorCount (_, Sqrt, Candidate, Count) when Candidate > Sqrt -> Count;
    
    factorCount (_, Sqrt, Candidate, Count) when Candidate == Sqrt -> Count + 1;
    
    factorCount (Number, Sqrt, Candidate, Count) ->
        case Number rem Candidate of
            0 -> factorCount (Number, Sqrt, Candidate + 1, Count + 2);
            _ -> factorCount (Number, Sqrt, Candidate + 1, Count)
        end.
    
    nextTriangle (Index, Triangle) ->
        Count = factorCount (Triangle),
        if
            Count > 1000 -> Triangle;
            true -> nextTriangle (Index + 1, Triangle + Index + 1)  
        end.
    
    solve () ->
        io:format ("~p~n", [nextTriangle (1, 1) ] ),
        halt (0).
    

    factorCount number = factorCount' number isquare 1 0 - (fromEnum $ square == fromIntegral isquare)
        where square = sqrt $ fromIntegral number
              isquare = floor square
    
    factorCount' number sqrt candidate count
        | fromIntegral candidate > sqrt = count
        | number `mod` candidate == 0 = factorCount' number sqrt (candidate + 1) (count + 2)
        | otherwise = factorCount' number sqrt (candidate + 1) count
    
    nextTriangle index triangle
        | factorCount triangle > 1000 = triangle
        | otherwise = nextTriangle (index + 1) (triangle + index + 1)
    
    main = print $ nextTriangle 1 1
    
  • brandizzi
    brandizzi almost 13 years
    @Hyperboreus thank you! Also, I am really curious about your next questions. However, I warn you that my knowledge is limited, so I could not answer every question of yours. To try to compensate, I made my answer community wiki so people can more easily complement it.
  • Karl Knechtel
    Karl Knechtel almost 13 years
    Why is there such a difference in speed between rem and mod?
  • Thomas M. DuBuisson
    Thomas M. DuBuisson almost 13 years
    @Karl Because rem is actually a sub-component of the mod operation (they aren't the same). If you look in the GHC Base library you see mod tests for several conditions and adjusts the sign accordingly. (see modInt# in Base.lhs)
  • C. A. McCann
    C. A. McCann almost 13 years
    Another data point: I wrote a quick Haskell translation of the C program without looking at @Hyperboreus's Haskell. So it's a bit closer to standard idiomatic Haskell, and the only optimization I added deliberately is replacing mod with rem after reading this answer (heh, oops). See the link for my timings, but the short version is "almost identical to the C".
  • Hyperboreus
    Hyperboreus almost 13 years
    Strange things are happening. On my machine (PhenomX4, kernel 2.6.38) with ghc 7.0.3 (built from source), compiling with -O2 -fllvm -fforce-recomp changes the run time from 2:37 to 2:33. When I copy and paste your optimized code the time goes up to 2:34. I cannot understand this. Another thing: how do I implement this worker/wrapper transformation in Erlang?
  • Hyperboreus
    Hyperboreus almost 13 years
    About using range. When I replace the range with a while loop with increment (mimicking the for loop of C), the execution time actually doubles. I guess generators are quite optimized.
  • Bayquiri
    Bayquiri almost 13 years
    It looks like Int is enough to get a correct answer (the C solution uses long, which is a synonym for int anyway). Also, on my machine (Core 2 Duo 2.4GHz Ubuntu x86) this solution runs in 2.7s with Int, in 18s with Integer and in 33s with Int64. @Hyperboreus, perhaps you are running the wrong executable then?
  • Seth Carnegie
    Seth Carnegie almost 13 years
    Even though the C version ran faster on my machine, I have a new respect for Haskell now. +1
  • Thomas M. DuBuisson
    Thomas M. DuBuisson almost 13 years
    @hyperboreus I suggest you join #haskell on irc.freenode.net - Stack Overflow comments are notoriously bad for working out issues that are due to different compiler / machine configurations.
  • Thomas M. DuBuisson
    Thomas M. DuBuisson almost 13 years
    @Seth That's interesting. I did try to add const to the right variables and use LLVM for the C program but failed to get any notable change in performance without changing the algorithm.
  • Muzaaya Joshua
    Muzaaya Joshua almost 13 years
    I agree with you. This benchmark was not precise, especially for Erlang, for a number of reasons.
  • RichardC
    RichardC almost 13 years
    I forgot to mention it in my post, but I did measure the time it takes just to start the system (erl -noshell -s erlang halt) - around 0.1 second on my machine. This is small enough in comparison to the run time of the program (around 10 seconds) that it's not worth quibbling about.
  • kizzx2
    kizzx2 almost 13 years
    This is quite surprising to me, though I have yet to try it. Since the original factorCount' was tail recursive, I would have thought the compiler could spot the extra parameters not being changed and optimize the tail recursion only for the changing parameters (Haskell being a pure language after all, this should be easy). Does anyone think the compiler could do that, or should I go back to reading more theory papers?
  • hammar
    hammar almost 13 years
    @kizzx2: There's a GHC ticket to have it added. From what I've understood, this transformation can result in additional allocations of closure objects. This means worse performance in some cases, but as Johan Tibell suggests in his blog post this can be avoided if the resulting wrapper can be inlined.
  • C. A. McCann
    C. A. McCann almost 13 years
    @kizzx2: As with many easy and obvious optimizations in Haskell, the difficulty is not in how to do it, the difficulty is in telling when to do it. Unreliable optimizations that sometimes make things worse are a bigger problem than no optimization at all.
  • Muzaaya Joshua
    Muzaaya Joshua almost 13 years
    On your machine! We do not know whether you are working on a Sun Fire server! Since time is proportional to the machine specs, it should be taken into consideration... quibbling?
  • Thomas M. DuBuisson
    Thomas M. DuBuisson almost 13 years
    Very nice! For comparison, on my machine your C solution runs in 2.5 seconds while a similar modification to the Haskell code (moving to Word32, adding an INLINE pragma) results in a runtime of 4.8 seconds. Perhaps something can be done (not trivially, it seems) - the gcc result is certainly impressive.
  • Admin
    Admin almost 13 years
    Thanks! Perhaps the question should be the speed of compiled output by various compilers rather than the actual language itself. Then again, pulling out the Intel manuals and optimising by hand will still win outright (provided you have the knowledge and the time (a lot of)).
  • amindfv
    amindfv over 12 years
    Why does a non-specialized polymorphic function slow the code down by a factor of 3??
  • Daniel Fischer
    Daniel Fischer over 12 years
    Thomas, actually the original code that runs isn't polymorphic, even when compiled without optimisations (with ghc-7.*; it is polymorphic without optimisations with 6.12.3 - with optimisations, it's monomorphic for all of them). It's using Integer (except for count, which is Int by fromEnum) without type signatures, since nothing except main is exported. If you force polymorphic code, that's a good deal slower than Integer (about 2× for me).
  • Thomas M. DuBuisson
    Thomas M. DuBuisson over 12 years
    @DanielFischer Good catch. In retrospect, I'm not sure what led me to think this was a specialization issue (which makes no sense - as my confused comment said). I'm surprised it took so long for someone to correct me!
  • Exception
    Exception over 10 years
    @RichardC Nowhere mentioned that Erlang is faster :) It has different goals, not speed!
  • Mark Washeim
    Mark Washeim almost 10 years
    Increasing the value to 1000 as below does not obtain the correct result. With > 500 as above, newest test: Intel Core2 CPU 6600 @ 2.40GHz completes in real 0m2.370s.
  • Thomas M. DuBuisson
    Thomas M. DuBuisson over 9 years
    This is rather explicitly not what the asker requested, "I really tried to implement the same algorithm as similar as possible in the four languages". To quote a comment on one of the many deleted answers similar to yours "it's pretty obvious you can get faster speeds with a better algorithm regardless of language."
  • user3125280
    user3125280 over 9 years
    @ThomasM.DuBuisson. That's what I'm getting at. The question/answers heavily imply that algorithmic speed-ups are significant (and of course the OP isn't asking for them), but there is no explicit example. I think this answer - which isn't exactly heavily optimised code - provides a little useful context for anyone, like myself, who wondered just how slow/fast the OP's code was.
  • davidDavidson
    davidDavidson over 9 years
    Your result: 76576500; everyone else: 842161320. There is something wrong with your result.
  • Alexey Shmalko
    Alexey Shmalko about 9 years
    Changing long to int in the original C program makes it run twice as fast (producing correct results). Unfortunately, Haskell isn't faster than C in this case.
  • Thomas M. DuBuisson
    Thomas M. DuBuisson about 9 years
    @AlexeyShmalko Sure, you can continue to optimize whichever language's implementation and compiler. See Raedwulf's answer for example.
  • zxq9
    zxq9 about 9 years
    Last I checked Haskell "primes" was just a huge list of precomputed primes -- no computation, just lookup. So yes, of course this will be faster, but it tells you nothing about the computational speed of deriving primes in Haskell.
  • Fraser
    Fraser about 9 years
    @zxq9 could you point out to me where in the primes package source (hackage.haskell.org/package/primes-0.2.1.0/docs/src/…) the list of prime numbers is located?
  • Display Name
    Display Name almost 9 years
    I'm curious, can the C version be optimized to be at least as fast as CPython?
  • Mark Washeim
    Mark Washeim over 8 years
    Since I was doing some other Euler problems, I just checked my result. The answer to projecteuler.net/problem=12 is 76576500, no question about it. I know it seems odd, but I just checked.
  • semicolon
    semicolon over 8 years
    Whilst the source shows the primes are not precomputed, this speed up is utterly insane, miles faster than the C version, so what the heck is going on?
  • Arthur
    Arthur about 8 years
    gcc can even pre-compute a lot of patterns. int a = 0; for(int i=0;i<10000000;++i) { a += i; } will be computed at compile time, so it takes < 1ms at runtime. That counts too.
  • Hauleth
    Hauleth about 8 years
    @semicolon Memoization. In this case I think that Haskell memoized all primes at runtime so it doesn't need to recompute them each iteration.
  • semicolon
    semicolon about 8 years
    @hauleth What do you mean by each iteration? Does measure-command run the executable multiple times for accuracy? That seems expensive seeing as one of the executables takes multiple seconds per "iteration".
  • Hauleth
    Hauleth about 8 years
    No, I meant map iteration. Haskell, even if it has no loops, still allows iteration, expressed as recursive calls. In this solution we have a call to primeFactors, which probably calls something like primes underneath. That function probably returns a list of primes which can be memoized, so the next call computes only the missing top of the list, not everything from the bottom up like the C code does.
  • DarthGizka
    DarthGizka about 8 years
    @Thomas: I must agree with user3125280 - the languages should be compared on how they fare doing something smart instead of how they fail to beat a real programming language at doing something dumb. Smart algorithms usually care less about microscopic efficiencies than about flexibility, the ability to wire things up (combine them), and infrastructure. The point is not so much whether one gets 20 ms or 50 ms; it is about not getting 8 seconds or 8 minutes.
  • kirbyfan64sos
    kirbyfan64sos about 8 years
    Out of curiosity, how do the Intel compilers do? They're usually really good at optimizing numerical code.
  • Casper Færgemand
    Casper Færgemand over 7 years
    It's 1000 divisors, not 500.
  • thanos
    thanos over 7 years
    Just for comparison: original C version: 9.1690, thaumkid's version: 0.1060. An 86x improvement.
  • thanos
    thanos over 7 years
    For comparison, I get 9.03 with the original C version, while using Erlang 19 with Mark's code I get 5.406, 167.0366% faster.
  • Novice C
    Novice C over 6 years
    This thread is a classic straw man. We are presented with several unoptimized programs, and everyone takes the code from their camp, optimizes it, and declares theirs the fastest language out there. The top answer falsely claims Haskell is the fastest, the next laughably says Python is, and then someone gives actual optimized C code that beats the top answer's Haskell. There is even a comment which says Mathematica is the fastest...
  • Thomas M. DuBuisson
    Thomas M. DuBuisson over 6 years
    You raise a great point - no one should ever read too much into a microbenchmark or for that matter read too much into the value of a language based on its performance. Above I only claim the resulting Haskell ran faster than the asker's provided C implementation. I also commented on the optimized C implementation mentioning I could not produce similarly performing Haskell at the time and commented on a Mathematica solution that was using a different algorithm - which simultaneously went against the grain of the question while also emphasizing exactly the answer's point.
  • ShadowRanger
    ShadowRanger over 6 years
    Where do you find a reference saying the C spec guarantees "int is at least 32 bits"? The only guarantees I know of are the minimum magnitudes of INT_MIN and INT_MAX (-32767 and 32767, which practically impose a requirement that int be at least 16 bits). long is required to be at least as large as an int, and the range requirements mean long is at least 32 bits.
  • Eli Korvigo
    Eli Korvigo about 6 years
    @SargeBorsch that Cython version is so fast because it compiles down to highly optimised C source, which means you sure can get that performance out of C.
  • matbrgz
    matbrgz almost 6 years
    @NoviceC So there is room for an answer which compares all the highly optimized versions?
  • Piyush Katariya
    Piyush Katariya almost 6 years
    Wow. Haskell performs great once you avoid inferred types
  • codeshot
    codeshot almost 6 years
    I tried with 500 and 1000 divisors: 50ms and 240ms respectively. The earlier Haskell answer got me 5.4s, as did the earlier C answer
  • codeshot
    codeshot almost 6 years
    Actually it's not the inference that did it. That only helps you A) debug or avoid type problems and typeclass instance selection problems B) debug and avoid some undecidable type problems with a few modern language extensions. It also helps you to make your programs uncomposable so you can never scale up your development efforts.
  • codeshot
    codeshot almost 6 years
    C version: 0.11 s on an Intel Skull Canyon.
  • Big Temp
    Big Temp about 4 years
    @ThorbjørnRavnAndersen Not really. You cannot compare speed of execution of different code in different languages compiled by different tools and expect something meaningful to come out of it. If someone was assblasted enough, they would just write the most optimal C they could, maybe throw in #pragma omp parallel for reduction, sprinkle a bit of inline assembly on top and compile it with ICC for their Intel machine effectively ending the discussion because nothing from the list could ever compete with that.
  • vonbrand
    vonbrand almost 3 years
    The link to the blog is dead.