Why does JavaScript appear to be 4 times faster than C++?
Solution 1
I may have some bad news for you if you're on a Linux system (which complies with POSIX, at least in this situation). The clock() call returns the number of clock ticks consumed by the program, scaled by CLOCKS_PER_SEC, which is 1,000,000.
That means, if you're on such a system, you're talking in microseconds for C and milliseconds for JavaScript (as per the JS online docs). So, rather than JS being four times faster, C++ is actually 250 times faster.
Now it may be that you're on a system where CLOCKS_PER_SEC is something other than a million. You can run the following program on your system to see if it's scaled by the same value:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#define MILLION * 1000000
static void commaOut (int n, char c) {
if (n < 1000) {
printf ("%d%c", n, c);
return;
}
commaOut (n / 1000, ',');
printf ("%03d%c", n % 1000, c);
}
int main (int argc, char *argv[]) {
int i;
system("date");
clock_t start = clock();
clock_t end = start;
while (end - start < 30 MILLION) {
for (i = 10 MILLION; i > 0; i--) {};
end = clock();
}
system("date");
commaOut (end - start, '\n');
return 0;
}
The output on my box is:
Tuesday 17 November 11:53:01 AWST 2015
Tuesday 17 November 11:53:31 AWST 2015
30,001,946
showing that the scaling factor is a million. If you run that program, or investigate CLOCKS_PER_SEC, and it's not a scaling factor of one million, you need to look at some other things.
The first step is to ensure your code is actually being optimised by the compiler. That means, for example, setting -O2 or -O3 for gcc.
On my system with unoptimised code, I see:
Time Cost: 320ms
Time Cost: 300ms
Time Cost: 300ms
Time Cost: 300ms
Time Cost: 300ms
Time Cost: 300ms
Time Cost: 300ms
Time Cost: 300ms
Time Cost: 300ms
Time Cost: 300ms
a = 2717999973.760710
and it's three times faster with -O2, albeit with a slightly different answer, though only by about one millionth of a percent:
Time Cost: 140ms
Time Cost: 110ms
Time Cost: 100ms
Time Cost: 100ms
Time Cost: 100ms
Time Cost: 100ms
Time Cost: 100ms
Time Cost: 100ms
Time Cost: 100ms
Time Cost: 100ms
a = 2718000003.159864
That would bring the two situations back on par with each other, something I'd expect since JavaScript is not some interpreted beast like in the old days, where each token is interpreted whenever it's seen.
Modern JavaScript engines (V8, Rhino, etc) can compile the code to an intermediate form (or even to machine language) which may allow performance roughly equal with compiled languages like C.
But, to be honest, you don't tend to choose JavaScript or C++ for their speed; you choose them for their areas of strength. There aren't many C compilers floating around inside browsers, and I've not noticed many operating systems or embedded apps written in JavaScript.
Solution 2
Doing a quick test with turning on optimization, I got results of about 150 ms for an ancient AMD 64 X2 processor, and about 90 ms for a reasonably recent Intel i7 processor.
Then I did a little more to give some idea of one reason you might want to use C++. I unrolled four iterations of the loop, to get this:
#include <stdio.h>
#include <ctime>
int main() {
double a = 3.1415926, b = 2.718;
double c = 0.0, d=0.0, e=0.0;
int i, j;
clock_t start, end;
for(j=0; j<10; j++) {
start = clock();
for(i=0; i<100000000; i+=4) {
a += b;
c += b;
d += b;
e += b;
}
a += c + d + e;
c = d = e = 0.0; // reset so later passes don't fold the same partial sums into a again
end = clock();
printf("Time Cost: %fms\n", (1000.0 * (end - start))/CLOCKS_PER_SEC);
}
printf("a = %lf\n", a);
return 0;
}
This let the C++ code run in about 44ms on the AMD (forgot to run this version on the Intel). Then I turned on the compiler's auto-vectorizer (-Qpar with VC++). This reduced the time a little further still, to about 40 ms on the AMD, and 30 ms on the Intel.
Bottom line: if you want to use C++, you really need to learn how to use the compiler. If you want to get really good results, you probably also want to learn how to write better code.
I should add: I didn't attempt to test a version under Javascript with the loop unrolled. Doing so might provide a similar (or at least some) speed improvement in JS as well. Personally, I think making the code fast is a lot more interesting than comparing Javascript to C++.
If you want code like this to run fast, unroll the loop (at least in C++).
Since the subject of parallel computing arose, I thought I'd add another version using OpenMP. While I was at it, I cleaned up the code a little bit, so I could keep track of what was going on. I also changed the timing code a bit, to display the overall time instead of the time for each execution of the inner loop. The resulting code looked like this:
#include <stdio.h>
#include <ctime>
int main() {
double total = 0.0;
double inc = 2.718;
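int j;
clock_t start, end;
start = clock();
#pragma omp parallel for reduction(+:total) firstprivate(inc)
for(j=0; j<10; j++) {
double a=0.0, b=0.0, c=0.0, d=0.0;
// i is declared here so each thread gets its own copy; declared outside the parallel loop it would be shared between threads
for(int i=0; i<100000000; i+=4) {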
a += inc;
b += inc;
c += inc;
d += inc;
}
total += a + b + c + d;
}
end = clock();
printf("Time Cost: %fms\n", (1000.0 * (end - start))/CLOCKS_PER_SEC);
printf("a = %lf\n", total);
return 0;
}
The primary addition here is the following (admittedly somewhat arcane) line:
#pragma omp parallel for reduction(+:total) firstprivate(inc)
This tells the compiler to execute the outer loop in multiple threads, with a separate copy of inc for each thread, and adding together the individual values of total after the parallel section.
The result is about what you'd probably expect. If we don't enable OpenMP with the compiler's -openmp flag, the reported time is about 10 times what we saw for individual executions previously (409 ms for the AMD, 323 ms for the Intel). With OpenMP turned on, the times drop to 217 ms for the AMD, and 100 ms for the Intel.
So, on the Intel the original version took 90ms for one iteration of the outer loop. With this version we're getting just slightly longer (100 ms) for all 10 iterations of the outer loop -- an improvement in speed of about 9:1. On a machine with more cores, we could expect even more improvement (OpenMP will normally take advantage of all available cores automatically, though you can manually tune the number of threads if you want).
Solution 3
Even if the post is old, I think it may be interesting to add some information. In summary, your test is too vague and may be biased.
A bit about speed testing methodology
When comparing speed of two languages, you first have to define precisely in which context you want to compare how they perform.
- "naive" vs "optimized" code: whether the code tested is written by a beginner or an expert programmer. This parameter matters depending on who will participate in your project. For example, when working with scientists (non-geeky ones), you will look more for "naive" code performance, because scientists aren't necessarily good programmers.
- authorized compile time: whether or not you allow the code to take a long time to build. This parameter can matter depending on your project management methodology. If you need to do automated tests, trading a bit of speed for a shorter compile time may be worthwhile. On the other hand, you can consider that a distribution version allows for a long build time.
- Platform portability: whether speed is to be compared on one platform or several (Windows, Linux, PS4...)
- Compiler/interpreter portability: whether your code's speed should be compiler/interpreter independent or not. Can be useful for multiplatform and/or open source projects.
- Other specialized parameters, for example whether you allow dynamic allocations in your code, whether you want to enable plugins (libraries loaded dynamically at runtime), etc.
Then, you have to make sure that your code is representative of what you want to test
Here (I assume you didn't compile the C++ with optimization flags), you are testing the fast-compile speed of "naive" (not so naive, actually) code. Because your loop is fixed size, with fixed data, you don't test dynamic allocations, and you (supposedly) allow code transformations (more on that in the next section). And effectively, JavaScript usually performs better than C++ in this case, because JavaScript optimizes at compile time by default, while C++ compilers need to be told to optimize.
A quick overview of C++ speed increase with parameters
Because I am not knowledgeable enough about JavaScript, I'll only show how code optimization and compilation type can change C++ speed on a fixed for loop, hoping it will answer the question of how JS can appear to be faster than C++.
For that, let's use Matt Godbolt's C++ Compiler Explorer to see the assembly code generated by gcc 9.2.
Non optimized code
float func(){
float a(0.0);
float b(2.71);
for (int i = 0; i < 100000; ++i){
a = a + b;
}
return a;
}
Compiled with gcc 9.2, flag -O0, this produces the following assembly code:
func():
pushq %rbp
movq %rsp, %rbp
pxor %xmm0, %xmm0
movss %xmm0, -4(%rbp)
movss .LC1(%rip), %xmm0
movss %xmm0, -12(%rbp)
movl $0, -8(%rbp)
.L3:
cmpl $99999, -8(%rbp)
jg .L2
movss -4(%rbp), %xmm0
addss -12(%rbp), %xmm0
movss %xmm0, -4(%rbp)
addl $1, -8(%rbp)
jmp .L3
.L2:
movss -4(%rbp), %xmm0
popq %rbp
ret
.LC1:
.long 1076719780
The code for the loop is what is between ".L3" and ".L2". In brief, we can see that the code created here is not optimized at all: a lot of memory accesses are made (no proper use of registers), and because of this there are a lot of wasted operations storing and reloading the result.
This introduces an extra 5 or 6 cycles of store-forwarding latency into the critical path dependency chain of FP addition into a, on modern x86 CPUs. This is on top of the 4 or 5 cycle latency of addss, making the function more than twice as slow.
compiler optimization
The same C++ compiled with gcc 9.2, flag -O3, produces the following assembly code:
func():
movss .LC1(%rip), %xmm1
movl $100000, %eax
pxor %xmm0, %xmm0
.L2:
addss %xmm1, %xmm0
subl $1, %eax
jne .L2
ret
.LC1:
.long 1076719780
The code is much more concise, and uses registers as much as possible.
code optimization
A compiler usually optimizes code very well, especially C++, provided that the code clearly expresses what the programmer wants to achieve. Here we want a fixed mathematical expression to be as fast as possible, so let's change the code a bit.
constexpr float func(){
float a(0.0);
float b(2.71);
for (int i = 0; i < 100000; ++i){
a = a + b;
}
return a;
}
float call() {
return func();
}
We added constexpr to the function to tell the compiler to try to compute its result at compile time, and added a calling function to be sure that some code is generated.
Compiled with gcc 9.2, -O3, this leads to the following assembly code:
call():
movss .LC0(%rip), %xmm0
ret
.LC0:
.long 1216623031
The asm code is short, since the value returned by func has been computed at compile time, and call simply returns it.
Of course, a = b * 100000 would always compile to efficient asm, so only write the repeated-add loop if you need to explore FP rounding error over all those temporaries.
Solution 4
This is a polarizing topic, so one may have a look at:
https://benchmarksgame-team.pages.debian.net/benchmarksgame/
Benchmarking all kinds of languages.
JavaScript engines such as V8 surely do a good job on simple loops like the one in the example, probably generating very similar machine code. For most "close to the user" applications JavaScript surely is the better choice, but keep in mind the memory waste and the often unavoidable performance hit (and lack of control) for more complicated algorithms/applications.
streaver91
Updated on December 14, 2021
Comments
-
streaver91 over 2 years
For a long time, I had thought of C++ being faster than JavaScript. However, today I made a benchmark script to compare the speed of floating point calculations in the two languages and the result is amazing!
JavaScript appears to be almost 4 times faster than C++!
I let both of the languages do the same job on my i5-430M laptop, performing a = a + b 100000000 times. C++ takes about 410 ms, while JavaScript takes only about 120 ms.
I really do not have any idea why JavaScript runs so fast in this case. Can anyone explain that?
The code I used for the JavaScript is (run with Node.js):
(function() {
    var a = 3.1415926, b = 2.718;
    var i, j, d1, d2;
    for(j=0; j<10; j++) {
        d1 = new Date();
        for(i=0; i<100000000; i++) {
            a = a + b;
        }
        d2 = new Date();
        console.log("Time Cost:" + (d2.getTime() - d1.getTime()) + "ms");
    }
    console.log("a = " + a);
})();
And the code for C++ (compiled by g++) is:
#include <stdio.h>
#include <ctime>
int main() {
    double a = 3.1415926, b = 2.718;
    int i, j;
    clock_t start, end;
    for(j=0; j<10; j++) {
        start = clock();
        for(i=0; i<100000000; i++) {
            a = a + b;
        }
        end = clock();
        printf("Time Cost: %dms\n", (end - start) * 1000 / CLOCKS_PER_SEC);
    }
    printf("a = %lf\n", a);
    return 0;
}
-
streaver91 almost 11 years
I think that is not the case; 400 ms is something that is easy to feel. The output appears really slower than JavaScript.
-
paxdiablo almost 11 years
@user2189264, it probably takes that amount of time to start up and tear down the process. I'll update to show the "proof".
-
streaver91 almost 11 years
I mean the time cost of my original script is printed to the screen after each big loop (10 big loops altogether). And the time of each loop is 400 ms for C++, 100 ms for JavaScript, and these are long enough for me to feel the difference.
-
paxdiablo almost 11 years
@user2189264, don't feel, measure! Feeling may be good to start a hypothesis but it's no good in evaluating it :-) In any case, printing times outside of the program being called includes stuff outside of what you're measuring (such as the afore-mentioned process startup/shutdown).
-
paxdiablo almost 11 years
@user2189264, if you really want to use that method of yours, bump up the loop by a factor of 100 so that "feeling" and "measuring" start to converge. Unless your C program then takes 40 seconds to run, you can put it down to the difficulty of measuring very small times.
-
streaver91 almost 11 years
I bet the clock returns milliseconds on my computer. I changed the inner loop to 1000000000, ten times more than the original value. The time taken for each inner loop is 4 seconds, and the output of each inner loop is about 4100ms. So either the print method takes 4 seconds and the loop takes 4.1ms, or the loop really takes 4100ms. The print method cannot take such a long time, as in the previous case each loop, including printing, takes only about half a second.
-
paxdiablo almost 11 years
I suggest you go and check out what value your implementation has for CLOCKS_PER_SEC. That will be the definitive answer. And let us know which OS you're using.
-
streaver91 almost 11 years
@paxdiablo, I don't mean to doubt you. But this time, I printed out the constant value CLOCKS_PER_SEC, and it is 1000. Maybe we used different platforms.
-
Jerry Coffin almost 11 years
@user2189264: Yes and no -- it's still executing in a single core. With a little more work (some openMP directives, for example) we could have it execute on multiple cores as well, effectively multiplying the speed again. All I've done so far though is let it make better use of the resources on a single core (exposed instruction level parallelism, not thread-level parallelism).
-
paxdiablo almost 11 years
That's fine, @user2189264, it just means that your platform isn't POSIX, whatever it is. I'll update the answer.
-
Matt almost 11 years
@user2189264: sigh... if you have access to C++11, just use <chrono> -- solarianprogrammer.com/2012/10/14/… -- no reason to use CLOCKS_PER_SEC-dependent measurement (esp. if that dependence is not taken into account when comparing...).
-
Peter Cordes almost 6 years
gcc6 and later notice that they can CSE d and e out of the loop, and compute c+d+e as c + c + c. godbolt.org/g/1BLDfX (Or with FMA, as fma(c, 2.0, c) = c*2.0 + c. If that's legal, then c*3.0 also would be legal...) Anyway, with only two addsd in the loop (for a and c), it becomes more hyperthreading-friendly on CPUs where addsd has a latency:throughput ratio above 2. (e.g. 3:1 on Sandybridge, 4:0.5 on Skylake). And BTW, clang auto-vectorizes with 128-bit vectors. I think it might be doing the same thing, a separate from the 3 that start as 0.0.
-
Peter Cordes almost 6 years
And BTW, Athlon X2 is a K10 core, I think. Or maybe K8, either way addsd latency = 4, throughput = 1 per clock, so 4 accumulators is just barely enough to hide FP add latency.
-
Jerry Coffin almost 6 years
@PeterCordes: Yeah--I believe if you wanted to get better performance on a modern Intel, you'd want to unroll more iterations of the inner loop (around 8 or so, if memory serves). Slightly painful, but should roughly double speed.
-
Peter Cordes almost 6 years
Turns out that just-barely-enough FP accumulators is still somewhat slower than even more, at least when data is coming from memory. (See: Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables?) So I guess uop scheduling doesn't do a perfect job when there are that many parallel dep chains. Probably with this case where there are never any cache misses, just registers, scheduling would do better. Anyway yeah, stuff like this is a major reason why AVX512 doubled the number of architectural vector registers.
-
Jerry Coffin almost 6 years
@PeterCordes: I suppose if I'd been ambitious enough to bother, I'd have tried it with half a dozen or so, but when I'd already improved speed by ~9x, I probably didn't figure it was worth spending a lot more time and effort on further improvement.
-
Peter Cordes almost 6 years
@JerryCoffin: Yeah, this specific case doesn't need more testing; the general idea is enough: unroll reductions until you bottleneck on FP throughput, not latency, for any loops that are actually hot. (And consult agner.org/optimize to find out what latency:throughput ratio other CPUs have, if you're using one with less FP-add throughput than others, e.g. Intel pre-Skylake.)
-
Jerry Coffin almost 6 years
@PeterCordes: On the other hand, unrolling 4 more iterations isn't all that much work. A quick test with 8 iterations of the inner loop unrolled shows them running at around 49 ms apiece (on a Haswell). For better or worse, the Athlon X2 is long gone (closest I have is a Steamroller, which also seems to benefit at least a little from 8x unrolling of the inner loop). :-)
-
Peter Cordes almost 6 years
@JerryCoffin: Steamroller's ADDSD throughput is 1 per clock, latency is 5 cycles, up from 4 in K10. (Or 6 cycles if the input isn't coming from an FP add / sub / mul (FMA unit). Most CPUs just have bypass delays between integer/FP, but Bulldozer-family has a special fast path for forwarding within the FMA domain.)
-
Carl Smith over 5 years
How is anyone "REKT"? The question was about why JS appeared to be faster than C++ (clearly implying that it shouldn't be). This answer explains the most probable reason. The community already has a reputation for mocking people for asking questions, which is pretty embarrassing for a community driven Q&A website.
-
Blake over 5 years
Why do you say — For most "close to the user" applications Javscript surely is the better choice ?
-
carbolymer over 5 years
This site is not reliable. For example, none of the Java benchmarks use JMH, so they're essentially benchmarking the JVM, not the test scenarios.
-
Peter Cordes over 4 years
CPUs have cache and store-forwarding. Store/reload inside a loop only adds about 5 or 6 cycles of latency, not 1000x slower.
-
Peter Cordes over 4 years
This has some useful points about enabling optimization, and getting compilers to optimize away loops, but see the top answer on this question: the OP got clock() wrong and was comparing ms vs. us, and C++ was actually 250 times faster with the JS and C++ implementations they tested on.
-
Felix Bertoni over 4 years
@PeterCordes Thanks for the edit. I compared to RAM to emphasize the speed difference (since on truly random access the cache has little chance of performing well), but it led to a big misunderstanding. Better to leave it the way you edited it, in my opinion. EDIT: I had not seen how many spelling mistakes I made (I didn't have time to reread when posting), so thank you even more for the correction.
-
Peter Cordes over 4 years
Yes cache misses hurt, but the extra reloads or spill/reloads introduced by -O0 will always be to objects that you already just touched, or to the stack. It's normally safe to assume the stack is hot in cache because of call/ret. The initial access to an object already happens in optimized code, just further access is avoided. It mis-characterizes the cost of -O0 (see: Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?) to make any claims about going to DRAM, except in the rare case of cache conflict-misses.
-
Felix Bertoni over 4 years
@PeterCordes You are definitely right! I've learnt something (actually pretty evident) that I didn't realize until today!
-
teg_brightly over 4 years
Using Intel Q6600 this test showed 120ms for C++ and 1300ms for JavaScript. The test in the original version showed around 380ms for both C++ and JavaScript.
-
Peter Cordes about 4 years
An increment loop doesn't prove anything in general about a language. It's possible to make slow native code, as the OP proved by compiling with optimization disabled. But anyway, neither C++ nor JavaScript native number types can reach 1e100 (1 googol) incrementing by 1. (Google is a company, not a number.) double precision floating point (i.e. a JS number) can represent values as high as about 1.8e308, but 9,007,199,254,740,992 + 1 rounds back to the same number so you'd get stuck there. (i.e. 1 unit in the last place of the mantissa is 2 there)
-
Peter Cordes over 2 years
For trading compile time vs. optimization, with automated tests you'd often want to use -O1 or -Og to compile quickly, not much slower than a low-effort -O0 build, but do basic things like register allocation. (But not inlining). Integration / unit tests don't necessarily need to use the same build options as full release builds.
-
Felix Bertoni over 2 years
@PeterCordes My point, maybe unclear, was the context of the speed requirements: do we only test speed regarding a release build, with virtually infinite compilation time, or do we still need to be fast when compiling during the build process, or even debugging? I wasn't necessarily talking about unit/integration tests, but more about system tests. Still, some languages allow constructs (like C++ templates, which are Turing complete) eventually leading to higher compile times, regardless of compilation optimization. Some languages are harder to parse/compile than others, and so on...
-
Felix Bertoni over 2 years
Please note that most interpreters, including JS' V8, feature JIT compilers, allowing them to translate part of the interpreted code into bytecode, or even native code. Native code produced by JITs can rival native code produced by "traditional" compilers, especially in the case of simple constructs such as a for loop.
-
Peter Cordes over 2 years
Right, sure, most cases where it's "build once, run once" take less total CPU time with -Og than -O3 -flto. Possible exceptions include small programs that do extensive number-crunching. Spending lots of compile time makes sense for true release builds where it's "build once, run many". Same for video encoding: spend more CPU time to save bits at the same quality if that will be amortized over many downloads or keeping the file around on storage long-term.