Why is Haskell (GHC) so darn fast?


Solution 1

I agree with Dietrich Epp: it's a combination of several things that make GHC fast.

First and foremost, Haskell is very high-level. This enables the compiler to perform aggressive optimisations without breaking your code.

Think about SQL. Now, when I write a SELECT statement, it might look like an imperative loop, but it isn't. It might look like it loops over all rows in that table trying to find the one that matches the specified conditions, but actually the "compiler" (the DB engine) could be doing an index lookup instead — which has completely different performance characteristics. But because SQL is so high-level, the "compiler" can substitute totally different algorithms, apply multiple processors or I/O channels or entire servers transparently, and more.

I think of Haskell as being the same. You might think you just asked Haskell to map the input list to a second list, filter the second list into a third list, and then count how many items resulted. But you didn't see GHC apply stream-fusion rewrite rules behind the scenes, transforming the entire thing into a single tight machine code loop that does the whole job in a single pass over the data with no allocation — the kind of thing that would be tedious, error-prone and non-maintainable to write by hand. That's only really possible because of the lack of low-level details in the code.
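To make that concrete, here is a small made-up pipeline of exactly that shape (the function name and the numbers are just for illustration):

    -- Reads like three passes over the data (map, then filter, then count),
    -- but with optimisations on, GHC's fusion rewrite rules can collapse it
    -- into a single traversal of the input list.
    countBig :: [Int] -> Int
    countBig = length . filter (> 100) . map (* 2)

Nothing in the source says "do this in one pass"; it's the high-level formulation that leaves the compiler free to choose that implementation.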

Another way to look at it might be… why shouldn't Haskell be fast? What does it do that should make it slow?

It's not an interpreted language like Perl or JavaScript. It's not even a virtual machine system like Java or C#. It compiles all the way down to native machine code, so no overhead there.

Unlike OO languages [Java, C#, JavaScript…], Haskell has full type erasure [like C, C++, Pascal…]. All type checking happens at compile-time only. So there's no run-time type-checking to slow you down either. (No null-pointer checks, for that matter. In, say, Java, the JVM must check for null pointers and throw an exception if you dereference one. Haskell doesn't have to bother with that check.)

You say it sounds slow to "create functions on the fly at run-time", but if you look very carefully, you don't actually do that. It might look like you do, but you don't. If you say (+5), well, that's hard-coded into your source code. It cannot change at run-time. So it's not really a dynamic function. Even curried functions are really just saving parameters into a data block. All the executable code actually exists at compile-time; there is no run-time interpretation. (Unlike some other languages that have an "eval function".)
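As a small illustration (the names here are invented), partial application just pairs code that already exists in the executable with a stored argument:

    -- (+ 5) is not code generated at run-time: the code for (+) is fixed at
    -- compile time, and the closure merely records the captured argument 5.
    addFive :: Int -> Int
    addFive = (+ 5)

    applyTwice :: (a -> a) -> a -> a
    applyTwice f x = f (f x)

    main :: IO ()
    main = print (applyTwice addFive 1)   -- prints 11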

Think about Pascal. It's old and nobody really uses it any more, but nobody would complain that Pascal is slow. There are plenty of things to dislike about it, but slowness is not really one of them. Haskell isn't really doing that much that's different to Pascal, other than having garbage collection rather than manual memory management. And immutable data allows several optimisations to the GC engine [which lazy evaluation then complicates somewhat].

I think the thing is that Haskell looks advanced and sophisticated and high-level, and everybody thinks "oh wow, this is really powerful, it must be amazingly slow!" But it isn't. Or at least, it isn't in the way you'd expect. Yes, it's got an amazing type system. But you know what? That all happens at compile-time. By run-time, it's gone. Yes, it allows you to construct complicated ADTs with a line of code. But you know what? An ADT is just a plain ordinary C union of structs. Nothing more.
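For instance, here is a toy ADT (not from the question); at run time each constructor is just a small heap record, a tag plus its fields, much like a tagged union of structs in C:

    data Shape
      = Circle Double          -- tag + one field
      | Rect   Double Double   -- tag + two fields

    area :: Shape -> Double
    area (Circle r) = pi * r * r
    area (Rect w h) = w * h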

The real killer is lazy evaluation. When you get the strictness / laziness of your code right, you can write stupidly fast code that is still elegant and beautiful. But if you get this stuff wrong, your program goes thousands of times slower, and it's really non-obvious why this is happening.

For example, I wrote a trivial little program to count how many times each byte appears in a file. For a 25KB input file, the program took 20 minutes to run and swallowed 6 gigabytes of RAM! That's absurd!! But then I realized what the problem was, added a single bang-pattern, and the run-time dropped to 0.02 seconds.
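I no longer have that program, but a sketch of the same kind of trap (and the fix) might look like the following; the file name and the choice of Data.ByteString / Data.Map.Strict are just one way to write it, not the original code:

    {-# LANGUAGE BangPatterns #-}
    import           Data.Word            (Word8)
    import qualified Data.ByteString      as B
    import qualified Data.Map.Strict      as M

    -- A lazy left fold would build a chain of unevaluated (+1) thunks, one per
    -- byte; a strict fold plus a strict map (and the bang on the accumulator)
    -- forces each count as it goes, keeping memory use small and constant.
    countBytes :: B.ByteString -> M.Map Word8 Int
    countBytes = B.foldl' step M.empty
      where
        step !m b = M.insertWith (+) b 1 m

    main :: IO ()
    main = B.readFile "input.dat" >>= print . countBytes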

This is where Haskell goes unexpectedly slowly. And it sure takes a while to get used to it. But over time, it gets easier to write really fast code.

What makes Haskell so fast? Purity. Static types. Laziness. But above all, being sufficiently high-level that the compiler can radically change the implementation without breaking your code's expectations.

But I guess that's just my opinion...

Solution 2

For a long time it was thought that functional languages couldn't be fast -- and especially lazy functional languages. But this was because their early implementations were, in essence, interpreted and not genuinely compiled.

A second wave of designs emerged based on graph reduction, which opened up the possibility of much more efficient compilation. Simon Peyton Jones wrote about this research in his two books The Implementation of Functional Programming Languages and Implementing Functional Languages: A Tutorial (the former with sections by Wadler and Hancock, the latter written with David Lester). (Lennart Augustsson also informed me that one key motivation for the former book was to describe how his LML compiler, which was not extensively commented, accomplished its compilation.)

The key notion behind graph reduction approaches such as those described in these works is that we do not think of a program as a sequence of instructions, but as a dependency graph which is evaluated through a series of local reductions. The second key insight is that the evaluation of such a graph need not be interpreted; instead, the graph itself can be built out of code. In particular, we can represent a node of the graph not as "either a value, or an 'opcode' plus the values to operate on", but as a function that, when invoked, returns the desired value. The first time it is invoked, it asks the subnodes for their values, operates on them, and then overwrites itself with a new instruction that just says "return the result."
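A minimal sketch of that idea can be written in Haskell itself using a mutable cell. This is not GHC's actual representation (the real STG machine uses closures and info tables), just an illustration of a node that overwrites itself after its first evaluation:

    import Data.IORef

    -- A graph node is either an already-computed value or a suspended
    -- computation. Forcing a node runs the suspension once, then overwrites
    -- the cell with the result, so every later force is just a lookup.
    data Node a = Done a | Todo (IO a)

    newtype Thunk a = Thunk (IORef (Node a))

    delay :: IO a -> IO (Thunk a)
    delay act = Thunk <$> newIORef (Todo act)

    force :: Thunk a -> IO a
    force (Thunk ref) = do
      node <- readIORef ref
      case node of
        Done v   -> pure v
        Todo act -> do
          v <- act
          writeIORef ref (Done v)   -- self-update: from now on, "return the result"
          pure v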

This is described in a later paper that lays out the basics of how GHC still works today (modulo many tweaks): "Implementing Lazy Functional Languages on Stock Hardware: The Spineless Tagless G-Machine". The current execution model for GHC is documented in more detail on the GHC Wiki.

So the insight is that the strict distinction between "data" and "code", which we think of as "fundamental" to how machines work, is not how they must work, but is imposed by our compilers. So we can throw that out, and have code (a compiler) that generates self-modifying code (the executable), and it can all work quite nicely.

Thus it turns out that while machine architectures are imperative in a certain sense, languages may map to them in very surprising ways that don't look like conventional C-style flow control, and if we think low-level enough, this may also be efficient.

On top of this there are many other optimizations opened up by purity in particular, as it allows a greater range of "safe" transformations. When and how to apply these transformations such that they make things better and not worse is of course an empirical question, and on this and many other small choices, years of work have been put into both theoretical work and practical benchmarking. So this of course plays a role as well. A paper that provides a good example of this sort of research is "Making a Fast Curry: Push/Enter vs. Eval/Apply for Higher-Order Languages."
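A concrete example of a purity-enabled transformation is the canonical rewrite rule from GHC's documentation on RULES pragmas, which collapses two list traversals into one. The rewrite is only "safe" because f and g are pure, so merging their applications cannot change observable behaviour (whether this exact rule is active in a given version of base is a detail; it illustrates the kind of transformation purity licenses):

    {-# RULES
    "map/map"  forall f g xs.  map f (map g xs) = map (f . g) xs
      #-}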

Finally, it should be noted that this model still introduces an overhead due to indirections. This can be avoided in cases where we know that it is "safe" to do things strictly and thus elide graph indirections. The mechanisms that infer strictness/demand are again documented in some detail at the GHC Wiki.
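As a rough illustration (a made-up function, and whether GHC actually unboxes it depends on the optimisation settings): in a loop like the one below, the demand analyser can see that the accumulator is needed on every path to the result, so it is safe to evaluate it eagerly and pass it as a raw machine integer rather than as a pointer to a thunk, eliminating the graph indirections on the hot path.

    sumTo :: Int -> Int
    sumTo = go 0
      where
        go acc 0 = acc
        go acc n = go (acc + n) (n - 1)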

Solution 3

Well, there's a lot to comment on here. I'll try to answer as much as I can.

"Used correctly, it can get close-ish to low-level languages."

In my experience, it's usually possible to get within 2x the performance of Rust in many cases. But there are also some (broad) use cases where performance is poor compared to low-level languages.

"or even beat it, but that means you are using an inefficient C program, since GHC compiles Haskell to C"

That is not entirely correct. Haskell is compiled to C-- (a C-like intermediate language, not C itself), which is then compiled by the native code generator to assembly. The native code generator usually generates faster code than a C compiler would, because it can apply some optimizations that an ordinary C compiler can't.

"Machine architectures are clearly imperative, being based on turing machines, roughly."

That's not a good way to think about it, particularly since modern processors will evaluate instructions out of order and possibly at the same time.

"Indeed, Haskell doesn't even have a specific evaluation order."

Actually, Haskell does implicitly define an evaluation order.

"Also, instead of dealing with machine data types, you make algebraic data types all the time."

They correspond in many cases, provided you have a sufficiently advanced compiler.
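A small example of that correspondence (the type name is invented): a newtype exists only for the type checker, so at run time a Meters value is represented exactly like the underlying Double.

    newtype Meters = Meters Double

    addMeters :: Meters -> Meters -> Meters
    addMeters (Meters a) (Meters b) = Meters (a + b)   -- compiles to a plain Double addition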

"You would think that creating functions on the fly, and throwing them around, would make a program slower."

Haskell is compiled, and so higher-order functions are not actually created on the fly.

"it seems that, to optimize Haskell code, you need to make it more elegant and abstract instead of more machine-like."

In general, making code more "machine-like" is an unproductive way to get better performance in Haskell. But making it more abstract is not always a good idea either. What is a good idea is using common data structures and functions that have been heavily optimized (such as linked lists).

For instance, f x = [x] and f = pure are (at the list type) the same thing in Haskell. A good compiler would not yield better performance for the former.

"Why is Haskell (compiled with GHC) so fast, considering its abstract nature and differences from physical machines?"

The short answer is "because it was designed to do exactly that." GHC compiles via the Spineless Tagless G-machine (STG); the paper describing it is the one cited in Solution 2 above (it's quite complex). GHC does a lot of other things as well, such as strictness analysis and optimistic evaluation.

"The reason I say C and other imperative languages are somewhat similar to Turing Machines (but not to the extent that Haskell is similar to Lambda Calculus) is that in an imperative language, you have a finite number of states (a.k.a. line number), along with a Tape (the ram), such that the state and the current tape determine what to do to the tape."

Is the point of confusion, then, that the lack of mutability should lead to slower code? Haskell's laziness actually means that mutability doesn't matter as much as you'd think, and since the language is so high-level there are many optimizations the compiler can apply. Thus, modifying a record (which in Haskell means building an updated copy) will rarely be slower than an in-place update would be in a language such as C.


Comments

  • PyRulez
    PyRulez almost 2 years

    Haskell (with the GHC compiler) is a lot faster than you'd expect. Used correctly, it can get close-ish to low-level languages. (A favorite thing for Haskellers to do is to try and get within 5% of C (or even beat it, but that means you are using an inefficient C program, since GHC compiles Haskell to C).) My question is, why?

    Haskell is declarative and based on lambda calculus. Machine architectures are clearly imperative, being based on turing machines, roughly. Indeed, Haskell doesn't even have a specific evaluation order. Also, instead of dealing with machine data types, you make algebraic data types all the time.

    Weirdest of all though is higher order functions. You would think that creating functions on the fly, and throwing them around, would make a program slower. But using higher order functions actually makes Haskell faster. Indeed, it seems that, to optimize Haskell code, you need to make it more elegant and abstract instead of more machine-like. None of Haskell's more advanced features seem to even affect its performance, if they don't improve it.

    Sorry if this is sounding ranty, but here is my question: Why is Haskell (compiled with GHC) so fast, considering its abstract nature and differences from physical machines?

    Note: The reason I say C and other imperative languages are somewhat similar to Turing Machines (but not to the extent that Haskell is similar to Lambda Calculus) is that in an imperative language, you have a finite number of states (a.k.a. line number), along with a Tape (the ram), such that the state and the current tape determine what to do to the tape. See the Wikipedia entry, Turing machine equivalents, for the transition from Turing Machines to computers.

    • viraptor
      viraptor over 8 years
      "since GHC compiles Haskell to C" - it doesn't. GHC has multiple backends. The oldest one (but not the default) is a C generator. It does generate Cmm code for IR, but that's not "compiling to C" you'd normally expect. (downloads.haskell.org/~ghc/latest/docs/html/users_guide/…)
    • Kaz
      Kaz over 8 years
      Maybe you have lots of memory and haven't run anything long enough to need garbage collection to kick in. That's a time/space tradeoff: cons it up, and then terminate, letting the OS clean up the pages.
    • PyRulez
      PyRulez over 8 years
      @viraptor it compiles to assembly?
    • viraptor
      viraptor over 8 years
It compiles to whatever the backend wants. By default, yes - assembly. The llvm backend produces LLVM bitcode. ghcjs produces JavaScript. So in practice - whatever output the backend you choose produces.
    • Dietrich Epp
      Dietrich Epp over 8 years
      "Compiles to assembly"... well, eventually. GHC produces Cmm, LLVM bitcode, or C, and those get converted to assembly. The code that converts those IRs to assembly are not actually part of the GHC project itself, maybe with the exception of Cmm (not sure).
    • Daniel Wagner
      Daniel Wagner over 8 years
      "Machine architectures are... based on Turing machines" is a bit of an odd claim to make. RAM isn't really modelled very well by Turing machines. But even if we accept the premise, it doesn't seem to connect to anything; e.g. I don't see a reason to think something C-like is the obvious programming model for Turing machines.
    • mcandre
      mcandre over 8 years
      Because Haskell curries.
    • Joe Hillenbrand
      Joe Hillenbrand over 8 years
I highly recommend reading The Implementation of Functional Programming Languages by Simon Peyton Jones (the primary implementer of GHC); it will answer a lot of your questions.
    • Zeta
      Zeta over 8 years
      @PyRulez: Not only does it compile to native code via LLVM-IR or Cmm, but the compilation via C is disabled in your usual distributed GHC.
    • augustss
      augustss over 8 years
      Why? 25 years of hard work.
    • cimmanon
      cimmanon over 8 years
      Getting a lot of support from folks on Reddit does not make this question on-topic. This question is overly broad and even though there might be a factual answer to it, it will do nothing but solicit opinions.
    • sclv
      sclv over 8 years
      "Even though there might be a factual answer to it, it will do nothing but solicit opinions." -- This is the worst possible reason to close a question. Because it may have a good answer, but it will potentially also attract low-quality ones. Yuck! I so happen to have a good, historical, factual answer lined up about academic research and when certain developments happened. But I can't post it because people are worried this question may also attract low-quality answers. Again, yuck.
    • cimmanon
      cimmanon over 8 years
@sclv Note that this question has been closed twice now as "too broad", as in you would need a book or several blog posts to answer this question (on top of pointing out the OP's faulty understanding). I fail to see how this question can be considered anything other than too broad.
    • sclv
      sclv over 8 years
      @cimmanon I would need a month or several blog posts to walk through the basics of the details of how a functional compiler works. I only need an SO answer to sketch in broad outline how a graph machine can be implemented cleanly on stock hardware and point to the relevant sources for further reading...
    • Veedrac
      Veedrac over 8 years
      "None of Haskell's more advanced features seem to even affect its performance" → This isn't true. Eg. take 5 . filter (not . p) . drop 3 was slower than manually writing it out by a factor of 3 for me. That's not to say that Haskell isn't very clever at compiling away abstractions, but it's not immune to overhead.
    • Veedrac
      Veedrac over 8 years
      FWIW, that quotes the Computer Language Benchmarks Game, which isn't really credible for this argument. First of all, the code on that site is not written in typical high-level idiomatic Haskell. Secondly, the benchmarks are really badly controlled, such that certain Haskell programs take very large liberties with what is considered a reasonable interpretation of the rules (other languages are not immune to this, but IMO Haskell is the worst offender).
    • Alexei Averchenko
      Alexei Averchenko over 4 years
It's not fast; in fact it's pitifully slow unless you take pains to remove laziness from the hot paths. I challenge you to write one CPU-bound Haskell program that I won't be able to speed up by a factor of 10 by rewriting it in C++ and doing relatively simple source-level optimizations.
  • MathematicalOrchid
    MathematicalOrchid over 8 years
    @cimmanon I don't think it's purely opinion-based. It's an interesting question that other people have probably wanted an answer to. But I guess we'll see what other voters think.
  • cimmanon
    cimmanon over 8 years
    You do understand that Haskell questions are the butt of every joke regarding moderating questions here on SO, right? Overly broad or opinionated questions are welcomed by Haskellers, when they would be rightly closed in every other tag. On top of that, this question is getting a lot of support from Reddit (ie. people who don't even participate in this community).
  • sclv
    sclv over 8 years
    @cimmanon -- that search only gives one and a half threads, and they all have to do with review audits. and the upvoted answer to the thread says "please stop moderating stuff you don't understand." I would suggest that if someone thinks the answer to this is necessarily too broad then they would be surprised by and enjoy the answer, since the answer is not too broad.
  • Will Ness
    Will Ness over 8 years
    @sclv it is the question that is too broad (not opinion-based, etc.). an answer to too broad a question is necessarily too narrow. :)
  • sclv
    sclv over 8 years
    @WillNess -- this is a catch 22. i claim the question is not too broad because it lends itself to a relatively succinct answer. but you claim that it cannot be answered by a relatively succinct answer. i would respond, that just because you cannot imagine such an answer does not mean that this answer does not exist. it just means you might learn something from it. which is the point of stack overflow, i thought.
  • Will Ness
    Will Ness over 8 years
@sclv btw I disagree with the Q's premise. I know for a fact, on a well-known test case of primes calculation, that you only get within 5% of C with low-level code, and proper high-level list-based code runs about 10x - 20x slower. so all the marketing hype about purity etc. is just that. Of course it is still much faster than other FP languages doing the same; still, it shows how ill-defined and vague the Q's terms are. IOW the Q itself necessitates much discussion.
  • Boann
    Boann over 8 years
    "In, say, Java, the JVM must check for null pointers and throw an exception if you deference one." Java's implicit null check is (mostly) costless. Java implementations can and do take advantage of virtual memory to map the null address to a missing page, so dereferencing a null pointer triggers a page fault at the CPU level, which Java catches and throws as a high-level exception. So most null checks get done by the memory mapping unit in the CPU, for free.
  • MathematicalOrchid
    MathematicalOrchid over 8 years
    @Boann That's quite interesting. I did not know that.
  • Tarc
    Tarc over 8 years
@MathematicalOrchid, it would be nice if you elaborated more on your byte-counter example (say, in a note). I'm curious about what it was that penalized you so much (6GB, 20 minutes!).
  • Evi1M4chine
    Evi1M4chine over 7 years
    That demand analyzer link is worth its weight in gold! Finally something about the topic that isn’t acting as if it’s basically inexplicable black magic. How have I never heard of this?? It should be linked from everywhere where anyone asks how to tackle the problems with laziness!
  • George
    George over 7 years
    @MathematicalOrchid: do you have a copy of your original program that took 20 min to run? I think it would be quite instructive to study why it's so slow.
  • semicolon
    semicolon about 7 years
    @MathematicalOrchid I too would like to see the original program. I know laziness can bite sometimes but that is a somewhat impressive performance change.
  • Cris P
    Cris P almost 7 years
    @Evi1M4chine I do not see a link related to a demand analyzer, perhaps it has been lost somehow. Can someone restore the link or clarify the reference? It sounds quite interesting.
  • Serp C
    Serp C over 6 years
    @CrisP I believe the last link is what is being referred to. It goes to a page on the GHC Wiki about the demand analyzer in GHC.
  • Evi1M4chine
    Evi1M4chine over 6 years
    @Serpentine Cougar, Chris P: Yep, It’s what I meant.
  • schulmaster
    schulmaster over 6 years
I wrote a hashtable-based version of the byte counter in C++, in 7 lines, two of which were unnecessary, and it profiled in at 8.1ms on a 28KB file. I too am curious how you managed 20 min in a high-level language catered to aggressive optimization, and why, even when correctly banged, it is not competitive with a modern lower-level language.
  • kravitz
    kravitz about 6 years
Agreed with the lazy-evaluation problem. As part of a Computer Science course I tried to implement Huffman compression in Haskell; the algorithm was a simple two-pass encoding (the 1st pass generates the coding tree/alphabet, the 2nd pass does the actual compression). And this 2-pass thing shot me in the leg hard. Since you can't tell the Haskell runtime directly what stuff you don't want to store, things become tricky. Bang patterns didn't help me that time, but keeping the bytestream strictly in the IO monad did. Took me two days to solve, and even then only after asking SO for help.
  • Shayne
    Shayne almost 5 years
I do actually remember it used to be a common accusation that Pascal was slow. It was however based on earlier Pascal implementations that ran off "P-code", a sort of bytecode intermediary, which actually made it closer to, say, pre-JIT Java, and in the early 80s that was straight up too slow. However, Turbo Pascal and a number of non-academia compilers were genuine honest-to-goodness machine code compilers, and that gave Turbo Pascal blinding speed, on par with Turbo C/C++ (which it shared its backend with). Everyone agreed Borland C was fast. But Borland Pascal was just as fast. Same compiler!
  • Jonas Kölker
    Jonas Kölker over 4 years
    "Haskell is very high-level [which lets the compiler optimize well]". Could you define high-level-ness and explain what about it makes the compiler able to optimize better? Are scheme, python and javascript equally high-level? Are they equally optimizeable? If not, why not? Here's my take: the more execution strategies a program admits, the more likely it is that one will be fast (and the compiler people will have found a fast one). Pure code (Haskell and read-only SQL queries) admit many strategies. Side effects which have to be sequenced shrink this space and are thus harder to optimize.
  • MathematicalOrchid
    MathematicalOrchid over 3 years
    @George I don't still have the program. Looking back at it, I think the actual killer was that the program's working set was 6GB, and the machine only had 128MB of physical RAM. That's going to be slow in any programming language... (Should also tell you how long ago all this happened!)