Technically, why are processes in Erlang more efficient than OS threads?

multithreading erlang

33,595

Solution 1

There are several contributing factors:

Erlang processes are not OS processes. They are implemented by the Erlang VM using a lightweight cooperative threading model (preemptive at the Erlang level, but under the control of a cooperatively scheduled runtime). This means that it is much cheaper to switch context, because they only switch at known, controlled points and therefore don't have to save the entire CPU state (normal, SSE and FPU registers, address space mapping, etc.).
Erlang processes use dynamically allocated stacks, which start very small and grow as necessary. This permits the spawning of many thousands — even millions — of Erlang processes without sucking up all available RAM.
Erlang used to be single-threaded, meaning that there was no requirement to ensure thread-safety between processes. It now supports SMP, but the interaction between Erlang processes on the same scheduler/core is still very lightweight (there are separate run queues per core).

Solution 2

After some more research I found a presentation by Joe Armstrong.

From Erlang - software for a concurrent world (presentation) (at 13 min):

[Erlang] is a concurrent language – by that I mean that threads are part of the programming language, they do not belong to the operating system. That's really what's wrong with programming languages like Java and C++. It's threads aren't in the programming language, threads are something in the operating system – and they inherit all the problems that they have in the operating system. One of the problems is granularity of the memory management system. The memory management in the operating system protects whole pages of memory, so the smallest size that a thread can be is the smallest size of a page. That's actually too big.

If you add more memory to your machine – you have the same number of bits that protects the memory so the granularity of the page tables goes up – you end up using say 64kB for a process you know running in a few hundred bytes.

I think it answers if not all, at least a few of my questions

Solution 3

I've implemented coroutines in assembler, and measured performance.

Switching between coroutines, a.k.a. Erlang processes, takes about 16 instructions and 20 nanoseconds on a modern processor. Also, you often know the process you are switching to (example: a process receiving a message in its queue can be implemented as straight hand-off from the calling process to the receiving process) so the scheduler doesn't come into play, making it an O(1) operation.

To switch OS threads, it takes about 500-1000 nanoseconds, because you're calling down to the kernel. The OS thread scheduler might run in O(log(n)) or O(log(log(n))) time, which will start to be noticeable if you have tens of thousands, or even millions of threads.

Therefore, Erlang processes are faster and scale better because both the fundamental operation of switching is faster, and the scheduler runs less often.

Solution 4

Erlang processes correspond (approximately) to green threads in other languages; there's no OS-enforced separation between the processes. (There may well be language-enforced separation, but that's a lesser protection despite Erlang doing a better job than most.) Because they're so much lighter-weight, they can be used far more extensively.

OS threads on the other hand are able to be simply scheduled on different CPU cores, and are (mostly) able to support independent CPU-bound processing. OS processes are like OS threads, but with much stronger OS-enforced separation. The price of these capabilities is that OS threads and (even more so) processes are more expensive.

Another way to understand the difference is this. Supposing you were going to write an implementation of Erlang on top of the JVM (not a particularly crazy suggestion) then you'd make each Erlang process be an object with some state. You'd then have a pool of Thread instances (typically sized according to the number of cores in your host system; that's a tunable parameter in real Erlang runtimes BTW) which run the Erlang processes. In turn, that will distribute the work that is to be done across the real system resources available. It's a pretty neat way of doing things, but relies utterly on the fact that each individual Erlang process doesn't do very much. That's OK of course; Erlang is structured to not require those individual processes to be heavyweight since it is the overall ensemble of them which execute the program.

In many ways, the real problem is one of terminology. The things that Erlang calls processes (and which correspond strongly to the same concept in CSP, CCS, and particularly the π-calculus) are simply not the same as the things that languages with a C heritage (including C++, Java, C#, and many others) call a process or a thread. There are some similarities (all involve some notion of concurrent execution) but there's definitely no equivalence. So be careful when someone says “process” to you; they might understand it to mean something utterly different…

Solution 5

I think Jonas wanted some numbers on comparing OS threads to Erlang processes. The author of Programming Erlang, Joe Armstrong, a while back tested the scalability of the spawning of Erlang processes to OS threads. He wrote a simple web server in Erlang and tested it against multi-threaded Apache (since Apache uses OS threads). There's an old website with the data dating back to 1998. I've managed only to find that site exactly once. So I can't supply a link. But the information is out there. The main point of the study showed that Apache maxed out just under 8K processes, while his hand written Erlang server handled 10K+ processes.

View more solutions

33,595

Jonas

Passionated Software Developer interested in Distributed Systems

Updated on February 19, 2022

Comments

Jonas about 2 years
Erlang's Characteristics

From Erlang Programming (2009):

Erlang concurrency is fast and scalable. Its processes are lightweight in that the Erlang virtual machine does not create an OS thread for every created process. They are created, scheduled, and handled in the VM, independent of underlying operating system. As a result, process creation time is of the order of microseconds and independent of the number of concurrently existing processes. Compare this with Java and C#, where for every process an underlying OS thread is created: you will get some very competitive comparisons, with Erlang greatly outperforming both languages.

From Concurrency oriented programming in Erlang (pdf) (slides) (2003):

We observe that the time taken to create an Erlang process is constant 1µs up to 2,500 processes; thereafter it increases to about 3µs for up to 30,000 processes. The performance of Java and C# is shown at the top of the figure. For a small number of processes it takes about 300µs to create a process. Creating more than two thousand processes is impossible.

We see that for up to 30,000 processes the time to send a message between two Erlang processes is about 0.8µs. For C# it takes about 50µs per message, up to the maximum number of processes (which was about 1800 processes). Java was even worse, for up to 100 process it took about 50µs per message thereafter it increased rapidly to 10ms per message when there were about 1000 Java processes.

My thoughts

I don't fully understand technically why Erlang processes are so much more efficient in spawning new processes and have much smaller memory footprints per process. Both the OS and Erlang VM have to do scheduling, context switches, and keep track of the values in the registers and so on...

Simply why aren't OS threads implemented in the same way as processes in Erlang? Do they have to support something more? And why do they need a bigger memory footprint? And why do they have slower spawning and communication?

Technically, why are processes in Erlang more efficient than OS threads when it comes to spawning and communication? And why can't threads in the OS be implemented and managed in the same efficient way? And why do OS threads have a bigger memory footprint, plus slower spawning and communication?

More reading
- T.J. Crowder about 14 years
  
  Before attempting to understand the reason why a hypothesis is true, you need to establish whether the hypothesis is true -- e.g., supported by the evidence. Do you have references for any like-for-like comparisons demonstrating that an Erlang process actually is more efficient than (say) a Java thread on an up-to-date JVM? Or a C app using OS process and thread support directly? (The latter seems very, very unlikely to me. The former only somewhat likely.) I mean, with a limited enough environment (Francisco's point), it might be true, but I'd want to see the numbers.
- Donal Fellows about 14 years
  
  The nature of the Erlang language means that it needs very few shared resources. In turn, that means less lock contention so it's easier to make things go fast. The language runtime is also able to use a mix of green threads running on native threads, so it can support a lot of lightweight processes, but they're much less capable in some senses than Java's threads. Anyone saying one is universally better than the other is full of it...
- T.J. Crowder about 14 years
  
  @Donal: As is the case with so many other absolute statements. :-)
- Jonas about 14 years
  
  @T.J. Crowder: Here are some numbers between Java and Erlang: sics.se/~joe/ericsson/du98024.html
- T.J. Crowder about 14 years
  
  @Jonas: Thanks, but I got as far as the date (1998-11-02) and JVM version (1.1.6) and stopped. Sun's JVM has improved a fair bit in the last 11.5 years (and presumably Erlang's interpreter has as well), particularly in the area of threading. (Just to be clear, I'm not saying that the hypothesis isn't true [and Francisco and Donal have pointed out why Erland may be able to do something there]; I'm saying it shouldn't be taken at face value without being checked.)
- Jonas about 14 years
  
  @T.J. Crowder: Yes, it was old, didn't find a newer. But all source code was there, so the benchmark can be run again. The article also gives a hint of why Erlang is fast. You could always try to spawn 1 million of threads in Java on your computer, I don't think you even can do it, but I guess you can do it in Erlang, and very quickly.
- T.J. Crowder about 14 years
  
  @Jonas: "...but I guess you can do it in Erlang..." It's that "guess" part, dude. :-) You're guessing that Erlang's process switching scales up past the thousands. You're guessing that it does so better than Java or OS threads. Guessing and software dev are not a great combination. :-) But I think I've made my point.
- Jonas about 14 years
  
  @T.J. Crowder: I know that Erlang can do this much better than Java, but I have to guess that you can spawn exactly 1 million processes in Erlang, since it depends on what hardware (i.e. CPU and memory) you have. But fair enough, I could have formulated it differently to be more clear.
- T.J. Crowder about 14 years
  
  @Jonas: Not arguing, genuinely asking : How do you know that? If the only quantitative data you have is from a completely out-dated article?
- Donal Fellows about 14 years
  
  @Jonas: Erlang processes are about communication with other processes. That's how the language is designed, that's what it is optimized to support. (And that's great! It does what it does excellently.) But the concepts it uses do not map exactly to other languages. In particular, if I was implementing Erlang on top of the JVM, I wouldn't be mapping an Erlang process to a Java thread; the latter are not as lightweight. It's a fundamental difference in concept too; Java's simply designed to be optimized for a different processing model.
- Donal Fellows about 14 years
  
  To be exact, Java copes with CPU-heavy processing far better (using a shared memory model), whereas Erlang is more suited for enormous numbers of communicating but ultimately simple processing elements (using a model closely related to CSP, CCS or the π-calculus).
- Jonas about 14 years
  
  @T.J. Crowder: I have updated my question now, and reformulated it to remove guessing. Thanks for your comment. I hope it looks better now.
- Shruti Singh about 14 years
  
  @T.J. Crowder: Install erlang and run erl +P 1000100 +hms 100 and than type {_, PIDs} = timer:tc(lists,map,[fun(_)->spawn(fun()->receive stop -> ok end end) end, lists:seq(1,1000000)]). and than wait about three minutes for result. That's so simple. It takes 140us per process and 1GB whole RAM on mine laptop. But it is directly form shell, it should be better from compiled code.
Jonas almost 14 years

Related: stackoverflow.com/questions/2267545/…
I GIVE TERRIBLE ADVICE almost 14 years

Erlang doesn't get anywhere close to Pi Calculus. Pi calculus assumes synchronous events over channels that can be bound to variables. This kind of concept doesn't fit the Erlang model at all. Try Join Calculus, Erlang's closer to that although it still needs to be able to natively join on some messages and whatnot. There was a thesis paper (and project) named JErlang dedicated which implemented it.
Donal Fellows almost 14 years

It all depends on what exactly you view the pi-calculus to be (and you can model asynchronous channels with synchronous channels plus buffer processes).
Jonas almost 14 years

You are just saying that Erlang processes are lightweight but you are not explaining why they have a smaller footprint (are lightweight) and why they have better performance than OS threads.
Donal Fellows almost 14 years

@Jonas: For some types of task (particularly computation-heavy tasks) OS threads do better. Mind you, those are not typically tasks for which Erlang is used; Erlang is focussed towards having large numbers of simple communicating tasks. One of the gains from doing that is that in the case of a group of tasks that hand a piece of work around and wait for the result, that can all be done in a single OS thread on a single processor, which is more efficient than having context switches.
Donal Fellows almost 14 years

Theoretically, you could make an OS thread be very cheap too through using a very small stack and carefully controlling the number of other thread-specific resources allocated, but that's quite problematic in practice. (Predicting stack requirements is a bit of a black art.) So instead OS threads are particularly designed to be optimal in the case where there are fewer of them (of the order of the number of CPU cores) and where they are doing more significant amounts of processing each.
Donal Fellows almost 14 years

And "best" is a slippery thing. It massively depends on what you're doing; give me a metric and I'll tell you which is "best". :-)
user2508301 almost 14 years

To your 2nd point: And if the process has not run yet, there is no reason to have stack allocated for it. In addition: Several tricks can be played by fiddling with the GC of a process such that it never collects memory. But that is advanced and somewhat dangerous :)
Jonas over 13 years

I think you are talking about this one: sics.se/~joe/apachevsyaws.html But I asked how erlang make threads so efficient compared to kerlenl threads.
nilskp about 12 years

To your 3rd point: Erlang enforces immutable data, so introducing SMP should not affect thread-safety.
liuyang1 over 11 years

@nilskp,that's right,erlang is also a functional programming language.So there is not "variable" data.this lead to thread-safety.
Alex S over 11 years

@nilskp: (RE: you comment on point 3…) Even though the language itself has an immutable type system, the underlying implementation — message passing, scheduler, etc. — is a different story altogether. Correct and efficient SMP support didn't just happen with the flick of a switch.
Alex S about 10 years

@rvirding: Thanks for the clarifying addendum. I've taken the liberty to integrate your points into the body of my answer.
alvaro g about 8 years

@Jonas link is dead. Last snapshot is here
Thanatos about 8 years

The memory protection on stacks is there for a reason. Does Erlang just not protect the stacks of different execution contexts via the processor's MMU? (And just hope for the best?) What if a thread uses more than its tiny stack? (Are all stack allocations checked to see if a larger stack is needed? Is the stack movable?)
Jörg W Mittag over 7 years

@Thanatos: Erlang doesn't allow programs to access memory or fiddle with the stack. All allocations must go through the managed runtime, both heap and stack. In other words: hardware protection is useless because it protects against things that cannot happen anyway. The language is pointer-safe, stack-safe, memory-safe, and type-safe. A process cannot use more than its "tiny stack" because the stack grows as needed. You may think of it as the opposite of tiny: infinitely large. (But lazily allocated.)
Jörg W Mittag over 7 years

You should take a look at the Singularity Operating System by Microsoft Research. In Singularity, all code, kernel, device drivers, libraries, and user programs runs in ring 0 with full kernel privileges. All code, kernel, device drivers, libraries, and user programs runs in a single flat physical address space with no memory protection whatsoever. The team found that the guarantees the language makes are much stronger than the guarantees the MMU can make, and at the same time using the MMU cost them up to 30%(!!!) in performance. So, why use the MMU if your language already does it anyway?
Jörg W Mittag over 7 years

The OS/400 Operating System works the same way. There is only a single flat address space for all programs. And most of the languages in actual use today have the same safety properties (ECMAScript, Java, C♯, VB.NET, PHP, Perl, Python, Ruby, Clojure, Scala, Kotlin, Groovy, Ceylon, F♯, OCaml, the "Objective" part of "Objective-C", the "++" part of "C++"). If it weren't for legacy C code, and legacy features of C++ and Objective-C, we wouldn't even need virtual memory anymore.
David Brown over 6 years

I think a lot of beginners get tripped up by Erlang's use of the term "process" when it doesn't correspond to OS processes. It's the same with Lua, which uses the term "thread" to describe what are actually cooperatively-scheduled coroutines that have no relation to OS threads at all. So what other term should one use to describe the concept of Erlang's processes? Agent? Unit?
Nathan Long over 5 years

The article said: "Apache dies at about 4,000 parallel sessions. Yaws is still functioning at over 80,000 parallel connections."
Bernhard about 5 years

see the full article at citeseerx.ist.psu.edu/viewdoc/… Indeed, it proved impossible to break the Erlang server using 16 attacking machines - though it was easy to stop the Apache server.