Processes, threads, green threads, protothreads, fibers, coroutines: what's the difference?

Solution 1

OK, I'm going to do my best. There are caveats everywhere, but here is my understanding of these terms, together with references to things that approximate the definitions I've given.

  • Process: OS-managed, and possibly truly concurrent, at least in the presence of suitable hardware support. Each process exists within its own address space.
  • Thread: OS-managed, within the same address space as the parent and all its other threads. Possibly truly concurrent, and multi-tasking is pre-emptive.
  • Green Thread: These are user-space projections of the same concept as threads, but are not OS-managed. Probably not truly concurrent, except in the sense that there may be multiple worker threads or processes giving them CPU time concurrently, so probably best to consider this as interleaved or multiplexed.
  • Protothreads: I couldn't really tease a definition out of these. I think they are interleaved and program-managed, but don't take my word for it. My sense was that they are essentially an application-specific implementation of the same kind of "green threads" model, with appropriate modification for the application domain.
  • Fibers: OS-managed. Exactly threads, except co-operatively multitasking, and hence not truly concurrent.
  • Coroutines: Exactly fibers, except not OS-managed.
  • Goroutines: They claim to be unlike anything else, but they seem to be exactly green threads, as in, process-managed in a single address space and multiplexed onto system threads. Perhaps somebody with more knowledge of Go can cut through the marketing material.
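The thread/green-thread distinction above can be sketched in Python, treating generators as "green threads" driven by a hand-rolled round-robin loop (the names `worker` and `green_worker`, and the scheduler loop, are invented for illustration, not from any library):

```python
import threading

# An OS thread: the system schedules it and may preempt it at any point.
def worker(name, results):
    for i in range(3):
        results.append(f"{name}:{i}")

results = []
t = threading.Thread(target=worker, args=("thread", results))
t.start()
t.join()

# A "green thread" as a generator: it runs only until it chooses to
# yield, and a user-space loop (not the OS) decides who runs next.
def green_worker(name):
    for i in range(3):
        yield f"{name}:{i}"  # explicit yield point

tasks = [green_worker("a"), green_worker("b")]
order = []
while tasks:
    task = tasks.pop(0)
    try:
        order.append(next(task))  # run until the next yield
        tasks.append(task)        # round-robin: go to the back of the queue
    except StopIteration:
        pass  # this green thread has finished

print(order)  # interleaved: ['a:0', 'b:0', 'a:1', 'b:1', 'a:2', 'b:2']
```

The interleaving here is entirely deterministic and user-controlled, which is exactly the "multiplexed rather than truly concurrent" behaviour described above.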

It's also worth noting that there are other understandings in concurrency theory of the term "process", in the process calculus sense. This definition is orthogonal to those above, but I just thought it worth mentioning so that no confusion arises should you see process used in that sense somewhere.

Also, be aware of the difference between parallel and concurrent. It's possible you were using the former in your question where I think you meant the latter.

Solution 2

I mostly agree with Gian's answer, but I have different interpretations of a few concurrency primitives. Note that these terms are often used inconsistently by different authors. These are my favorite definitions (hopefully not too far from the modern consensus).

  • Process:
    • OS-managed
    • Each has its own virtual address space
    • Can be interrupted (preempted) by the system to allow another process to run
    • Can run in parallel with other processes on different processors
    • The memory overhead of processes is high (includes virtual memory tables, open file handles, etc)
    • The time overhead for creating and context switching between processes is relatively high
  • Threads:
    • OS-managed
    • Each is "contained" within some particular process
    • All threads in the same process share the same virtual address space
    • Can be interrupted by the system to allow another thread to run
    • Can run in parallel with other threads on different processors
    • The memory and time overheads associated with threads are smaller than processes, but still non-trivial
      • (For example, typically context switching involves entering the kernel and invoking the system scheduler.)
  • Cooperative Threads:
    • May or may not be OS-managed
    • Each is "contained" within some particular process
    • In some implementations, each is "contained" within some particular OS thread
    • Cannot be interrupted by the system to allow a cooperative peer to run
      • (The containing process/thread can still be interrupted, of course)
    • Must invoke a special yield primitive to allow peer cooperative threads to run
    • Generally cannot be run in parallel with cooperative peers
    • There are lots of variations on the cooperative thread theme that go by different names:
      • Fibers
      • Green threads
      • Protothreads
      • User-level threads (user-level threads can be interruptable/preemptive, but that's a relatively unusual combination)
    • Some implementations of cooperative threads use techniques like split/segmented stacks or even individually heap-allocating every call frame to reduce the memory overhead associated with pre-allocating a large chunk of memory for the stack
    • Depending on the implementation, calling a blocking syscall (like reading from the network or sleeping) will either cause a whole group of cooperative threads to block or implicitly cause the calling thread to yield
  • Coroutines:
    • Some people use "coroutine" and "cooperative thread" more or less synonymously
      • I do not prefer this usage
    • Some coroutine implementations are actually "shallow" cooperative threads; yield can only be invoked by the "coroutine entry procedure"
    • The shallow (or semi-coroutine) version is easier to implement than threads, because each coroutine does not need a complete stack (just one frame for the entry procedure)
    • Often coroutine frameworks have yield primitives that require the invoker to explicitly state which coroutine control should transfer to
  • Generators:
    • Restricted (shallow) coroutines
    • yield can only return control back to whichever code invoked the generator
  • Goroutines:
    • An odd hybrid of cooperative and OS threads
    • Cannot be interrupted (like cooperative threads)
    • Can run in parallel on a language runtime-managed pool of OS threads
  • Event handlers:
    • Procedures/methods that are invoked by an event dispatcher in response to some action happening
    • Very popular for user interface programming
    • Require little to no language/system support; can be implemented in a library
    • At most one event handler can be running at a time; the dispatcher must wait for a handler to finish (return) before starting the next
      • Makes synchronization relatively simple; different handler executions never overlap in time
    • Implementing complex tasks with event handlers tends to lead to "inverted control flow"/"stack ripping"
  • Tasks:
    • Units of work that are doled out by a manager to a pool of workers
    • The workers can be threads, processes or machines
      • Of course the kind of worker a task library uses has a significant impact on how one implements the tasks
    • In this list of inconsistently and confusingly used terminology, "task" takes the crown. Particularly in the embedded systems community, "task" is sometimes used to mean "process", "thread" or "event handler" (usually called an "interrupt service routine"). It is also sometimes used generically/informally to refer to any kind of unit of computation.
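The "generators are restricted (shallow) coroutines" point can be made concrete with a short Python sketch (the `counter` generator is invented for illustration); `yield` suspends the generator and hands control back only to whichever code called `next()`:

```python
def counter(limit):
    # A generator is a shallow coroutine: `yield` can only transfer
    # control back to the invoker, never to an arbitrary peer.
    n = 0
    while n < limit:
        yield n  # suspend here; resume on the next next() call
        n += 1

gen = counter(3)
print(next(gen))   # 0 — runs until the first yield
print(next(gen))   # 1 — resumes just after the yield
print(list(gen))   # [2] — drains the remaining values
```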

One pet peeve that I can't stop myself from airing: I dislike the use of the phrase "true concurrency" for "processor parallelism". It's quite common, but I think it leads to much confusion.

For most applications, I think task-based frameworks are best for parallelization. Most of the popular ones (Intel's TBB, Apple's GCD, Microsoft's TPL & PPL) use threads as workers. I wish there were some good alternatives that used processes, but I'm not aware of any.
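Python's standard library has a rough analogue of such task-based frameworks in `concurrent.futures`; a minimal sketch (the `task` function is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def task(n):
    # A "task" is just a unit of work; the executor decides which
    # worker thread runs it and when.
    return n * n

# Four worker threads servicing five tasks from a shared pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    squares = list(pool.map(task, range(5)))

print(squares)  # [0, 1, 4, 9, 16]
```

Incidentally, in Python at least, swapping in `concurrent.futures.ProcessPoolExecutor` keeps the same interface but uses processes as workers.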

If you're interested in concurrency (as opposed to processor parallelism), event handlers are the safest way to go. Cooperative threads are an interesting alternative, but a bit of a wild west. Please do not use threads for concurrency if you care about the reliability and robustness of your software.
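An event dispatcher of the kind described needs no special language support; here is a minimal library-level sketch in Python (the `Dispatcher` class and the event names are invented for illustration). Handlers run strictly one at a time, so their executions never overlap:

```python
from collections import defaultdict, deque

class Dispatcher:
    # Minimal event dispatcher: handlers are plain callables, and the
    # dispatch loop runs each one to completion before starting the next.
    def __init__(self):
        self.handlers = defaultdict(list)
        self.queue = deque()

    def on(self, event, handler):
        self.handlers[event].append(handler)

    def emit(self, event, payload):
        self.queue.append((event, payload))  # just enqueue; run() dispatches

    def run(self):
        while self.queue:
            event, payload = self.queue.popleft()
            for handler in self.handlers[event]:
                handler(payload)  # finishes before the next handler starts

log = []
d = Dispatcher()
d.on("click", lambda pos: log.append(f"clicked {pos}"))
d.on("click", lambda pos: d.emit("log", pos))  # handlers may emit new events
d.on("log", lambda pos: log.append(f"logged {pos}"))
d.emit("click", (1, 2))
d.run()
print(log)  # ['clicked (1, 2)', 'logged (1, 2)']
```

Because no two handlers ever run at the same time, the `log` list needs no locking, which is the synchronization simplicity mentioned above.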

Author: jameshfisher
Updated on June 06, 2022

Comments

  • jameshfisher, almost 2 years ago:

    I'm reading up on concurrency. I've got a bit over my head with terms that have confusingly similar definitions. Namely:

    • Processes
    • Threads
    • "Green threads"
    • Protothreads
    • Fibers
    • Coroutines
    • "Goroutines" in the Go language

    My impression is that the distinctions rest on (1) whether truly parallel or multiplexed; (2) whether managed at the CPU, at the OS, or in the program; and (3..5) a few other things I can't identify.

    Is there a succinct and unambiguous guide to the differences between these approaches to parallelism?