Is non-blocking I/O really faster than multi-threaded blocking I/O? How?

multithreading io blocking nonblocking

44,074

Solution 1

The biggest advantage of nonblocking or asynchronous I/O is that your thread can continue its work in parallel. Of course you can achieve this also using an additional thread. As you stated for best overall (system) performance I guess it would be better to use asynchronous I/O and not multiple threads (so reducing thread switching).

Let's look at possible implementations of a network server program that shall handle 1000 clients connected in parallel:

One thread per connection (can be blocking I/O, but can also be non-blocking I/O).
Each thread requires memory resources (also kernel memory!), that is a disadvantage. And every additional thread means more work for the scheduler.
One thread for all connections.
This takes load from the system because we have fewer threads. But it also prevents you from using the full performance of your machine, because you might end up driving one processor to 100% and letting all other processors idle around.
A few threads where each thread handles some of the connections.
This takes load from the system because there are fewer threads. And it can use all available processors. On Windows this approach is supported by Thread Pool API.

Of course having more threads is not per se a problem. As you might have recognized I chose quite a high number of connections/threads. I doubt that you'll see any difference between the three possible implementations if we are talking about only a dozen threads (this is also what Raymond Chen suggests on the MSDN blog post Does Windows have a limit of 2000 threads per process?).

On Windows using unbuffered file I/O means that writes must be of a size which is a multiple of the page size. I have not tested it, but it sounds like this could also affect write performance positively for buffered synchronous and asynchronous writes.

The steps 1 to 7 you describe give a good idea of how it works. On Windows the operating system will inform you about completion of an asynchronous I/O (WriteFile with OVERLAPPED structure) using an event or a callback. Callback functions will only be called for example when your code calls WaitForMultipleObjectsEx with bAlertable set to true.

Some more reading on the web:

Multiple Threads in the User Interface on MSDN, also shortly handling the cost of creating threads
Section Threads and Thread Pools says "Although threads are relatively easy to create and use, the operating system allocates a significant amount of time and other resources to manage them."
CreateThread documentation on MSDN says "However, your application will have better performance if you create one thread per processor and build queues of requests for which the application maintains the context information.".
Old article Why Too Many Threads Hurts Performance, and What to do About It

Solution 2

I/O includes multiple kind of operations like reading and writing data from hard drives, accessing network resources, calling web services or retrieving data from databases. Depending on the platform and on the kind of operation, asynchronous I/O will usually take advantage of any hardware or low level system support for performing the operation. This means that it will be performed with as little impact as possible on the CPU.

At application level, asynchronous I/O prevents threads from having to wait for I/O operations to complete. As soon as an asynchronous I/O operation is started, it releases the thread on which it was launched and a callback is registered. When the operation completes, the callback is queued for execution on the first available thread.

If the I/O operation is executed synchronously, it keeps its running thread doing nothing until the operation completes. The runtime doesn't know when the I/O operation completes, so it will periodically provide some CPU time to the waiting thread, CPU time that could have otherwise be used by other threads that have actual CPU bound operations to perform.

So, as @user1629468 mentioned, asynchronous I/O does not provide better performance but rather better scalability. This is obvious when running in contexts that have a limited number of threads available, like it is the case with web applications. Web application usually use a thread pool from which they assign threads to each request. If requests are blocked on long running I/O operations there is the risk of depleting the web pool and making the web application freeze or slow to respond.

One thing I have noticed is that asynchronous I/O isn't the best option when dealing with very fast I/O operations. In that case the benefit of not keeping a thread busy while waiting for the I/O operation to complete is not very important and the fact that the operation is started on one thread and it is completed on another adds an overhead to the overall execution.

You can read a more detailed research I have recently made on the topic of asynchronous I/O vs. multithreading here.

Solution 3

To presume a speed improvement due to any form of multi-computing you must presume either that multiple CPU-based tasks are being executed concurrently upon multiple computing resources (generally processor cores) or else that not all of the tasks rely upon the concurrent usage of the same resource -- that is, some tasks may depend on one system subcomponent (disk storage, say) while some tasks depend on another (receiving communication from a peripheral device) and still others may require usage of processor cores.

The first scenario is often referred to as "parallel" programming. The second scenario is often referred to as "concurrent" or "asynchronous" programming, although "concurrent" is sometimes also used to refer to the case of merely allowing an operating system to interleave execution of multiple tasks, regardless of whether such execution must take place serially or if multiple resources can be used to achieve parallel execution. In this latter case, "concurrent" generally refers to the way that execution is written in the program, rather than from the perspective of the actual simultaneity of task execution.

It's very easy to speak about all of this with tacit assumptions. For example, some are quick to make a claim such as "Asynchronous I/O will be faster than multi-threaded I/O." This claim is dubious for several reasons. First, it could be the case that some given asynchronous I/O framework is implemented precisely with multi-threading, in which case they are one in the same and it doesn't make sense to say one concept "is faster than" the other.

Second, even in the case when there is a single-threaded implementation of an asynchronous framework (such as a single-threaded event loop) you must still make an assumption about what that loop is doing. For example, one silly thing you can do with a single-threaded event loop is request for it to asynchronously complete two different purely CPU-bound tasks. If you did this on a machine with only an idealized single processor core (ignoring modern hardware optimizations) then performing this task "asynchronously" wouldn't really perform any differently than performing it with two independently managed threads, or with just one lone process -- the difference might come down to thread context switching or operating system schedule optimizations, but if both tasks are going to the CPU it would be similar in either case.

It is useful to imagine a lot of the unusual or stupid corner cases you might run into.

"Asynchronous" does not have to be concurrent, for example just as above: you "asynchronously" execute two CPU-bound tasks on a machine with exactly one processor core.

Multi-threaded execution doesn't have to be concurrent: you spawn two threads on a machine with a single processor core, or ask two threads to acquire any other kind of scarce resource (imagine, say, a network database that can only establish one connection at a time). The threads' execution might be interleaved however the operating system scheduler sees fit, but their total runtime cannot be reduced (and will be increased from the thread context switching) on a single core (or more generally, if you spawn more threads than there are cores to run them, or have more threads asking for a resource than what the resource can sustain). This same thing goes for multi-processing as well.

So neither asynchronous I/O nor multi-threading have to offer any performance gain in terms of run time. They can even slow things down.

If you define a specific use case, however, like a specific program that both makes a network call to retrieve data from a network-connected resource like a remote database and also does some local CPU-bound computation, then you can start to reason about the performance differences between the two methods given a particular assumption about hardware.

The questions to ask: How many computational steps do I need to perform and how many independent systems of resources are there to perform them? Are there subsets of the computational steps that require usage of independent system subcomponents and can benefit from doing so concurrently? How many processor cores do I have and what is the overhead for using multiple processors or threads to complete tasks on separate cores?

If your tasks largely rely on independent subsystems, then an asynchronous solution might be good. If the number of threads needed to handle it would be large, such that context switching became non-trivial for the operating system, then a single-threaded asynchronous solution might be better.

Whenever the tasks are bound by the same resource (e.g. multiple needs to concurrently access the same network or local resource), then multi-threading will probably introduce unsatisfactory overhead, and while single-threaded asynchrony may introduce less overhead, in such a resource-limited situation it too cannot produce a speed-up. In such a case, the only option (if you want a speed-up) is to make multiple copies of that resource available (e.g. multiple processor cores if the scarce resource is CPU; a better database that supports more concurrent connections if the scarce resource is a connection-limited database, etc.).

Another way to put it is: allowing the operating system to interleave the usage of a single resource for two tasks cannot be faster than merely letting one task use the resource while the other waits, then letting the second task finish serially. Further, the scheduler cost of interleaving means in any real situation it actually creates a slowdown. It doesn't matter if the interleaved usage occurs of the CPU, a network resource, a memory resource, a peripheral device, or any other system resource.

Solution 4

The main reason to use AIO is for scalability. When viewed in the context of a few threads, the benefits are not obvious. But when the system scales to 1000s of threads, AIO will offer much better performance. The caveat is that AIO library should not introduce further bottlenecks.

Solution 5

One possible implementation of non-blocking I/O is exactly what you said, with a pool of background threads that do blocking I/O and notify the thread of the originator of the I/O via some callback mechanism. In fact, this is how the AIO module in glibc works. Here are some vague details about the implementation.

While this is a good solution that is quite portable (as long as you have threads), the OS is typically able to service non-blocking I/O more efficiently. This Wikipedia article lists possible implementations besides the thread pool.

View more solutions

44,074

Author by

yankee

Updated on July 08, 2022

Comments

yankee almost 2 years
I searched the web on some technical details about blocking I/O and non blocking I/O and I found several people stating that non-blocking I/O would be faster than blocking I/O. For example in this document.

If I use blocking I/O, then of course the thread that is currently blocked can't do anything else... Because it's blocked. But as soon as a thread starts being blocked, the OS can switch to another thread and not switch back until there is something to do for the blocked thread. So as long as there is another thread on the system that needs CPU and is not blocked, there should not be any more CPU idle time compared to an event based non-blocking approach, is there?

Besides reducing the time the CPU is idle I see one more option to increase the number of tasks a computer can perform in a given time frame: Reduce the overhead introduced by switching threads. But how can this be done? And is the overhead large enough to show measurable effects? Here is an idea on how I can picture it working:
1. To load the contents of a file, an application delegates this task to an event-based i/o framework, passing a callback function along with a filename
2. The event framework delegates to the operating system, which programs a DMA controller of the hard disk to write the file directly to memory
3. The event framework allows further code to run.
4. Upon completion of the disk-to-memory copy, the DMA controller causes an interrupt.
5. The operating system's interrupt handler notifies the event-based i/o framework about the file being completely loaded into memory. How does it do that? Using a signal??
6. The code that is currently run within the event i/o framework finishes.
7. The event-based i/o framework checks its queue and sees the operating system's message from step 5 and executes the callback it got in step 1.
Is that how it works? If it does not, how does it work? That means that the event system can work without ever having the need to explicitly touch the stack (such as a real scheduler that would need to backup the stack and copy the stack of another thread into memory while switching threads)? How much time does this actually save? Is there more to it?
supercat almost 11 years

I wonder if it would be worth making a distinction between I/O operations that are expected to complete, and things that might not [e.g. "get next character that arrives on a serial port", in cases where the remote device may or may not send anything]. If an I/O operation is expected to complete within a reasonable time, one might delay cleanup of the related resources until the operation completes. If the operation might never complete, however, such delay would be unreasonable.
Florin Dumitrescu almost 11 years

@supercat the scenario you are describing is used in lower level applications and libraries. Servers are relying on it, since they continuously wait for incoming connections. Async I/O as described above cannot fit in this scenario because it is based on starting a specific operation and registering a callback for its completion. In the case you are describing you need to register a callback on a system event and process every notification. You are continuously processing input rather than performing operations. As said, this is usually done at low level, almost never in your apps.
supercat almost 11 years

The pattern is pretty common with applications that come with various types of hardware. Serial ports aren't as common as they used to be, but USB chips which emulate serial ports are pretty popular in the design of specialized hardware. Characters from such things are handled at the application level, since the OS will have no way of knowing that a sequence of input characters means e.g. a cash drawer was opened and a notification should be sent somewhere.
JavierJ over 8 years

From the web point of view common knowledge (Internet, comments from experts) suggest that greatly increasing the max. number of request threads is a bad thing in blocking IO (making processing of requests even slower) due to memory increase and context switching time, but, isn't Async IO doing the same thing when deferring the job to another thread? Yes you can serve more requests now but have the same number of threads in the background.. what's the real benefit of that?
Werner Henze over 8 years

@JavierJ You seem to believe that if n threads do async file IO another n threads will be created to do a blocking file IO? This is not true. The OS has async file IO support and it does not need to block when waiting for IO to complete. It can queue IO requests and if a hardware (e.g. DMA) interrupt occurs it can mark the request as done and set an event that signals the callers thread. Even if an extra thread was required the OS would be able to use that thread for multiple IO requests from multiple threads.
JavierJ over 8 years

Thanks, it makes sense involving the OS async file IO support but when I write code for an actual implementation of this (from the web point of view) say with Java Servlet 3.0 NIO I still see a thread for the request and a background thread (async) looping to read a file, database or whatever.
Werner Henze over 8 years

@JavierJ This looks like an implementation detail of the Java runtime.
piyushGoyal over 8 years

@WernerHenze : As per you "But depending on the number of threads you'll surely not see any difference between asynchronous IO and using only a few write threads." --- what is the reason behind this? I can't figure out if my number of threads is smaller, why doing a non blocking I/O on them would not be of much use?
Werner Henze over 8 years

@piyushGoyal I rewote my answer. I hope it is clearer now.
piyushGoyal over 8 years

Yep.. the picture is clearer. So it means that if the I/O stuff is actually heavy then using non blocking I/O significant performance can be achieved. If it's not that heavy, then we won't be able to see much performance diff as compared to some threads doing the blocking I/O in background. Am I right to think this way? There is a question asked by me at stackoverflow.com/questions/34877705/… . I could somehow relate the answer through this post. Can you please help me looking out for exact answer there?
Harry Johnston over 7 years

On Windows using asynchronous file I/O means that writes must be of a size which is a multiple of the page size. - no, it doesn't. You're thinking of unbuffered I/O. (They are often used together, but they don't have to be.)
Werner Henze over 7 years

@HarryJohnston You are right. Thanks for pointing that out. I adjusted my answer.
Lifu Huang almost 4 years

I don’t think the part about CPU cost of blocking IO is accurate: when in blocking state, a thread that triggered blocking IO is put on wait by the OS and does not cost CPU periods until the IO has fully completed, only after which does the OS (notifies by interrupts) resume the blocked thread. What you described (busy wait by long polling) is not how blocking IO is implemented in almost any runtime/compiler.
stdout about 3 years

the threads you mention above are kernel-space threads i assume, right?
ng.newbie about 2 years

@FlorinDumitrescu The link is dead.