What's the difference between ThreadPool and Pool in the multiprocessing module?
multiprocessing.pool.ThreadPool behaves the same as multiprocessing.Pool; the only difference is that it uses threads instead of processes to run the workers' logic.
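To make the "same interface, different workers" point concrete, here is a minimal sketch. The helper names `where_am_i` and `run_with` are made up for this illustration and are not part of the multiprocessing module.

```python
import os
import threading
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool


def where_am_i(x):
    """Return the input with the id of the process and the name of the thread that ran it."""
    return x, os.getpid(), threading.current_thread().name


def run_with(pool_cls, n_workers=3):
    """Map where_am_i over a few inputs using the given pool class (hypothetical helper)."""
    with pool_cls(n_workers) as pool:
        return pool.map(where_am_i, range(3))


if __name__ == "__main__":
    print("parent pid:", os.getpid())
    # ThreadPool workers are threads, so they report the parent's pid.
    print("ThreadPool:", run_with(ThreadPool))
    # Pool workers are separate processes, so they report different pids.
    print("Pool:      ", run_with(Pool))
```

The calling code is identical for both classes; only the reported process ids should differ.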
The reason you see "hi outside of main()" printed multiple times with multiprocessing.Pool is that the pool spawns 5 independent processes. Each process initializes its own Python interpreter and loads the module, so the top-level print is executed again.
Note that this happens only if the spawn process creation method is used (the only method available on Windows, and the default on macOS since Python 3.8). If you use the fork method (Unix), you will see the message printed only once, as with the threads.
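The start method can be chosen explicitly where the platform supports it. The following is a rough sketch rather than a recommendation; `square` and `map_with_start_method` are names invented for this example, while `get_context`, `get_all_start_methods` and `get_start_method` are existing multiprocessing functions.

```python
import multiprocessing as mp
import os

# Top-level statement: under "spawn" each worker re-imports this module,
# so this line runs again in every worker process.
print("module loaded in process", os.getpid())


def square(x):
    return x * x


def map_with_start_method(method, values):
    """Run square in a process pool created with the given start method (hypothetical helper)."""
    ctx = mp.get_context(method)  # raises ValueError if the method is unavailable here
    with ctx.Pool(2) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    print("available:", mp.get_all_start_methods())
    print("default:  ", mp.get_start_method())
    for method in ("fork", "spawn"):
        if method in mp.get_all_start_methods():
            print(method, map_with_start_method(method, range(3)))
```

On Windows only "spawn" is listed, so the loop runs a single iteration there.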
At the time of writing, multiprocessing.pool.ThreadPool is not documented, as its implementation has never been completed; it lacks both tests and documentation. You can see its implementation in the source code.
I believe the next natural question is: when to use a thread based pool and when to use a process based one?
The rule of thumb is:
- IO bound jobs -> multiprocessing.pool.ThreadPool
- CPU bound jobs -> multiprocessing.Pool
- Hybrid jobs -> depends on the workload; I usually prefer multiprocessing.Pool because of the advantages process isolation brings
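A small timing sketch of this rule of thumb, assuming a standard CPython build with the GIL. `fake_io`, `cpu_heavy` and `timed_map` are stand-ins written for this example; real workloads will behave differently, so treat the numbers as indicative only.

```python
import time
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool


def fake_io(delay):
    """Stand-in for an IO-bound call (network, disk); sleeping releases the GIL."""
    time.sleep(delay)
    return delay


def cpu_heavy(n):
    """Stand-in for a CPU-bound call: pure-Python arithmetic that holds the GIL."""
    return sum(i * i for i in range(n))


def timed_map(pool_cls, func, args, workers=4):
    """Return (results, elapsed seconds) for mapping func over args with pool_cls."""
    start = time.perf_counter()
    with pool_cls(workers) as pool:
        results = pool.map(func, args)
    return results, time.perf_counter() - start


if __name__ == "__main__":
    # Four 0.5 s waits overlap in threads, so this takes roughly 0.5 s, not 2 s.
    _, io_time = timed_map(ThreadPool, fake_io, [0.5] * 4)
    print(f"IO bound,  ThreadPool: {io_time:.2f}s")

    # CPU-bound work gains little from threads but can use several cores with processes.
    _, cpu_threads = timed_map(ThreadPool, cpu_heavy, [2_000_000] * 4)
    _, cpu_procs = timed_map(Pool, cpu_heavy, [2_000_000] * 4)
    print(f"CPU bound, ThreadPool: {cpu_threads:.2f}s")
    print(f"CPU bound, Pool:       {cpu_procs:.2f}s")
```

Process startup and argument pickling add overhead, so for very short tasks the process pool may come out slower even on CPU-bound work.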
On Python 3 you might want to take a look at the concurrent.futures.Executor pool implementations.
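For comparison, the question's example might look like this with concurrent.futures. This is only a sketch: `run_with` is a helper name made up here, while `ThreadPoolExecutor` and `ProcessPoolExecutor` are the two concrete Executor classes in the standard library.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def square(x):
    return x * x


def run_with(executor_cls, values, workers=3):
    """Map square over values with the given executor class (hypothetical helper)."""
    with executor_cls(max_workers=workers) as executor:
        # Executor.map returns a lazy iterator, so collect it before the executor shuts down.
        return list(executor.map(square, values))


if __name__ == "__main__":
    print(run_with(ThreadPoolExecutor, range(3)))   # [0, 1, 4]
    print(run_with(ProcessPoolExecutor, range(3)))  # [0, 1, 4]
```

As with the multiprocessing pools, switching between threads and processes is a matter of swapping the class.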
ozn
Updated on July 08, 2022
Comments
-
ozn almost 2 years
What's the difference between ThreadPool and Pool in the multiprocessing module? When I try my code out, this is the main difference I see:

    from multiprocessing import Pool
    import os, time

    print("hi outside of main()")

    def hello(x):
        print("inside hello()")
        print("Proccess id: ", os.getpid())
        time.sleep(3)
        return x*x

    if __name__ == "__main__":
        p = Pool(5)
        pool_output = p.map(hello, range(3))
        print(pool_output)

I see the following output:

    hi outside of main()
    hi outside of main()
    hi outside of main()
    hi outside of main()
    hi outside of main()
    hi outside of main()
    inside hello()
    Proccess id: 13268
    inside hello()
    Proccess id: 11104
    inside hello()
    Proccess id: 13064
    [0, 1, 4]

With "ThreadPool":

    from multiprocessing.pool import ThreadPool
    import os, time

    print("hi outside of main()")

    def hello(x):
        print("inside hello()")
        print("Proccess id: ", os.getpid())
        time.sleep(3)
        return x*x

    if __name__ == "__main__":
        p = ThreadPool(5)
        pool_output = p.map(hello, range(3))
        print(pool_output)

I see the following output:

    hi outside of main()
    inside hello()
    inside hello()
    Proccess id: 15204
    Proccess id: 15204
    inside hello()
    Proccess id: 15204
    [0, 1, 4]

My questions are:
- Why is the "outside __main__()" code run each time in the Pool?
- multiprocessing.pool.ThreadPool doesn't spawn new processes? It just creates new threads?
- If so, what's the difference between using multiprocessing.pool.ThreadPool and just using the threading module?
I don't see any official documentation for ThreadPool anywhere. Can someone point me to where I can find it?
-
ozn over 6 years
Thanks for the answer. I just want to understand this statement: "Note that this happens only if the spawn process creation method is used (only method available on Windows). If you use the fork one (Unix), you will see the message printed only once as for the threads." I'm assuming "spawn" and "fork" are chosen implicitly when I call "map()" or "Pool()"? Or is this something I can control?
-
noxdafox over 6 years
The explanation is in the link I gave you above when mentioning the spawn start method. You can control it, but which start methods are available depends on the OS platform. I assume you are using Windows, as the default start method there is spawn. If so, there's little to do, as Windows only supports spawn.
-
Cedric H. over 5 years
Is the comment about the unfinished implementation of ThreadPool still valid in 2019 with Python 3.7?
-
noxdafox over 5 years
Yes, it is, as you can see from the linked source and the lack of documentation.
-
MrR about 5 years
Because the CPU is not the bottleneck, other threads can be scheduled and run during the time a thread would otherwise have left the CPU idle waiting for IO.
-
Spencer D over 4 years
@MrR, that is absolutely reasonable and true, but it does not actually address why IO bound jobs should prefer ThreadPool over a Pool (process). I would imagine the answer is simply common sense: it takes time to fork off an entire subprocess, and there is additional overhead from not being able to share the same resources.
-
MrR over 4 years
If you can choose between threads and processes, given the same benefits, you should always go with threads because, yes, they have less overhead.
-
Walter Kelt about 2 years
Another reason to use processes as opposed to threads is if the libraries involved are NOT thread-safe. One such notable library is pandas. If you want to use pandas to execute several big data queries concurrently, then processes may be the safest way to go, since processes do not share thread state with one another.