What's the difference between ThreadPool and Pool in the multiprocessing module?

python python-3.x multiprocessing threadpool python-multiprocessing

65,456

The multiprocessing.pool.ThreadPool behaves the same as the multiprocessing.Pool, the only difference being that it uses threads instead of processes to run the worker logic.
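As a rough illustration of the shared interface, the sketch below runs the same map() call through both pool classes and collects the process IDs that did the work. The helper names worker_pid and pids_used are made up for this example and are not part of the multiprocessing module.

```python
import os
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool


def worker_pid(_):
    # Report the id of the process that executed this task.
    return os.getpid()


def pids_used(pool_cls, workers=3, tasks=6):
    # Hypothetical helper: same code path for both pool classes,
    # since ThreadPool exposes the same map() interface as Pool.
    with pool_cls(workers) as pool:
        return set(pool.map(worker_pid, range(tasks)))


if __name__ == "__main__":
    print("main pid:", os.getpid())
    # Threads live inside the main process, so only the main pid shows up.
    print("ThreadPool pids:", pids_used(ThreadPool))
    # Worker processes each have their own pid, different from the main one.
    print("Pool pids:", pids_used(Pool))
```

With ThreadPool the set should contain only the main process ID, which matches the single process ID seen in the question's ThreadPool output.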

The reason you see

hi outside of main()

being printed multiple times with the multiprocessing.Pool is that the pool spawns 5 independent processes. Each process initializes its own Python interpreter and loads the module, so the top-level print is executed again.

Note that this happens only if the spawn start method is used (the only method available on Windows). With the fork start method (available on Unix), you will see the message printed only once, as with threads.
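The start method can also be chosen explicitly instead of relying on the platform default. The following is a minimal sketch, assuming a small helper called run_with (a name invented here) that builds a pool from multiprocessing.get_context(). Which methods are listed depends on the operating system.

```python
import multiprocessing as mp
import os

# Top-level code: runs again in every worker that re-imports this module.
print("module loaded in process", os.getpid())


def square(x):
    return x * x


def run_with(method, workers=2):
    # Hypothetical helper: create a pool bound to one specific start method.
    ctx = mp.get_context(method)
    with ctx.Pool(workers) as pool:
        return pool.map(square, range(3))


if __name__ == "__main__":
    # Try every start method the current platform offers.
    for method in mp.get_all_start_methods():
        print(method, "->", run_with(method))
```

Running this on a Unix system should show the "module loaded" line repeated for the spawn workers but not for the fork workers.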

The multiprocessing.pool.ThreadPool is not documented as its implementation has never been completed. It lacks tests and documentation. You can see its implementation in the source code.

I believe the next natural question is: when to use a thread based pool and when to use a process based one?

The rule of thumb is:

IO bound jobs -> multiprocessing.pool.ThreadPool
CPU bound jobs -> multiprocessing.Pool
Hybrid jobs -> depends on the workload, I usually prefer the multiprocessing.Pool due to the advantage process isolation brings

On Python 3 you might want to take a look at the concurrent.futures.Executor pool implementations.
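To make the rule of thumb concrete, here is a small sketch using the two executor classes from concurrent.futures. The functions fake_io and cpu_heavy are stand-ins invented for this example: the first only sleeps to imitate waiting on IO, the second does plain arithmetic.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fake_io(x):
    time.sleep(0.1)  # stand-in for a network or disk wait
    return x * x


def cpu_heavy(n):
    # Pure computation that keeps the CPU busy.
    return sum(i * i for i in range(n))


def run_io_bound(items, workers=5):
    # Threads are enough here: the workers spend their time waiting.
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(fake_io, items))


def run_cpu_bound(items, workers=2):
    # Separate processes let the computations run in parallel on several cores.
    with ProcessPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(cpu_heavy, items))


if __name__ == "__main__":
    print(run_io_bound(range(3)))
    print(run_cpu_bound([10, 100, 1000]))
```

Both executors share the same submit() and map() interface, so switching between threads and processes for a hybrid workload is mostly a matter of changing the class name.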

Author by

ozn

Updated on July 08, 2022

Comments

ozn almost 2 years

What's the difference between ThreadPool and Pool in the multiprocessing module? When I try my code out, this is the main difference I see:

from multiprocessing import Pool
import os, time

print("hi outside of main()")

def hello(x):
    print("inside hello()")
    print("Proccess id: ", os.getpid())
    time.sleep(3)
    return x*x

if __name__ == "__main__":
    p = Pool(5)
    pool_output = p.map(hello, range(3))

    print(pool_output)

I see the following output:

hi outside of main()
hi outside of main()
hi outside of main()
hi outside of main()
hi outside of main()
hi outside of main()
inside hello()
Proccess id:  13268
inside hello()
Proccess id:  11104
inside hello()
Proccess id:  13064
[0, 1, 4]

With "ThreadPool":

from multiprocessing.pool import ThreadPool
import os, time

print("hi outside of main()")

def hello(x):
    print("inside hello()")
    print("Proccess id: ", os.getpid())
    time.sleep(3)
    return x*x

if __name__ == "__main__":
    p = ThreadPool(5)
    pool_output = p.map(hello, range(3))

    print(pool_output)

I see the following output:

hi outside of main()
inside hello()
inside hello()
Proccess id:  15204
Proccess id:  15204
inside hello()
Proccess id:  15204
[0, 1, 4]

My questions are:

Why is the "hi outside of main()" line run each time with the Pool?
Does multiprocessing.pool.ThreadPool not spawn new processes? Does it just create new threads?
If so, what's the difference between using multiprocessing.pool.ThreadPool and just using the threading module?

I don't see any official documentation for ThreadPool anywhere. Can someone point me to where I can find it?

ozn over 6 years

Thanks for the answer. I just want to understand this statement: "Note that this happens only if the spawn process creation method is used (only method available on Windows). If you use the fork one (Unix), you will see the message printed only once as for the threads." I'm assuming "spawn" and "fork" are chosen implicitly when I call "map()" or "Pool()"? Or is this something I can control?
noxdafox over 6 years

The explanation is in the link I gave you above when mentioning the spawn start method. You can control it, but the availability of start methods depends on the OS platform. I assume you are using Windows, as the default start strategy is the spawn one. If so, there's little to do, as Windows only supports spawn.
Cedric H. over 5 years

Is the comment about the unfinished implementation of ThreadPool still valid in 2019 with Python 3.7?
noxdafox over 5 years

Yes, it is, as you can see from the linked source and the lack of documentation.
MrR about 5 years

Because the CPU is not the bottleneck, other threads can be scheduled and run during the time that a thread would otherwise have left the CPU idle waiting for IO.
Spencer D over 4 years

@MrR, which is absolutely reasonable and true, but that does not actually address why IO bound jobs should prefer ThreadPool over a Pool (process); although, I would imagine that is answerable simply by common sense regarding the time it takes to fork off an entire subprocess as well as the additional overhead caused by not being able to share the same resources.
MrR over 4 years

If you can choose between threads and processes, given the same benefits, you should always go with threads since, yes, there is less overhead.
Walter Kelt about 2 years

Another reason to use Process as opposed to Thread is if the libraries involved in mp are NOT thread-safe. One such notable library is Pandas. If you want to use Pandas to execute several big data queries concurrently, then Process may be the safest way to go, since Processes do not share thread state with one another.