What's the difference between ThreadPool vs Pool in the multiprocessing module?

65,456

The multiprocessing.pool.ThreadPool behaves the same as the multiprocessing.Pool, the only difference being that it uses threads instead of processes to run the worker logic.
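As a rough illustration of that shared interface, here is a minimal sketch that runs the same map() call on both pool classes and collects the PIDs the tasks report. The helper names worker_pid and distinct_pids are made up for this example; they are not part of the multiprocessing API.

```python
import os
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool


def worker_pid(_):
    """Return the PID of whichever process runs this task."""
    return os.getpid()


def distinct_pids(pool_cls, workers=3, tasks=12):
    """Run the same map() call on the given pool class and collect the PIDs seen."""
    with pool_cls(workers) as pool:
        return set(pool.map(worker_pid, range(tasks)))


if __name__ == "__main__":
    print("parent PID:          ", os.getpid())
    # Threads live inside the parent process, so only the parent PID shows up.
    print("ThreadPool task PIDs:", distinct_pids(ThreadPool))
    # Worker processes are children of the parent and report their own PIDs.
    print("Pool task PIDs:      ", distinct_pids(Pool))
```

Only the class passed in changes; the calling code is identical, which is why swapping one pool for the other is usually a one-line change.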

The reason you see

hi outside of main()

being printed multiple times with the multiprocessing.Pool is due to the fact that the pool will spawn 5 independent processes. Each process will initialize its own Python interpreter and load the module resulting in the top level print being executed again.

Note that this happens only if the spawn process creation method is used (the only method available on Windows). If you use the fork method (Unix), you will see the message printed only once, as with the threads.
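If you want to observe this yourself, the start method can be selected explicitly through multiprocessing.get_context(). The following is a sketch only: square and run_with are hypothetical helper names, and which start methods appear depends on your platform.

```python
import multiprocessing as mp

# Top-level statement: it runs again in every process that re-imports this module.
print("module loaded in:", mp.current_process().name)


def square(x):
    return x * x


def run_with(method):
    """Run a small map() on a process pool created with the given start method."""
    ctx = mp.get_context(method)
    with ctx.Pool(2) as pool:
        return pool.map(square, range(3))


if __name__ == "__main__":
    # get_all_start_methods() lists only what the current platform supports.
    for method in mp.get_all_start_methods():
        print(method, "->", run_with(method))
```

With spawn, the "module loaded in" line should show up once per worker in addition to the parent; with fork it should appear only for the parent, since the children inherit the already-loaded module.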

The multiprocessing.pool.ThreadPool is not documented as its implementation has never been completed. It lacks tests and documentation. You can see its implementation in the source code.

I believe the next natural question is: when to use a thread based pool and when to use a process based one?

The rule of thumb is:

  • IO bound jobs -> multiprocessing.pool.ThreadPool
  • CPU bound jobs -> multiprocessing.Pool
  • Hybrid jobs -> depends on the workload, I usually prefer the multiprocessing.Pool due to the advantage process isolation brings
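To get a feel for this rule of thumb, here is a minimal timing sketch. It assumes time.sleep is an acceptable stand-in for an I/O wait and a pure-Python loop for CPU-bound work; fake_io, cpu_heavy and timed_map are names invented for this example, and the actual timings will vary with your machine and Python build.

```python
import time
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool


def fake_io(seconds):
    """Stand-in for an I/O wait (network, disk); sleeping releases the GIL."""
    time.sleep(seconds)
    return seconds


def cpu_heavy(n):
    """Pure-Python arithmetic loop, used here as a CPU-bound task."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed_map(pool_cls, func, items, workers=4):
    """Return (results, elapsed seconds) for pool.map on the given pool class."""
    start = time.perf_counter()
    with pool_cls(workers) as pool:
        results = pool.map(func, items)
    return results, time.perf_counter() - start


if __name__ == "__main__":
    _, t = timed_map(ThreadPool, fake_io, [0.5] * 4)
    print(f"I/O-bound on ThreadPool: {t:.2f}s")
    _, t = timed_map(ThreadPool, cpu_heavy, [2_000_000] * 4)
    print(f"CPU-bound on ThreadPool: {t:.2f}s")
    _, t = timed_map(Pool, cpu_heavy, [2_000_000] * 4)
    print(f"CPU-bound on Pool:       {t:.2f}s")
```

On a standard CPython build with several cores, I would expect the waits to overlap on the ThreadPool, and the CPU-bound work to finish faster on the Pool than on the ThreadPool, but measure on your own workload before deciding.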

On Python 3 you might want to take a look at the concurrent.futures.Executor pool implementations.
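For comparison, a minimal sketch of the same kind of job using the standard library executors ThreadPoolExecutor and ProcessPoolExecutor. The square and run helpers are illustrative names only.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def square(x):
    return x * x


def run(executor_cls, values, workers=3):
    """Map square over values using the given executor class."""
    with executor_cls(max_workers=workers) as executor:
        # Executor.map returns a lazy iterator; list() collects results in input order.
        return list(executor.map(square, values))


if __name__ == "__main__":
    print(run(ThreadPoolExecutor, range(3)))   # thread-based workers
    print(run(ProcessPoolExecutor, range(3)))  # process-based workers
```

As with the multiprocessing pools, the two executors share one interface, so the choice between threads and processes comes down to which class you instantiate.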

Author by

ozn

Updated on July 08, 2022

Comments

  • ozn
    ozn almost 2 years

    What's the difference between ThreadPool and Pool in the multiprocessing module? When I try my code out, this is the main difference I see:

    from multiprocessing import Pool
    import os, time
    
    print("hi outside of main()")
    
    def hello(x):
        print("inside hello()")
        print("Proccess id: ", os.getpid())
        time.sleep(3)
        return x*x
    
    if __name__ == "__main__":
        p = Pool(5)
        pool_output = p.map(hello, range(3))
    
        print(pool_output)
    

    I see the following output:

    hi outside of main()
    hi outside of main()
    hi outside of main()
    hi outside of main()
    hi outside of main()
    hi outside of main()
    inside hello()
    Proccess id:  13268
    inside hello()
    Proccess id:  11104
    inside hello()
    Proccess id:  13064
    [0, 1, 4]
    

    With "ThreadPool":

    from multiprocessing.pool import ThreadPool
    import os, time
    
    print("hi outside of main()")
    
    def hello(x):
        print("inside hello()")
        print("Proccess id: ", os.getpid())
        time.sleep(3)
        return x*x
    
    if __name__ == "__main__":
        p = ThreadPool(5)
        pool_output = p.map(hello, range(3))
    
        print(pool_output)
    

    I see the following output:

    hi outside of main()
    inside hello()
    inside hello()
    Proccess id:  15204
    Proccess id:  15204
    inside hello()
    Proccess id:  15204
    [0, 1, 4]
    

    My questions are:

    • why is the “outside __main__()” run each time in the Pool?

    • multiprocessing.pool.ThreadPool doesn't spawn new processes? It just creates new threads?

    • If so, what's the difference between using multiprocessing.pool.ThreadPool and just using the threading module?

    I don't see any official documentation for ThreadPool anywhere, can someone help me out where I can find it?

  • ozn
    ozn over 6 years
    Thanks for the answer. I just want to understand this statement: "Note that this happens only if the spawn process creation method is used (only method available on Windows). If you use the fork one (Unix), you will see the message printed only once as for the threads." I'm assuming that "spawn" and "fork" are implicit when I call "map()" or "Pool()"? Or is this something I can control?
  • noxdafox
    noxdafox over 6 years
    The explanation is in the link I gave you above when mentioning the spawn start method. You can control it, but the availability of the start methods depends on the OS platform. I assume you are using Windows, as the default start strategy is the spawn one. If so, there's little to do, as Windows only supports spawn.
  • Cedric H.
    Cedric H. over 5 years
    Is the comment about the unfinished implementation of ThreadPool still valid in 2019 with Python 3.7?
  • noxdafox
    noxdafox over 5 years
    Yes it is. As you can see from the linked source and the lack of documentation.
  • MrR
    MrR about 5 years
    Because the CPU is not the bottleneck hence threads can preempt and execute during the time that a thread would have been spinning cpu idle waiting for IO.
  • Spencer D
    Spencer D over 4 years
    @MrR, which is absolutely reasonable and true, but that does not actually address why IO bound jobs should prefer ThreadPool over a Pool (process); although, I would imagine that is answerable simply by common sense regarding the time it takes to fork off an entire subprocess as well as the additional overhead caused by not being able to share the same resources.
  • MrR
    MrR over 4 years
    If you can choose between threads and processes, given the same benefits, you should always go with threads since, yes, there is less overhead.
  • Walter Kelt
    Walter Kelt about 2 years
    Another reason to use Process as opposed to Thread is if the libraries involved in mp are NOT thread-safe. One such notable library is Pandas. If you want to use Pandas to execute several big data queries concurrently, then Process may be the safest way to go, since Processes do not share thread state with one another.