Python Multiprocessing Process or Pool for what I am doing?

Solution 1

The two scenarios you listed accomplish the same thing but in slightly different ways.

The first scenario starts two separate processes (call them P1 and P2) and starts P1 running foo and P2 running bar, and then waits until both processes have finished their respective tasks.

The second scenario starts two processes (call them Q1 and Q2) and first starts foo on either Q1 or Q2, and then starts bar on either Q1 or Q2. Then the code waits until both function calls have returned.

So the net result is actually the same, but in the first case you're guaranteed to run foo and bar on different processes.

As for your specific questions about concurrency: the .join() method on a Process does indeed block until the process has finished, but because you called .start() on both P1 and P2 (in your first scenario) before joining, both processes run concurrently. The interpreter will, however, wait until P1 finishes before attempting to wait for P2 to finish.
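
A quick way to see this (a minimal sketch; the sleeps just simulate work, and the timings are approximate):

    import time
    from multiprocessing import Process

    def foo():
        time.sleep(2)  # pretend foo does 2 seconds of work
        print('foo done')

    def bar():
        time.sleep(1)  # bar finishes first
        print('bar done')

    if __name__ == '__main__':
        start = time.time()
        p1 = Process(target=foo)
        p2 = Process(target=bar)
        p1.start()  # both processes are running from this point on
        p2.start()
        p1.join()   # blocks about 2s; p2 finishes in the meantime
        p2.join()   # returns almost immediately
        print('elapsed: %.1fs' % (time.time() - start))  # ~2s, not 3s

If the joins serialized the work, the total would be about 3 seconds; it is about 2 because both processes run at the same time.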

For your questions about the pool scenario: you should technically call pool.close(), but it somewhat depends on what you might need the pool for afterwards (if it just goes out of scope, you don't necessarily need to close it). pool.map() is a completely different kind of animal: it distributes a bunch of arguments to the same function (asynchronously) across the pool processes, and then waits until all function calls have completed before returning the list of results.
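
For example (a sketch with a made-up square function), pool.map fans one function out over many arguments and hands back the results in input order:

    from multiprocessing import Pool

    def square(x):
        return x * x

    if __name__ == '__main__':
        pool = Pool(processes=4)
        results = pool.map(square, range(10))  # blocks until all calls finish
        pool.close()
        pool.join()
        print(results)  # [0, 1, 4, 9, ..., 81]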

Solution 2

Since you're fetching data with curl calls, you are IO-bound. In that case grequests might come in handy. These are really neither processes nor threads but coroutines - lightweight threads. They would let you send HTTP requests asynchronously, and then use multiprocessing.Pool to speed up the CPU-bound part.
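
A minimal sketch of that split (the URLs and the parse function are placeholders, and grequests has to be installed separately):

    import grequests
    from multiprocessing import Pool

    URLS = ['http://example.com/a', 'http://example.com/b']  # hypothetical endpoints

    def parse(body):
        return len(body)  # stand-in for the CPU-bound parsing step

    if __name__ == '__main__':
        # IO-bound part: all HTTP requests are in flight at once (coroutines)
        responses = grequests.map(grequests.get(u) for u in URLS)
        # CPU-bound part: parse the response bodies across pool workers
        pool = Pool(processes=2)
        parsed = pool.map(parse, [r.text for r in responses if r is not None])
        pool.close()
        pool.join()
        print(parsed)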

1) Since join blocks until the joined process is completed... does this mean the p1 process has to finish before the p2 process is kicked off?

Yes, p2.join() is called after p1.join() has returned, meaning p1 has finished.

1) Do I need a pool.close(), pool.join()?

You could end up with orphaned processes if you don't call close() and join() (e.g. if the workers serve indefinitely).
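
The usual shutdown sequence looks like this (a sketch; work is a placeholder task):

    from multiprocessing import Pool

    def work(x):
        return x + 1  # placeholder task

    if __name__ == '__main__':
        pool = Pool(processes=2)
        result = pool.apply_async(work, (1,))
        pool.close()  # no more tasks may be submitted
        pool.join()   # wait for the workers to exit cleanly
        print(result.get())  # 2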

2) Would pool.map() make them all complete before I could get results? And if so, are they still run async?

They are run asynchronously, but map() blocks until all tasks are done.
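
If you want map-style distribution without the blocking, multiprocessing also offers map_async, which returns an AsyncResult immediately (a sketch with a placeholder task):

    from multiprocessing import Pool

    def work(x):
        return x * 2  # placeholder task

    if __name__ == '__main__':
        pool = Pool(processes=2)
        async_result = pool.map_async(work, range(5))  # returns immediately
        # ...the parent is free to do other work here...
        print(async_result.get())  # the blocking moves here: [0, 2, 4, 6, 8]
        pool.close()
        pool.join()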

3) How would pool.apply_async() differ from doing each process with pool.apply()?

pool.apply() is blocking, so you would effectively be doing the processing synchronously.
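
A sketch that makes the difference visible (the sleeps just simulate slow tasks; timings are approximate):

    import time
    from multiprocessing import Pool

    def slow(x):
        time.sleep(1)
        return x

    if __name__ == '__main__':
        pool = Pool(processes=2)

        t = time.time()
        pool.apply(slow, (1,))  # blocks; the second call can't start yet
        pool.apply(slow, (2,))
        print('apply: %.1fs' % (time.time() - t))  # ~2s, serialized

        t = time.time()
        r1 = pool.apply_async(slow, (1,))  # both tasks are submitted at once
        r2 = pool.apply_async(slow, (2,))
        r1.get()
        r2.get()
        print('apply_async: %.1fs' % (time.time() - t))  # ~1s, overlapped

        pool.close()
        pool.join()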

4) How would this differ from the previous implementation with Process?

Chances are one worker is done with foo before you apply bar, so you might end up with a single worker doing all the work. Also, if one of your workers dies, Pool automatically spawns a new one (though you'd need to resubmit the task that died with it).
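
You can watch that happen by printing the worker's PID (a sketch; the .get() between the two calls guarantees foo is done before bar is submitted):

    import os
    from multiprocessing import Pool

    def foo():
        return 'foo ran on pid %d' % os.getpid()

    def bar():
        return 'bar ran on pid %d' % os.getpid()

    if __name__ == '__main__':
        pool = Pool(processes=2)
        print(pool.apply_async(foo).get())  # foo finishes first...
        print(pool.apply_async(bar).get())  # ...so bar will likely land on the same worker
        pool.close()
        pool.join()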

To sum up: I would rather go with Pool - it's perfect for producer-consumer cases and takes care of all the task-distributing logic.

Comments

  • dman
    dman over 3 years

    I'm new to multiprocessing in Python and trying to figure out if I should use Pool or Process for calling two functions async. The two functions I have make curl calls and parse the information into 2 separate lists. Depending on the internet connection, each function could take about 4 seconds. I realize that the bottleneck is the ISP connection and multiprocessing won't speed it up much, but it would be nice to have them both kick off async. Plus, this is a great learning experience for me to get into Python's multiprocessing because I will be using it more later.

    I have read Python multiprocessing.Pool: when to use apply, apply_async or map? and it was useful, but still had my own questions.

    So one way I could do it is:

    from multiprocessing import Process

    def foo():
        pass
    
    def bar():
        pass
    
    p1 = Process(target=foo, args=())
    p2 = Process(target=bar, args=())
    
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    

    Questions I have for this implementation: 1) Since join blocks until the joined process is completed... does this mean the p1 process has to finish before the p2 process is kicked off? I always understood .join() to be the same as pool.apply() and pool.apply_async().get(), where the parent process cannot launch another process (task) until the current one running is completed.

    The other alternative would be something like:

    from multiprocessing import Pool

    def foo():
        pass

    def bar():
        pass

    pool = Pool(processes=2)
    p1 = pool.apply_async(foo)
    p2 = pool.apply_async(bar)
    

    Questions I have for this implementation would be: 1) Do I need a pool.close(), pool.join()? 2) Would pool.map() make them all complete before I could get results? And if so, are they still run async? 3) How would pool.apply_async() differ from doing each process with pool.apply()? 4) How would this differ from the previous implementation with Process?

  • dman
    dman over 10 years
    Are you sure the p1 process has to finish before the p2 process is kicked off because of join()? The output from bpaste.net/show/ruHgFTAAMkN4UT2INPqu looks like p2 kicks off before p1 finishes.
  • Maciej Gol
    Maciej Gol over 10 years
    I meant that p2 will be joined after p1 has finished joining. Sorry for the misunderstanding. Of course, both processes kick off as soon as the appropriate start() has returned.
  • SeasonalShot
    SeasonalShot over 5 years
    For the first part, does it run the processes on multiple cores? What happens when the number of processes > the number of cores?