Python multiprocessing: How to know to use Pool or Process?
Solution 1
I think the Pool class is typically more convenient, but it depends on whether you want your results ordered or unordered.
Say you want to create 4 random strings (e.g., for a random user ID generator):
import multiprocessing as mp
import random
import string

# Define an output queue
output = mp.Queue()

# Define an example function
def rand_string(length, output):
    """ Generates a random string of numbers, lower- and uppercase chars. """
    rand_str = ''.join(random.choice(
                        string.ascii_lowercase
                        + string.ascii_uppercase
                        + string.digits)
                   for i in range(length))
    output.put(rand_str)

# Set up a list of processes that we want to run
processes = [mp.Process(target=rand_string, args=(5, output)) for x in range(4)]

# Run processes
for p in processes:
    p.start()

# Exit the completed processes
for p in processes:
    p.join()

# Get process results from the output queue
results = [output.get() for p in processes]

print(results)

# Output
# ['yzQfA', 'PQpqM', 'SHZYV', 'PSNkD']
Here, the order probably doesn't matter. I am not sure if there is a better way to do it, but if I want to keep track of results in the order in which the functions are called, I typically return tuples with an ID as first item, e.g.,
# Define an example function
def rand_string(length, pos, output):
    """ Generates a random string of numbers, lower- and uppercase chars. """
    rand_str = ''.join(random.choice(
                        string.ascii_lowercase
                        + string.ascii_uppercase
                        + string.digits)
                   for i in range(length))
    output.put((pos, rand_str))

# Set up a list of processes that we want to run
processes = [mp.Process(target=rand_string, args=(5, x, output)) for x in range(4)]

# Run the processes, wait for them, and collect the results as before
for p in processes:
    p.start()
for p in processes:
    p.join()
results = [output.get() for p in processes]

print(results)

# Output
# [(1, '5lUya'), (3, 'QQvLr'), (0, 'KAQo6'), (2, 'nj6Q0')]
This lets me sort the results afterwards:
results.sort()
results = [r[1] for r in results]
print(results)
# Output:
# ['KAQo6', '5lUya', 'nj6Q0', 'QQvLr']
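Putting the pieces above together, here is a minimal sketch of the ordered Process-plus-Queue approach. It is an illustration under a few assumptions, not the only way to do it: the helper names make_string, worker, order_results and main are hypothetical, the process-starting code sits behind an `if __name__ == '__main__':` guard (needed on platforms that start workers by spawning a fresh interpreter), and the queue is drained before the processes are joined, since the multiprocessing documentation warns that joining a process that still has items buffered on a queue can block.

```python
import multiprocessing as mp
import random
import string

ALPHABET = string.ascii_letters + string.digits


def make_string(length):
    """Return a random string of letters and digits (hypothetical helper)."""
    return ''.join(random.choice(ALPHABET) for _ in range(length))


def worker(length, pos, output):
    """Put a (position, string) tuple on the shared queue."""
    output.put((pos, make_string(length)))


def order_results(tagged):
    """Sort (position, value) tuples by position and drop the position."""
    return [value for _, value in sorted(tagged)]


def main():
    output = mp.Queue()
    processes = [mp.Process(target=worker, args=(5, pos, output))
                 for pos in range(4)]
    for p in processes:
        p.start()
    # Drain the queue first, then join the finished processes.
    tagged = [output.get() for _ in processes]
    for p in processes:
        p.join()
    print(order_results(tagged))


if __name__ == '__main__':
    main()
```

The sorting step is kept in its own function so it can be checked without starting any processes.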
The Pool class
Now to your question: How is this different from the Pool class?
You'd typically prefer Pool.map to return an ordered list of results without going through the hoop of creating tuples and sorting them by ID. Thus, I would say it is typically more efficient.
def cube(x):
    return x**3

pool = mp.Pool(processes=4)
results = pool.map(cube, range(1,7))
print(results)

# output:
# [1, 8, 27, 64, 125, 216]
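Since the choice depends on whether order matters, it may help to see the ordered and unordered variants side by side. The sketch below assumes Python 3.3 or newer, where Pool works as a context manager and Pool.starmap is available. scaled_cube is a hypothetical three-argument task standing in for a function that takes several arguments, which starmap can call without a separate unpacking helper.

```python
import multiprocessing as mp


def cube(x):
    return x ** 3


def scaled_cube(scale, offset, x):
    """Hypothetical multi-argument task used only for illustration."""
    return scale * x ** 3 + offset


def main():
    with mp.Pool(processes=4) as pool:
        # map returns results in the same order as the inputs.
        ordered = pool.map(cube, range(1, 7))
        # imap_unordered yields each result as soon as it is ready,
        # so the arrival order may differ from the input order.
        unordered = list(pool.imap_unordered(cube, range(1, 7)))
        # starmap unpacks each tuple into positional arguments.
        multi = pool.starmap(scaled_cube, [(2, 1, i) for i in range(1, 4)])
    print(ordered)
    print(sorted(unordered))
    print(multi)


if __name__ == '__main__':
    main()
```

If the order of results is irrelevant, imap_unordered lets the main program start consuming results before the slowest task has finished.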
Equivalently, there is also an "apply" method:
pool = mp.Pool(processes=4)
results = [pool.apply(cube, args=(x,)) for x in range(1,7)]
print(results)
# output:
# [1, 8, 27, 64, 125, 216]
Both Pool.apply and Pool.map will block the main program until the result is ready.
Now, you also have Pool.apply_async and Pool.map_async, which return immediately with an AsyncResult object instead of blocking; the actual values are fetched later with its get() method, which is essentially similar to collecting the output of the Process class above. The advantage may be that they provide you with the convenient apply and map functionality that you know from Python's built-in map (and from the apply built-in of Python 2).
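A minimal sketch of the asynchronous variants, assuming Python 3.3+ for the context-manager form of Pool. The timeout of 10 seconds is an arbitrary choice for illustration.

```python
import multiprocessing as mp


def cube(x):
    return x ** 3


def main():
    with mp.Pool(processes=4) as pool:
        # Both calls return right away with AsyncResult objects.
        pending = [pool.apply_async(cube, args=(x,)) for x in range(1, 7)]
        batch = pool.map_async(cube, range(1, 7))

        # The main program is free to do other work here.

        # get() blocks until the corresponding result is available.
        from_apply = [r.get(timeout=10) for r in pending]
        from_map = batch.get(timeout=10)
    print(from_apply)
    print(from_map)


if __name__ == '__main__':
    main()
```

Because the AsyncResult objects are kept in submission order, the collected list is ordered even though the tasks themselves may finish in any order.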
Solution 2
You can easily do this with pypeln:
import pypeln as pl

stage = pl.process.map(
    CreateMatrixMp,
    range(self.numPixels),
    workers=poolCount,
    maxsize=2,
)

# iterate over it in the main process
for x in stage:
    # code

# or convert it to a list
data = list(stage)
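For comparison, a roughly similar bounded-worker pattern can be sketched with the standard library alone. This is not how pypeln works internally, only an illustration: create_matrix and build_args are hypothetical stand-ins for the question's functions, and the arithmetic inside create_matrix is a placeholder. Using a module-level function rather than an instance method is generally the easiest thing for multiprocessing to pickle.

```python
import multiprocessing as mp


def create_matrix(args):
    """Hypothetical stand-in for the per-pixel work in the question."""
    sigma_i, sigma_x, pixel = args
    return pixel * (sigma_i + sigma_x)


def build_args(sigma_i, sigma_x, num_pixels):
    """Build one argument tuple per pixel."""
    return [(sigma_i, sigma_x, pixel) for pixel in range(num_pixels)]


def main():
    args = build_args(0.5, 1.5, 1000)
    # One worker per CPU: the pool hands the 1000 tasks to these workers
    # instead of starting 1000 processes at once.
    with mp.Pool(processes=mp.cpu_count()) as pool:
        # imap keeps the input order and yields results lazily.
        data = list(pool.imap(create_matrix, args, chunksize=50))
    print(len(data))


if __name__ == '__main__':
    main()
```

The chunksize of 50 is an arbitrary value; larger chunks reduce communication overhead when each task is cheap.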
pretzlstyle
Updated on February 20, 2020
Comments
-
pretzlstyle about 4 years
So I have an algorithm I am writing, and the function multiprocess is supposed to call another function, CreateMatrixMp(), on as many processes as there are cpus, in parallel. I have never done multiprocessing before, and cannot be certain which one of the below methods is more efficient. The word "efficient" being used in the context of the function CreateMatrixMp() needing to potentially be called thousands of times.
I have read all of the documentation on the python multiprocessing module, and have come to these two possibilities:
First is using the Pool class:
def MatrixHelper(self, args):
    return self.CreateMatrix(*args)

def Multiprocess(self, sigmaI, sigmaX):
    cpus = mp.cpu_count()
    print('Number of cpu\'s to process WM: %d' % cpus)
    poolCount = cpus*2
    args = [(sigmaI, sigmaX, i) for i in range(self.numPixels)]
    pool = mp.Pool(processes = poolCount, maxtasksperchild= 2)
    tempData = pool.map(self.MatrixHelper, args)
    pool.close()
    pool.join()
And next is using the Process class:
def Multiprocess(self, sigmaI, sigmaX):
    cpus = mp.cpu_count()
    print('Number of cpu\'s to process WM: %d' % cpus)
    processes = [mp.Process(target = self.CreateMatrixMp, args = (sigmaI, sigmaX, i,)) for i in range(self.numPixels)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
Pool seems to be the better choice. I have read that it causes less overhead. And Process does not consider the number of cpus on the machine. The only problem is that using Pool in this manner gives me error after error, and whenever I fix one, there is a new one underneath it. Process seems easier to implement, and for all I know it may be the better choice. What does your experience tell you?
If Pool should be used, then am I right in choosing map()? It would be preferred that order is maintained. I have tempData = pool.map(...) because the map function is supposed to return a list of the results of every process. I am not sure how Process handles its returned data.
-
Martin Alderete almost 9 years
This seems similar to stackoverflow.com/questions/18176178/… Thanks
-
pretzlstyle almost 9 years
@MartinAlderete I have read that post. However, I ask some different questions here that need answering. He is using two target functions while I am using one, he passes no arguments while I pass multiple, and he does not have to be concerned with his target being an instance method, as mine is. I have done a lot of research on this so far, and both Pool and Process seem to behave differently under different contexts, and it certainly seems that one would be better in certain situations, while in others it wouldn't. I thought it appropriate to start a new question.
-
pretzlstyle almost 9 years
Thank you for the answer. My main concern in choosing one over the other is the consideration of the machine's number of processors. In Pool, you can set it so that there are the same number of processes as cpus. But by using Process, the number of cpus comes in nowhere. Will this be handled on its own? Will it result in slightly more time to go through by using Process? I would like to use Process, and Pool is giving me a very hard time with all of its pickle errors, but Process just doesn't feel concrete enough like Pool does.
-
Admin almost 9 years
I think "Process" is more of a "bare bones" approach; as far as I know, you'd have to manage it manually. When you spawn a process but all CPUs are busy, it will be queued until a CPU becomes free again. This could potentially be a problem if you are dispatching too many processes at once which are waiting (it could easily eat up your system's available memory if the number is "relatively" large).
-
pretzlstyle almost 9 years
Yes, the memory was a problem when trying to use Process. I have gotten it to work with Pool, thank you for your advice, I'll mark this as answered.