parallelize 'for' loop in Python 3
My guess is that you want to work on several files at the same time. To do so, the best way (in my opinion) is to use multiprocessing. To use it, you need to define an elementary step, and that step is already done in your code.
import os
import multiprocessing as mp
import numpy as np
import xarray as xray  # the code uses the old `xray` name for the xarray package

def f(file):
    mindex = np.zeros((1200, 1200))
    for i in range(1200):
        # note: opening the dataset once before the loop would be much faster
        var1 = xray.open_dataset(file)['variable'][:, i, :].data
        for j in range(1200):
            var2 = var1[:, j]
            ## Mathematical Calculations to find var3[i,j] ##
            mindex[i, j] = var3[i, j]
    return (file, mindex)

if __name__ == '__main__':
    N = mp.cpu_count()
    files = os.scandir(folder)  # folder: the directory containing your .nc files
    with mp.Pool(processes=N) as p:
        # use file.path (not file.name) so the files open from any working directory
        results = p.map(f, [file.path for file in files])
This should return a list results in which each element is a tuple with the file name and the mindex matrix. With this, you can work on multiple files at the same time. It is particularly efficient when the computation on each file is long.
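To make the shape of that return value concrete, here is a minimal sketch of consuming results; the (file, mindex) pairs below are hypothetical stand-ins for what p.map would return, and the dict/stack post-processing is one possible way to organize them, not part of the original answer:

```python
import numpy as np

# Hypothetical stand-in for the list returned by p.map(f, ...):
# each element is a (file name, mindex matrix) tuple.
results = [("a.nc", np.zeros((1200, 1200))),
           ("b.nc", np.ones((1200, 1200)))]

# Turn the pairs into a dict for O(1) lookup by file name...
mindex_by_file = dict(results)

# ...or stack all matrices into one (n_files, 1200, 1200) array.
stacked = np.stack([m for _, m in results])
```

Keeping the file name in the tuple matters because Pool.map preserves input order but you often want to address results by file rather than by position.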
Nirav L Lekinwala
Updated on June 17, 2022

Comments
-
Nirav L Lekinwala almost 2 years
I am trying to do some analysis of the MODIS satellite data. My code primarily reads a lot of files (806) of dimension 1200 by 1200 (806*1200*1200). I do it using a for loop and perform mathematical operations. The following is the general way in which I read files:
mindex = np.zeros((1200, 1200))
for i in range(1200):
    var1 = xray.open_dataset('filename.nc')['variable'][:, i, :].data
    for j in range(1200):
        var2 = var1[:, j]
        ## Mathematical Calculations to find var3[i,j] ##
        mindex[i, j] = var3[i, j]
Since it's a lot of data to handle, the process is very slow, and I was considering parallelizing it. I tried doing something with joblib, but I have not been able to make it work. I am unsure how to tackle this problem.
-
Vincent almost 6 years Note that multithreading should work fine too if the computational part is done with numpy.
-
Mathieu almost 6 years @Vincent True, numpy is indeed dropping down to C.
-
Mathieu almost 6 years @Vincent However, it might need more adaptation of the code, especially to use array-wide operations to obtain mindex instead of this double for loop.
-
Vincent almost 6 years Since most of the time is probably spent in the IO operations and the numpy functions, I would expect mp.pool.ThreadPool to achieve the same performance (maybe even better, since mindex doesn't have to be serialized). This would require a benchmark, though.
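The ThreadPool variant Vincent mentions is a drop-in swap: multiprocessing.pool.ThreadPool exposes the same map API as Pool but runs workers as threads, so arguments and results are never pickled. The per-file function f below is a hypothetical stand-in for the real I/O-bound work, just to show the call shape:

```python
from multiprocessing.pool import ThreadPool

def f(file):
    # Stand-in for the per-file work; in the real code this would
    # open the .nc file and compute the mindex matrix.
    return (file, len(file))

# Same API as mp.Pool, but threads instead of processes: no pickling,
# and no `if __name__ == '__main__'` guard is required.
with ThreadPool(processes=4) as p:
    results = p.map(f, ["a.nc", "b.nc", "c.nc"])
```

Threads share memory, so this only pays off when the loop body spends its time in file I/O or in numpy calls that release the GIL, as the comments above note; pure-Python computation would still serialize on the GIL.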