How to process a list in parallel in Python?

Solution 1

Assuming CPython and the GIL here.

If your task is I/O-bound, threading is generally more efficient, since the threads simply hand the work off to the operating system and idle until the I/O finishes. Spawning processes is a heavyweight way to babysit I/O.

However, most file systems don't handle concurrent writes well, so multithreading or multiprocessing may not be any faster than writing synchronously.

Nonetheless, here's a contrived example of multiprocessing.Pool.map which may help with your CPU-bound work:

from multiprocessing import cpu_count, Pool

def process(data):
    # best to do heavy CPU-bound work here...

    # file write for demonstration
    with open("%s.txt" % data, "w") as f:
        f.write(data)

    # example of returning a result to the map
    return data.upper()

if __name__ == "__main__":  # guard so worker processes can import this module safely
    tasks = ["data1", "data2", "data3"]
    pool = Pool(max(1, cpu_count() - 1))
    print(pool.map(process, tasks))

A similar setup for threading can be found in concurrent.futures.ThreadPoolExecutor.
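
For instance, a minimal sketch of that setup (the worker function and the max_workers value here are illustrative, mirroring the example above):

from concurrent.futures import ThreadPoolExecutor

def process(data):
    # I/O-bound work goes here; a file write for demonstration
    with open("%s.txt" % data, "w") as f:
        f.write(data)
    return data.upper()

tasks = ["data1", "data2", "data3"]

# executor.map, like Pool.map, blocks until done and returns results in input order
with ThreadPoolExecutor(max_workers=4) as executor:
    print(list(executor.map(process, tasks)))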

As an aside, all is a builtin function and isn't a great variable name choice.

Solution 2

Or:

from threading import Thread

def process(data):
    print("processing {}".format(data))

l = ["data1", "data2", "data3"]

for task in l:
    t = Thread(target=process, args=(task,))
    t.start()

Or, with f-strings (Python 3.6+ only):

from threading import Thread

def process(data):
    print(f"processing {data}")

l = ["data1", "data2", "data3"]

for task in l:
    t = Thread(target=process, args=(task,))
    t.start()
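
Note that both snippets only start the threads and don't wait for them to finish. If you need to block until all of the work is done, keep the Thread objects around and join them; a minimal sketch along the same lines:

from threading import Thread

def process(data):
    print(f"processing {data}")

tasks = ["data1", "data2", "data3"]

threads = [Thread(target=process, args=(task,)) for task in tasks]
for t in threads:
    t.start()
for t in threads:
    t.join()  # block until every worker thread has finished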

Solution 3

Here is a template using multiprocessing.dummy (the thread-based interface to the multiprocessing API); hope it helps.

from multiprocessing.dummy import Pool as ThreadPool

def process(data):
    print("processing {}".format(data))

alldata = ["data1", "data2", "data3"]

pool = ThreadPool()

results = pool.map(process, alldata)

pool.close()
pool.join()
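
As the comment below notes, multiprocessing.dummy gives you a thread pool, so under the GIL it won't spread CPU-bound work across cores. The same template works with a real process pool; a minimal sketch, assuming the worker is importable from a module (the return value is only for illustration):

from multiprocessing import Pool

def process(data):
    print("processing {}".format(data))
    return data.upper()

if __name__ == "__main__":  # guard so worker processes can import this module safely
    alldata = ["data1", "data2", "data3"]

    with Pool() as pool:  # defaults to one worker per CPU core
        results = pool.map(process, alldata)

    print(results)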

Comments

  • Joan Venge almost 2 years

    I wrote code like this:

    def process(data):
       #create file using data
    
    all = ["data1", "data2", "data3"]
    

    I want to execute the process function on my all list in parallel. Because each call only creates a small file, I'm not concerned about disk writes, but the processing takes long, so I want to use all of my cores.

    How can I do this using the default modules in Python 2.7?

    • Sraw over 5 years
      To help you search next time, keywords: multiprocessing python map. Then you can easily find the solution on Google.
  • Sraw over 5 years
    OP says "I want to use all of my cores". ThreadPool won't use more than one core.
  • Mauricio over 2 years
    This solution for Python 3.6+ is exactly what I was looking for.