Using python's multiprocessing on slurm


Solution 1

Your current code will run 10 processes on the 5 processors of a SINGLE node, the one where you start the script. It has nothing to do with SLURM at this point.

You will have to SBATCH the script to SLURM.

If you want to run this script on 5 cores with SLURM, modify the script like this:

#!/usr/bin/python3

#SBATCH --output=wherever_you_want_to_store_the_output.log
#SBATCH --partition=whatever_the_name_of_your_SLURM_partition_is
#SBATCH -n 5 # 5 cores

import sys
import os
import multiprocessing

# Necessary to add cwd to path when script run
# by SLURM (since it executes a copy)
sys.path.append(os.getcwd())

def hello():
    print("Hello World")

if __name__ == "__main__":
    jobs = []
    for j in range(10):                           # range(10), not range(len(10))
        p = multiprocessing.Process(target=hello)  # target must be a function defined above
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()

And then execute the script with

sbatch my_python_script.py

on a node where SLURM is installed.

However, this will still allocate your job to a SINGLE node, so it will be no faster than simply running it on one node.

I don't know why you would want to run it on different nodes when you have just 5 processes; it will be faster to just run it on one node. If you request more than 5 cores at the beginning of the Python script, then SLURM will allocate more nodes for you.
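
For example (a rough sketch with placeholder values, not taken from the question), a header like this requests 20 tasks, which on 5-core nodes forces SLURM to reserve several nodes:

#!/usr/bin/python3

#SBATCH --output=wherever_you_want_to_store_the_output.log
#SBATCH --partition=whatever_the_name_of_your_SLURM_partition_is
#SBATCH -n 20 # 20 tasks; with 5-core nodes SLURM has to spread them over 4 nodes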

Solution 2

Just a hint: you need to understand what core, thread, socket, CPU, node, task, job, and job step mean in SLURM.

If there is absolutely no interaction between your tasks, just use:

srun -n 20 python serial_script.py

SLURM will allocate resources for you automatically.
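
Each of those 20 copies can tell which task it is from the SLURM_PROCID environment variable that srun sets. A minimal sketch of what serial_script.py could look like (the file name comes from the command above; the work itself is a placeholder):

import os

def hello(rank):
    print("Hello World from task", rank)

if __name__ == "__main__":
    # srun -n 20 starts 20 copies of this script; SLURM_PROCID is 0..19, one value per copy
    rank = int(os.environ.get("SLURM_PROCID", "0"))
    hello(rank)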

If you want to run 4 tasks on 4 nodes, with each task using 5 cores, you can use this command:

srun -n 4 -c 5 -N 4 --cpu-bind=verbose,nodes python parallel_5_core_script.py

It will run 4 tasks (-n 4) on 4 nodes (-N 4). Each task will get 5 cores (-c 5). The --cpu-bind=verbose,nodes option indicates that each task will be bound to its own node (nodes), and that the actual CPU binding will be printed out (verbose).

However, there might be some weird behavior with CPU binding if your SLURM is configured differently from mine; sometimes it is very tricky. And Python's multiprocessing module does not seem to work well with SLURM's resource management, as indicated in your link.
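
One way to mitigate that is not to let Pool() guess, but to size it from the SLURM_CPUS_PER_TASK variable that SLURM exports when you pass -c. A sketch of what parallel_5_core_script.py might look like under the srun line above (the work function is just a placeholder):

import os
import multiprocessing

def hello(i):
    return "Hello World %d" % i

if __name__ == "__main__":
    # -c 5 makes SLURM export SLURM_CPUS_PER_TASK=5 for each task,
    # so the pool stays within the cores this task was actually granted
    n_cores = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
    with multiprocessing.Pool(n_cores) as pool:
        print(pool.map(hello, range(10)))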

Solution 3

dask is very useful in these situations. It has a nice interface with SLURM through dask-jobqueue. I highly recommend it.

https://docs.dask.org/en/latest/setup/hpc.html
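
A minimal sketch of how dask-jobqueue's SLURMCluster is typically used (the partition name, resources, and toy function here are placeholders, not from the question):

from dask_jobqueue import SLURMCluster
from dask.distributed import Client

def hello(i):
    return "Hello World %d" % i

if __name__ == "__main__":
    # Each SLURM job submitted by the cluster provides 5 cores / 4 GB here
    cluster = SLURMCluster(queue="your_partition", cores=5, memory="4GB")
    cluster.scale(jobs=4)                    # ask for 4 such jobs, possibly on 4 nodes

    client = Client(cluster)
    futures = client.map(hello, range(40))   # 40 independent tasks, as in the comments
    print(client.gather(futures))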


Comments

  • physicsGuy
    physicsGuy over 1 year

I am trying to run some parallel code on SLURM, where the different processes do not need to communicate. Naively I used Python's multiprocessing package. However, it seems that I am only using the CPUs on one node.

For example, if I have 4 nodes with 5 CPUs each, I will only run 5 processes at the same time. How can I tell multiprocessing to run on different nodes?

    The python code looks like the following

    import multiprocessing
    
    def hello():
        print("Hello World")
    
    pool = multiprocessing.Pool() 
    jobs = [] 
    for j in range(10):
        p = multiprocessing.Process(target = hello)
        jobs.append(p)
        p.start() 
    

    The problem is similar to this one, but there it has not been solved in detail.

  • physicsGuy
    physicsGuy over 7 years
Thanks for your answer. In reality I want to run about 40 processes in parallel, which is why I need to use different nodes. I will try your code later this evening.
  • physicsGuy
    physicsGuy over 7 years
I tried your approach. However, Python is only using a single node with the code above, while all other nodes stay unused. Is there an easy way around this?
  • CatDog
    CatDog about 6 years
-n 5 does not mean that you will have 5 cores. It just means that you are going to run 5 tasks.
  • nuKs
    nuKs over 2 years
Be wary: dask's "work stealing" is currently buggy for long-running processes on SLURM (this probably applies to all of dask.distributed), and has been for years, which makes this a rather unreliable option for that scenario.