Python multiprocessing Process crashes silently

Solution 1

What you really want is some way to pass exceptions up to the parent process, right? Then you can handle them however you want.

If you use concurrent.futures.ProcessPoolExecutor, this is automatic. If you use multiprocessing.Pool, it's trivial. If you use explicit Process and Queue, you have to do a bit of work, but it's not that much.
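
For instance, with multiprocessing.Pool, an AsyncResult's get method re-raises whatever the worker raised. A minimal sketch (the job function is just for illustration):

from multiprocessing import Pool

def job(i):
    # (code that does stuff)
    return 1 / 0 # Dumb error

if __name__ == '__main__':
    pool = Pool(processes=1)
    async_result = pool.apply_async(job, (1,))
    async_result.get() # the child's ZeroDivisionError is re-raised here, in the parent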

For the explicit Process and Queue case, for example:

def run(self):
    try:
        for i in iter(self.inputQueue.get, 'STOP'):
            # (code that does stuff)
            1 / 0 # Dumb error
            # (more code that does stuff)
            self.outputQueue.put(result)
    except Exception as e:
        self.outputQueue.put(e)

Then, your calling code can just read Exceptions off the queue like anything else. Instead of this:

yield outq.get()

do this:

result = outq.get()
if isinstance(result, Exception):
    raise result
yield result

(I don't know what your actual parent-process queue-reading code does, because your minimal sample just ignores the queue. But hopefully this explains the idea, even though your real code doesn't actually work like this.)

This assumes that you want to abort on any unhandled exception that makes it up to run. If you want to pass back the exception and continue on to the next i in iter, just move the try into the for, instead of around it.
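
A minimal sketch of that variant, with the same placeholder error as above:

def run(self):
    for i in iter(self.inputQueue.get, 'STOP'):
        try:
            # (code that does stuff)
            1 / 0 # Dumb error
            # (more code that does stuff)
            self.outputQueue.put(result)
        except Exception as e:
            # report this failure, then keep consuming the input queue
            self.outputQueue.put(e)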

This also assumes that Exceptions are not valid values. If that's an issue, the simplest solution is to just push (result, exception) tuples:

def run(self):
    try:
        for i in iter(self.inputQueue.get, 'STOP'):
            # (code that does stuff)
            1 / 0 # Dumb error
            # (more code that does stuff)
            self.outputQueue.put((result, None))
    except Exception as e:
        self.outputQueue.put((None, e))

Then, your reading code does this:

result, exception = outq.get()
if exception:
    raise exception
yield result

You may notice that this is similar to the node.js callback style, where you pass (err, result) to every callback. Yes, it's annoying, and it's easy to mess up code written in that style. But you're only using it in the wrapper; all of your "application-level" code that gets values off the queue or gets called inside run just sees normal returns/yields and raised exceptions.

You may even want to consider building a Future to the spec of concurrent.futures (or using that class as-is), even though you're doing your job queuing and executing manually. It's not that hard, and it gives you a very nice API, especially for debugging.
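
A minimal sketch of that idea, keeping the manual Process/Queue setup. Everything here (submit_job, drain_results, the (job_id, result, exception) wire format) is a hypothetical design, not part of concurrent.futures:

import concurrent.futures

futures = {} # job id -> Future, owned by the parent
next_id = 0

def submit_job(inq, args):
    # hand the worker a job id with its arguments; the worker is assumed
    # to echo the id back alongside its result or exception
    global next_id
    f = concurrent.futures.Future()
    futures[next_id] = f
    inq.put((next_id, args))
    next_id += 1
    return f

def drain_results(outq):
    # runs in the parent; resolves each Future from the worker's replies
    while futures:
        job_id, result, exception = outq.get()
        f = futures.pop(job_id)
        if exception is not None:
            f.set_exception(exception)
        else:
            f.set_result(result)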

Finally, it's worth noting that most code built around workers and queues can be made a lot simpler with an executor/pool design, even if you're absolutely sure you only want one worker per queue. Just scrap all the boilerplate, and turn the loop in the Worker.run method into a function (which just returns or raises as normal, instead of appending to a queue). On the calling side, again scrap all the boilerplate and just submit or map the job function with its parameters.

Your whole example can be reduced to:

def job(i):
    # (code that does stuff)
    1 / 0 # Dumb error
    # (more code that does stuff)
    return result

with concurrent.futures.ProcessPoolExecutor(max_workers=1) as executor:
    results = executor.map(job, range(10))

And it'll automatically handle exceptions properly.
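
One caveat: map returns its results lazily, so the worker's exception is only re-raised in the parent when you iterate over the results:

for result in results:
    print(result) # the child's ZeroDivisionError is re-raised here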


As you mentioned in the comments, the traceback for an exception doesn't trace back into the child process; it only goes as far as the manual raise result call (or, if you're using a pool or executor, the guts of the pool or executor).

The reason is that multiprocessing.Queue is built on top of pickle, and pickling exceptions doesn't pickle their tracebacks. And the reason for that is that you can't pickle tracebacks. And the reason for that is that tracebacks are full of references to the local execution context, so making them work in another process would be very hard.

So… what can you do about this? Don't go looking for a fully general solution. Instead, think about what you actually need. 90% of the time, what you want is "log the exception, with traceback, and continue" or "print the exception, with traceback, to stderr and exit(1) like the default unhandled-exception handler". For either of those, you don't need to pass an exception at all; just format it on the child side and pass a string over.

If you do need something fancier, work out exactly what you need, and pass just enough information to manually put that together. If you don't know how to format tracebacks and exceptions, see the traceback module; it's pretty simple. And this means you don't need to get into the pickle machinery at all. (Not that it's very hard to copyreg a pickler or write a holder class with a __reduce__ method, but if you don't need to, why learn all that?)
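
For example, a minimal sketch of the format-a-string approach, reusing the (result, exception) tuple protocol from above:

import traceback

def run(self):
    try:
        for i in iter(self.inputQueue.get, 'STOP'):
            # (code that does stuff)
            1 / 0 # Dumb error
            # (more code that does stuff)
            self.outputQueue.put((result, None))
    except Exception:
        # format the traceback here in the child, while it still exists
        self.outputQueue.put((None, traceback.format_exc()))

On the parent side, the exception slot is now a preformatted string, so you can log it or write it to stderr instead of raising it.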

Solution 2

I suggest this workaround for showing a process's exceptions:

from multiprocessing import Process
import sys
import traceback


run_old = Process.run

def run_new(*args, **kwargs):
    try:
        run_old(*args, **kwargs)
    except (KeyboardInterrupt, SystemExit):
        raise
    except:
        traceback.print_exc(file=sys.stdout)

Process.run = run_new
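
A minimal sketch of how you'd use it (the failing target function is just for illustration):

def job():
    1 / 0 # Dumb error

if __name__ == '__main__':
    p = Process(target=job)
    p.start()
    p.join() # the child prints the full traceback via the patched run

Note that a subclass which overrides run (like the Worker in the question) bypasses Process.run entirely, so this patch only takes effect for target-based processes or subclasses that call the base run.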

Solution 3

This is not an answer, just an extended comment. Please run this program and tell us what output (if any) you get:

from multiprocessing import Process, Queue

class Worker(Process):

    def __init__(self, inputQueue, outputQueue):

        super(Worker, self).__init__()

        self.inputQueue = inputQueue
        self.outputQueue = outputQueue

    def run(self):

        for i in iter(self.inputQueue.get, 'STOP'):

            # (code that does stuff)

            1 / 0 # Dumb error

            # (more code that does stuff)

            self.outputQueue.put(result)

if __name__ == '__main__':
    inq, outq = Queue(), Queue()
    inq.put(1)
    inq.put('STOP')
    w = Worker(inq, outq)
    w.start()

I get:

% test.py
Process Worker-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/unutbu/pybin/test.py", line 21, in run
    1 / 0 # Dumb error
ZeroDivisionError: integer division or modulo by zero

I'm surprised (if) you get nothing.


Comments

  • hendra over 1 year

    I'm using Python 2.7.3. I have parallelised some code using subclassed multiprocessing.Process objects. If there are no errors in the code in my subclassed Process objects, everything runs fine. But if there are errors, they apparently crash silently (no stacktrace printed to the parent shell) and CPU usage drops to zero. The parent code never crashes, giving the impression that execution is just hanging. This makes it really difficult to track down the error, because no indication is given as to where it occurred.

    I can't find any other questions on Stack Overflow that deal with the same problem.

    I guess the subclassed Process objects appear to crash silently because they can't print an error message to the parent's shell, but I would like to know what I can do about it so that I can at least debug more efficiently (and so that other users of my code can tell me when they run into problems too).

    EDIT: my actual code is too complex, but a trivial example of a subclassed Process object with an error in it would be something like this:

    from multiprocessing import Process, Queue
    
    class Worker(Process):
    
        def __init__(self, inputQueue, outputQueue):
    
            super(Worker, self).__init__()
    
            self.inputQueue = inputQueue
            self.outputQueue = outputQueue
    
        def run(self):
    
            for i in iter(self.inputQueue.get, 'STOP'):
    
                # (code that does stuff)
    
                1 / 0 # Dumb error
    
                # (more code that does stuff)
    
                self.outputQueue.put(result)
    
    • Blender about 11 years
      Can you post a minimal test case that illustrates this problem?
  • abarnert about 11 years
    I'd be surprised if he got nothing on POSIX in a shell. But on Windows, or in IDLE or PyDev, or if the parent process is a GUI app… I wouldn't be prepared to bet either way…
  • hendra about 11 years
    @unutbu I got nothing. Using 64-bit Windows and IDLE.
  • unutbu about 11 years
    @npo: Okay, and what happens if you run it from a console?
  • hendra about 11 years
    Thanks! This is great. But is there any way to print the entire stack trace? It tells me that there's an error now, and what it is, but not WHERE in the Worker class the error occurs.
  • abarnert about 11 years
    @npo: I'll add to the answer to explain that.
  • CMCDragonkai over 7 years
    How can this be applied to apply_async, which just uses a function that is meant to return some result to a callback? Do we just wrap the internals of the asynchronous function in a try/except and then return an exception object to the callback?
  • CloudyGloudy about 6 years
    simple, best answer