Celery: WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL)

Solution 1

The SIGKILL your worker received was initiated by another process. Your supervisord config looks fine, and killasgroup would only affect a supervisor-initiated kill (e.g. via supervisorctl or a plugin); without that setting, supervisor would have sent the signal to the dispatcher anyway, not to the child.

Most likely you have a memory leak and the OS's OOM killer is assassinating your process for bad behavior.

Check the kernel log for OOM-killer activity:

    grep oom /var/log/messages

If you see matches, that's your problem.

If you don't find anything, try running the periodic task manually in a shell:

MyPeriodicTask().run()

And see what happens. I'd monitor system and process metrics from top in another terminal, if you don't have good instrumentation like Cacti, Ganglia, etc. for this host.
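
If top alone isn't enough, you can also watch memory from inside the process while the task runs; a steadily climbing peak RSS points to a leak. Here is a minimal sketch using only the standard library; it assumes MyPeriodicTask from above is importable in your shell, and the sampling interval is arbitrary:

    import resource
    import threading
    import time

    def log_memory(interval=5):
        """Print this process's peak RSS every few seconds (on Linux, ru_maxrss is in kilobytes)."""
        while True:
            peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            print("peak RSS: %.1f MB" % (peak_kb / 1024.0))
            time.sleep(interval)

    # Run the sampler in a background thread, then run the suspect task in the foreground.
    sampler = threading.Thread(target=log_memory)
    sampler.daemon = True   # don't keep the interpreter alive after the task finishes
    sampler.start()

    MyPeriodicTask().run()

If the leak turns out to be in the task itself and isn't easy to fix, setting CELERYD_MAX_TASKS_PER_CHILD (worker_max_tasks_per_child in newer Celery versions) makes the pool recycle each child after a fixed number of tasks, which at least keeps a slow leak from growing without bound.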

Solution 2

You see this kind of error when an asynchronous task (run through Celery), or the script you are using, accumulates a lot of data in memory and effectively leaks it.

In my case, I was fetching data from another system and storing it in a variable, so that I could export it all (into a Django model / Excel file) after the whole process finished.

Here is the catch: my script was gathering 10 million records, and it leaked memory while accumulating them, which eventually raised this exception.

To overcome the issue, I split the 10 million records into 20 parts of half a million each. Every time the in-memory list reached 500,000 items, I wrote it out to my local file / Django model of choice, cleared it, and moved on to the next batch.

You don't need that exact number of partitions; the idea is to solve a complex problem by splitting it into multiple subproblems and solving them one by one. :D
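
As a rough sketch of that batching idea (the chunk size, the source iterator, and the Django model name are placeholders for whatever your task actually uses; bulk_create is just one way to persist a chunk):

    CHUNK_SIZE = 500000  # flush after this many items instead of holding all 10 million in memory

    def export_in_chunks(source, model):
        """Consume `source` lazily and persist every CHUNK_SIZE items, then release them."""
        buffer = []
        for item in source:                        # `source` should be a generator, not a pre-built list
            buffer.append(model(**item))           # each item is assumed to be a dict of field values
            if len(buffer) >= CHUNK_SIZE:
                model.objects.bulk_create(buffer)  # one batched INSERT per chunk
                buffer = []                        # drop the references so memory stays bounded
        if buffer:
            model.objects.bulk_create(buffer)      # persist the final partial chunk

You would call it as something like export_in_chunks(fetch_rows(), ExportedRecord), where fetch_rows is a generator over the remote system's data and ExportedRecord is your Django model (both names are hypothetical).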

Comments

  • daveoncode
    daveoncode almost 2 years

    I use Celery with RabbitMQ in my Django app (on Elastic Beanstalk) to manage background tasks, and I daemonized it using Supervisor. The problem now is that one of the periodic tasks I defined is failing (after a week in which it worked properly); the error I get is:

    [01/Apr/2014 23:04:03] [ERROR] [celery.worker.job:272] Task clean-dead-sessions[1bfb5a0a-7914-4623-8b5b-35fc68443d2e] raised unexpected: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
    Traceback (most recent call last):
      File "/opt/python/run/venv/lib/python2.7/site-packages/billiard/pool.py", line 1168, in mark_as_worker_lost
        human_status(exitcode)),
    WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
    

    All the processes managed by supervisor are up and running properly (supervisorctl status says RUNNING).

    I tried to read several logs on my EC2 instance, but none of them helped me find out what is causing the SIGKILL. What should I do? How can I investigate?

    These are my celery settings:

    CELERY_TIMEZONE = 'UTC'
    CELERY_TASK_SERIALIZER = 'json'
    CELERY_ACCEPT_CONTENT = ['json']
    BROKER_URL = os.environ['RABBITMQ_URL']
    CELERY_IGNORE_RESULT = True
    CELERY_DISABLE_RATE_LIMITS = False
    CELERYD_HIJACK_ROOT_LOGGER = False
    

    And this is my supervisord.conf:

    [program:celery_worker]
    environment=$env_variables
    directory=/opt/python/current/app
    command=/opt/python/run/venv/bin/celery worker -A com.cygora -l info --pidfile=/opt/python/run/celery_worker.pid
    startsecs=10
    stopwaitsecs=60
    stopasgroup=true
    killasgroup=true
    autostart=true
    autorestart=true
    stdout_logfile=/opt/python/log/celery_worker.stdout.log
    stdout_logfile_maxbytes=5MB
    stdout_logfile_backups=10
    stderr_logfile=/opt/python/log/celery_worker.stderr.log
    stderr_logfile_maxbytes=5MB
    stderr_logfile_backups=10
    numprocs=1
    
    [program:celery_beat]
    environment=$env_variables
    directory=/opt/python/current/app
    command=/opt/python/run/venv/bin/celery beat -A com.cygora -l info --pidfile=/opt/python/run/celery_beat.pid --schedule=/opt/python/run/celery_beat_schedule
    startsecs=10
    stopwaitsecs=300
    stopasgroup=true
    killasgroup=true
    autostart=false
    autorestart=true
    stdout_logfile=/opt/python/log/celery_beat.stdout.log
    stdout_logfile_maxbytes=5MB
    stdout_logfile_backups=10
    stderr_logfile=/opt/python/log/celery_beat.stderr.log
    stderr_logfile_maxbytes=5MB
    stderr_logfile_backups=10
    numprocs=1
    

    Edit 1

    After restarting celery beat, the problem remains.

    Edit 2

    Changed killasgroup=true to killasgroup=false and the problem remains.

  • daveoncode
    daveoncode about 10 years
    You are right: "celery invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0" ... now I have to find out why this happens, because previously it was running as expected :P Thank you very much!
  • Nino Walker
    Nino Walker about 10 years
    @daveoncode I think Lewis Carroll once wrote, "Beware the oom-killer, my son! The jaws that bite, the claws that catch!"
  • Rick Mohr
    Rick Mohr over 9 years
    On my Ubuntu box the log to check is /var/log/kern.log, not /var/log/messages
  • dashesy
    dashesy almost 9 years
    On my Ubuntu box it is /var/log/syslog (so much for consistency)
  • Buddhi741
    Buddhi741 over 6 years
    @daveoncode How did you go about finding out why that happens? I'm stuck in a similar position, and the problem is that it's happening for only one task, while according to Compute Engine everything about CPU usage and memory seems fine.
  • varnothing
    varnothing over 5 years
    I figured out it was a memory issue.
  • J Selecta
    J Selecta almost 5 years
    I was running Celery workers on ECS with too little RAM per task and I also saw the OOM killer terminating processes. So it's not always a memory leak; it can also simply be a matter of not having enough RAM.
  • Krishna
    Krishna almost 3 years
    @JSelecta Thanks, that really helped. I had the same issue on my server: I was running multiple containers on a 1 GB RAM server, and the Celery worker needed 400 MB for a specific task. When I upgraded to 2 GB, it worked fine.