Airflow + celery or dask. For what, when?

16,203

Solution 1

In Airflow terminology an "Executor" is the component responsible for running your task. The LocalExecutor does this by spawning threads on the computer Airflow runs on and lets the thread execute the task.

Naturally your capacity is then limited by the available resources on the local machine. The CeleryExecutor distributes the load to several machines. The executor itself publishes a request to execute a task to a queue, and one of several worker nodes picks up the request and executes it. You can now scale the cluster of worker nodes to increase overall capacity.

Finally, and not ready yet, there's a KubernetesExecutor in the works (link). This will run tasks on a Kubernetes cluster. This will not only give your tasks complete isolation since they're run in containers, you can also leverage the existing capabilities in Kubernetes to for instance auto scale your cluster so that you always have an optimal amount of resources available.

Solution 2

You may enjoy reading this comparison of dask to celery/airflow task managers http://matthewrocklin.com/blog/work/2016/09/13/dask-and-celery

Since you are not asking a specific question, general reading like that should be informative, and maybe you can clarify what you are after.

-EDIT-

Some people coming to this more recently may wish to look into prefect, which is a sort of rewritten airflow with dask in mind (comes in open-source core with paid enterprise features).

Share:
16,203

Related videos on Youtube

Amelio Vazquez-Reina
Author by

Amelio Vazquez-Reina

I'm passionate about people, technology and research. Some of my favorite quotes: "Far better an approximate answer to the right question than an exact answer to the wrong question" -- J. Tukey, 1962. "Your title makes you a manager, your people make you a leader" -- Donna Dubinsky, quoted in "Trillion Dollar Coach", 2019.

Updated on July 28, 2022

Comments

  • Amelio Vazquez-Reina
    Amelio Vazquez-Reina almost 2 years

    I read in the official Airflow documentation the following:

    enter image description here

    What does this mean exactly? What do the authors mean by scaling out? That is, when is it not enough to use Airflow or when would anyone use Airflow in combination with something like Celery? (same for dask)

  • Alan
    Alan over 5 years
    For LocalExecutor, tasks are executed as subprocess: ...If it happens to be the LocalExecutor, tasks will be executed as subprocesses; in the case of CeleryExecutor and MesosExecutor, tasks are executed remotely...
  • gogstad
    gogstad over 4 years
    An undercommunicated feature of SO is that it's a wiki (at least in some sense). You're absolutely right, please feel free to edit the original answer.