Which one to choose: Apache Oozie or Apache Airflow? Need a comparison

In my experience, Airflow is the best data pipeline orchestrator available right now. It is best suited to managing complex, long-running workflows, and its UI and modularity are top-notch.

Airflow

  • + Python Code for DAGs
  • + Has connectors for every major service/cloud provider
  • + More versatile
  • + Advanced metrics
  • + Better UI and API
  • + Capable of creating extremely complex workflows
  • + Jinja Templating
  • + Can be used as an orchestrator for the TensorFlow Extended ecosystem
  • = Can be parallelized
  • = Native connections to HDFS, Hive, Pig, etc.
  • = Graph as DAG

Oozie

  • --- Java or XML for DAGs
  • - Hard to build complex pipelines
  • - Smaller, less active community
  • - Worse web GUI
  • - Java API
  • = Can be parallelized
  • = Native connections to HDFS, Hive, Pig, etc.
  • = Graph as DAG
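
Both lists share "Graph as DAG" and "Can be parallelized". As a rough illustration of what that common ground means, here is a minimal sketch that uses only Python's standard library (`graphlib`, Python 3.9+), not the API of Airflow or Oozie. The task names and the `parallel_stages` helper are hypothetical, made up for this example.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_users": set(),
    "join": {"extract_orders", "extract_users"},
    "report": {"join"},
}


def parallel_stages(dag):
    """Group tasks into stages; tasks in the same stage have all their
    dependencies satisfied and could therefore run concurrently."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        stages.append(ready)
        sorter.done(*ready)  # mark the whole stage as finished
    return stages


stages = parallel_stages(pipeline)
print(stages)
# [['extract_orders', 'extract_users'], ['join'], ['report']]
```

The two extract tasks land in the same stage because neither depends on the other; a real scheduler adds retries, scheduling, and distributed workers on top of this ordering.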

As you can see, Airflow is easier to use (especially in a large, heterogeneous team), more versatile, and more powerful than Oozie.

As I said: go with Airflow.

Author: Vishal786btc

Data Scientist in this life. I code for data. For Fun: https://www.youtube.com/channel/UCQk0GDiOtiwPoHs0uPfkxfQ/

Updated on June 04, 2022

Comments

  • Vishal786btc (almost 2 years ago)

    I am new to job schedulers and was looking for one to run jobs on a big data cluster. I was quite confused by the available choices. I found Oozie to have many limitations compared with existing schedulers such as TWS, Autosys, etc.

    Need some comparison points on Oozie vs. Airflow.

    Appreciate your help.

  • Michele 'Ubik' De Simoni (almost 6 years ago)
    Another point for Airflow: Google now offers a fully managed version of Airflow, distributed using Kubernetes, via their new product, Composer.
  • Stanislav Trifan (almost 6 years ago)
    This looks like an advertising response to me. Is Java really a '-'? What about Groovy, JRuby, Jython, and other JVM-based languages? To me that looks better than Python only, although Python is a nice language. I can agree that it looks a little outdated, but I see no point in raising that, since for a business it should not matter.
  • Michele 'Ubik' De Simoni (almost 6 years ago)
    If any other cloud provider steps up and offers something similar, I will update the comment; not having to manage your own distributed clusters simplifies things by a long shot. Python is unequivocally easier for people to pick up, easier to read, and less verbose to write, but its real strength is direct access to the most widely used data science libraries. I am not saying that Java is inferior to Python; however, in this specific use case Python does make things easier.
  • ChikuMiku (over 5 years ago)
    I use Oozie more for data engineering/science projects on Hadoop/Spark. For Python, we can use a bash script as a shell action in Oozie and then let bash do all the Python stuff. :)
  • attila_s (over 4 years ago)
    I'm not that familiar with Airflow, but I can add a few more things to consider:
    - Have you seen the Fluent API of Oozie? It can be used to build complex pipelines.
    - You can use HUE as a web UI: github.com/cloudera/hue
    - Do you need to handle time zones?
    - How do you create Oozie-like bundles?
    - How do you implement HA for the Airflow scheduler? Is it a single point of failure?
    - Oozie is used by many companies for large-scale data processing.
    - Oozie was designed for Hadoop. What about delegation tokens in Airflow?
    - What about SLAs for coordinators and workflows?