2 years ago

#69052

Raquel

Airflow DAG Dependencies / Orchestration between different workflows

My team is working on orchestrating our data pipeline with Airflow. Since our pipeline steps are complex, we were thinking about having different DAGs / workflows, each defined in its own file. Each workflow can trigger more than one downstream workflow, and each workflow writes its output data to an S3 bucket at the end of execution. Broadly, it looks like the following options for orchestration between DAGs are available:

  • Using TriggerDagRunOperator at the end of each workflow to decide which downstream workflows to trigger (a minimal sketch follows this list)
  • Using ExternalTaskSensor at the beginning of each workflow so it only runs once the last task of the parent workflow has completed successfully (sketch after the list)
  • Using SubDagOperator and orchestrating all workflows inside a primary DAG, in the same way individual tasks would be orchestrated; child workflows would be created by a factory. It looks like this operator is being deprecated because of performance and functional issues
  • Using TaskGroups: this looks like the successor of the SubDagOperator. It groups tasks in the UI, but the tasks still belong to a single DAG, so we would not have multiple DAGs. Since DAGs are coded in Python, we could still put each TaskGroup in an independent file, as mentioned here: https://stackoverflow.com/a/68394010 (sketch after the list)
  • Using S3KeySensor to trigger a workflow once a file has been added to the S3 bucket, since all our workflows end up storing a dataset in S3 (sketch after the list)
  • Using S3 Events to trigger a Lambda whenever a file is added to S3; that Lambda then starts the downstream workflows through the Airflow / MWAA API
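
For reference, a minimal sketch of the TriggerDagRunOperator option, assuming a parent DAG called parent_dag and a downstream DAG called child_dag (both names are made up):

```python
# parent_dag.py -- the last step of the parent decides which children to start
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="parent_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # One trigger task per downstream workflow; add more as needed.
    trigger_child = TriggerDagRunOperator(
        task_id="trigger_child_dag",
        trigger_dag_id="child_dag",           # dag_id of the downstream DAG
        wait_for_completion=False,            # fire-and-forget
        conf={"triggered_by": "parent_dag"},  # optional payload for the child
    )
```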
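
A sketch of the ExternalTaskSensor option, waiting on a hypothetical final task publish_to_s3 in parent_dag. Reschedule mode frees the worker slot between pokes, which mitigates (but does not remove) the queueing overhead mentioned below:

```python
# child_dag.py -- waits for the parent's last task before running its own work
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="child_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    wait_for_parent = ExternalTaskSensor(
        task_id="wait_for_parent",
        external_dag_id="parent_dag",
        external_task_id="publish_to_s3",  # hypothetical last task of the parent
        mode="reschedule",                 # release the worker slot between pokes
        poke_interval=300,
        timeout=6 * 60 * 60,
        # Note: both DAGs must share the same logical date, otherwise
        # execution_delta / execution_date_fn is needed to align them.
    )

    do_work = PythonOperator(task_id="do_work", python_callable=lambda: None)

    wait_for_parent >> do_work
```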
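
A sketch of the TaskGroup-per-file idea from the linked answer; the module path, group name and task names are invented for illustration:

```python
# workflows/ingest.py -- hypothetical module, one TaskGroup factory per file
from airflow.operators.python import PythonOperator
from airflow.utils.task_group import TaskGroup


def build_ingest_group(group_id: str = "ingest") -> TaskGroup:
    """Factory returning a TaskGroup; call it inside a `with DAG(...)` block."""
    with TaskGroup(group_id=group_id) as tg:
        extract = PythonOperator(task_id="extract", python_callable=lambda: None)
        load = PythonOperator(task_id="load", python_callable=lambda: None)
        extract >> load
    return tg


# primary_dag.py -- wires the groups together (shown here as comments)
# from datetime import datetime
# from airflow import DAG
# from workflows.ingest import build_ingest_group
#
# with DAG(dag_id="primary_dag", start_date=datetime(2021, 1, 1),
#          schedule_interval="@daily", catchup=False) as dag:
#     ingest = build_ingest_group()
#     # downstream groups would be chained here: ingest >> transform >> ...
```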
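
And a sketch of the S3KeySensor option (requires the Amazon provider package; bucket and key names are made up):

```python
# child_dag_s3.py -- starts once the parent's output file lands in S3
from datetime import datetime

from airflow import DAG
# Older Amazon provider versions expose this as ...aws.sensors.s3_key instead.
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

with DAG(
    dag_id="child_dag_s3",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    wait_for_parent_output = S3KeySensor(
        task_id="wait_for_parent_output",
        bucket_name="my-pipeline-bucket",                 # made-up bucket
        bucket_key="parent_dag/{{ ds }}/output.parquet",  # made-up key, templated per run
        wildcard_match=False,
        aws_conn_id="aws_default",
        mode="reschedule",
        poke_interval=300,
        timeout=6 * 60 * 60,
    )
```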

From my research, it looks like:

  • ExternalTaskSensor and S3 sensors add overhead to the workflow, since the sensor tasks have to be rescheduled constantly, creating a large task queue and significant delays
  • Using S3 Events to trigger a Lambda that in turn starts child workflows would not have this issue; however, we would need to keep configuration inside these Lambdas about which child workflows to trigger, and it adds extra components that complicate our architecture (see the sketch after this list)
  • SubDagOperator seemed the way to go, since we could have several DAGs and manage all the dependencies between them in a primary DAG in a single Python file; however, it is being deprecated because of its functional and performance issues
  • TaskGroups are the SubDagOperator's successor, but since all tasks still belong to the same DAG, I have concerns about how to perform operations such as backfilling an individual TaskGroup, rerunning individual TaskGroups, or maybe scheduling them with different intervals in the future
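
Regarding the Lambda variant on MWAA, the usual pattern is for the Lambda to call the MWAA CLI endpoint (the TriggerDagRunOperator itself only runs inside a DAG). A rough, hypothetical sketch; the environment name and the prefix-to-DAG mapping are invented, and that mapping is exactly the extra configuration mentioned above:

```python
# lambda_trigger_dag.py -- hypothetical Lambda handler for S3 ObjectCreated events
import http.client

import boto3

MWAA_ENV_NAME = "my-mwaa-environment"   # assumption
PREFIX_TO_DAG = {                       # assumption: which child DAG to trigger
    "parent_dag/": "child_dag",         # per S3 key prefix
}


def handler(event, context):
    key = event["Records"][0]["s3"]["object"]["key"]
    dag_id = next((d for p, d in PREFIX_TO_DAG.items() if key.startswith(p)), None)
    if dag_id is None:
        return {"status": "ignored", "key": key}

    # MWAA exposes an authenticated CLI endpoint; a short-lived token is
    # obtained with boto3 and a plain "dags trigger <dag_id>" is posted to it.
    token = boto3.client("mwaa").create_cli_token(Name=MWAA_ENV_NAME)
    conn = http.client.HTTPSConnection(token["WebServerHostname"])
    conn.request(
        "POST",
        "/aws_mwaa/cli",
        body=f"dags trigger {dag_id}",
        headers={
            "Authorization": f"Bearer {token['CliToken']}",
            "Content-Type": "text/plain",
        },
    )
    return {"status": conn.getresponse().status, "dag": dag_id}
```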

If anyone has experience with any of these approaches and could share some insights it would be greatly appreciated.

Thanks!

Tags: airflow, airflow-scheduler, airflow-2.x, mwaa

0 Answers
