There are quite a few executors supported by Airflow. For example, the Kubernetes (k8s) operator and executor were added in Airflow 1.10, providing native Kubernetes execution support for Airflow. At Lyft, we leverage the CeleryExecutor to scale out Airflow task execution with different celery workers in production. Here is how we deploy Airflow in production at Lyft:

Configuration: Apache Airflow 1.8.2 with cherry-picks, and numerous in-house Lyft customized patches.

Scale: Three Amazon auto scaling groups (ASGs) for celery workers, each of which is associated with one celery queue (see the routing sketch below):

ASG #1: 15 worker nodes, each of the r5.4xlarge type. This fleet of workers processes low-priority, memory-intensive tasks.

ASG #2: 3 worker nodes, each of the m4.4xlarge type. This fleet of workers is dedicated to DAGs with a strict SLA.

ASG #3: 1 worker node of the m4.10xlarge type. This single node processes the compute-intensive workloads from a critical team's DAGs.

Numbers of DAGs / Tasks: 500+ DAGs, 800+ DagRuns, and 25,000+ TaskInstances running on the Airflow platform at Lyft daily.
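Each fleet's workers subscribe only to their own celery queue, and DAG authors pin a task to a queue through the operator-level "queue" argument. Below is a minimal sketch of that routing; the DAG, commands, and queue names are hypothetical, not the actual ones used at Lyft:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id='queue_routing_example',  # hypothetical example DAG
    start_date=datetime(2018, 1, 1),
    schedule_interval='@daily',
)

# With the CeleryExecutor, every operator accepts a `queue` argument.
# A worker started with `airflow worker -q low_priority` consumes only
# that queue, which is how a dedicated fleet like ASG #1 is isolated.
memory_heavy = BashOperator(
    task_id='memory_intensive_job',
    bash_command='python run_batch_job.py',  # hypothetical command
    queue='low_priority',  # e.g. served by the ASG #1 fleet
    dag=dag,
)

sla_critical = BashOperator(
    task_id='sla_critical_job',
    bash_command='python run_sla_job.py',  # hypothetical command
    queue='sla_critical',  # e.g. served by the ASG #2 fleet
    dag=dag,
)
```

Tasks that don't set a queue explicitly land on the default_queue configured under [celery] in airflow.cfg.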
Airflow Monitoring And Alerting

There are nearly five hundred DAGs running daily on Airflow, so it is crucial to maintain the SLA and uptime for the Airflow service. At Lyft, we leverage various technologies including Datadog, Statsd, Grafana, and PagerDuty to monitor the Airflow system.

[Figure: The overall system health dashboard for Airflow]
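Airflow can publish its internal scheduler and task metrics over StatsD (via the statsd_* options in airflow.cfg), and a Datadog agent can ingest that stream to drive dashboards and PagerDuty alerts. As a minimal sketch, assuming a StatsD daemon listening on localhost:8125 (the host, port, and metric names are illustrative), a custom metric can be pushed into the same pipeline with the statsd client library:

```python
import statsd

# Assumes a StatsD daemon (e.g. the Datadog agent's StatsD listener) on
# localhost:8125; the host, port, prefix, and metric names are hypothetical.
client = statsd.StatsClient(host='localhost', port=8125, prefix='airflow.custom')

# Count an event and record a latency; Datadog monitors (and PagerDuty
# escalations) can then be defined on top of these time series.
client.incr('dag.success')
client.timing('dag.landing_time_ms', 4200)
```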
Previously, we had a production issue at Lyft which caused Airflow not to schedule any task for an hour. At that time, we didn't have a good monitoring system to tell us whether Airflow was scheduling tasks at all. Hence we built the Airflow "canary" monitoring system, which treats Airflow as a black box and verifies that it schedules and executes tasks in a reasonable amount of time.
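The canary's code isn't shown in this post, but the idea can be sketched as a trivial DAG on a tight schedule: if the scheduler and workers are healthy, each run lands shortly after its interval closes, and an external monitor alerts when runs stop landing or land late. The DAG id, interval, and task below are assumptions:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def heartbeat():
    # Intentionally trivial: the signal is that the task was scheduled,
    # queued, and executed promptly -- not what it computes.
    print('canary heartbeat')


# Hypothetical canary DAG: one no-op task every five minutes.
dag = DAG(
    dag_id='canary',
    start_date=datetime(2018, 1, 1),
    schedule_interval=timedelta(minutes=5),
    catchup=False,
)

PythonOperator(task_id='heartbeat', python_callable=heartbeat, dag=dag)
```

Alerting can then key off the canary's task landing time (the lag between the end of a schedule interval and task completion): if it crosses a threshold, or the heartbeat disappears entirely, a page fires.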