Tuesday, April 9, 2019

Bitnami Apache Airflow Multi-Tier Now Available in Azure Marketplace

Originally published on the Azure blog on April 9th, 2019.

A few months ago, we published a blog post with guidance on deploying Apache Airflow on Azure. The template it provided was a good quick-start solution for anyone looking to run Apache Airflow on Azure in sequential executor mode for testing and proofs of concept. However, the template was not designed for enterprise production deployments, and running it in Celery Executor mode required expert knowledge of Azure App Service and container deployments. To simplify production-grade deployments of Airflow on Azure for customers, we partnered with Bitnami.

We are excited to announce that the Bitnami Apache Airflow Multi-Tier solution and the Apache Airflow Container are now available for customers in the Azure Marketplace. To see how easy it is to launch and start using them, check out the quick video tutorial below:



We are proud to say that the main committers to the Apache Airflow project have also tested this application to ensure that it performs to the standards they would expect.

Apache Airflow PMC Member and Core Committer Kaxil Naik said, “I am excited to see that Bitnami provided an Airflow Multi-Tier in the Azure Marketplace. Bitnami has removed the complexity of deploying the application for data scientists and data engineers, so they can focus on building the actual workflows or DAGs instead. Now, data scientists can create a cluster for themselves within about 20 minutes. They no longer need to wait for DevOps or a data engineer to provision one for them.”

What is Apache Airflow?

Apache Airflow is a popular open-source workflow management tool used to orchestrate ETL pipelines, machine learning workflows, and many other creative use cases. It provides a scalable, distributed architecture that makes it simple to author, track, and monitor workflows.

Users of Airflow create Directed Acyclic Graph (DAG) files to define the processes and tasks that must be executed, the order in which they run, and their relationships and dependencies. DAG files are synchronized across nodes, and the user then uses the UI or automation to schedule, execute, and monitor workflows.
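
To make the ordering idea concrete, here is a minimal plain-Python sketch of how a DAG of tasks determines a valid execution order. It deliberately does not use Airflow's own API; it relies only on the standard library's graphlib module (Python 3.9 or later), and the task names are hypothetical.

```python
import graphlib

# Hypothetical workflow: each task maps to the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join": {"extract_orders", "extract_customers"},
    "load": {"join"},
}


def execution_order(deps):
    """Return one valid run order for the tasks in deps.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    which is exactly what the "acyclic" in DAG rules out.
    """
    return list(graphlib.TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

The two extract tasks have no dependency on each other, so either may come first; a scheduler is free to run such tasks in parallel on different workers.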

Introduction to Apache Airflow Architecture

Bitnami Apache Airflow has a multi-tier distributed architecture that uses Celery Executor, which is recommended by Apache Airflow for production environments.

It consists of several synchronized nodes:

● Web server (UI)
● Scheduler
● Workers

It includes two managed Azure services:

● Azure Database for PostgreSQL
● Azure Cache for Redis

All nodes have a shared volume to synchronize DAG files.

DAG files are stored in a directory on each node. This directory is an external volume mounted at the same location on all nodes (workers, scheduler, and web server). Because it is a shared volume, the files are automatically synchronized between servers. Add, modify, or delete DAG files on this shared volume and the entire Airflow system is updated.
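
If you want to confirm that every node really sees the same DAG files, one simple approach is to compute a digest of the DAG directory on each node and compare the results. The sketch below is only an illustration: the function name is hypothetical, it is not part of the Bitnami solution, and the location of the DAG directory is something you would need to look up for your deployment.

```python
import hashlib
from pathlib import Path


def dag_folder_digest(dag_dir):
    """Return a SHA-256 digest over the relative path and contents
    of every .py file under dag_dir (searched recursively)."""
    digest = hashlib.sha256()
    root = Path(dag_dir)
    # Sort so the digest does not depend on directory listing order.
    for path in sorted(root.rglob("*.py")):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

Running the function against the mounted directory on the web server, the scheduler, and a worker should yield the same hexadecimal string when the shared volume is in sync.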

You can also use DAGs from a GitHub repository. With Git, you don’t need to access any of the Airflow nodes; you just push your changes to the repository.

To automatically synchronize DAG files with Airflow, please refer to Bitnami’s documentation.

Bitnami’s Secret Sauce - Packaging for Production Use

Bitnami specializes in packaging multi-tier applications to work right out of the box, leveraging managed Azure services such as Azure Database for PostgreSQL.

When packaging the Apache Airflow Multi-Tier solution, Bitnami added a few optimizations to ensure that it would work for production needs.

● Pre-packaged to leverage the most popular deployment strategies, for example PostgreSQL as the relational metadata store and the Celery executor.
● Role-based access control is enabled by default to secure access to the UI.
● The cache and the metadata store are Azure-native PaaS services. This brings the additional benefits those services offer, such as data redundancy and retention/recovery options, and allows Airflow to scale out to large jobs.
● All communication between Airflow nodes and the PostgreSQL database service is secured using SSL.
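
For readers curious what such a setup looks like at the configuration level, the following is a rough sketch of the settings involved when Airflow runs with the Celery executor, a PostgreSQL metadata store reached over SSL, and a Redis broker. It is not Bitnami's actual configuration. The function and its parameters are hypothetical, the setting names are assumed to be those of the Airflow 1.10 series (expressed through Airflow's AIRFLOW__SECTION__KEY environment variable convention), and the ports and URL schemes are assumptions you should check against your own services.

```python
from urllib.parse import quote


def airflow_celery_env(pg_host, pg_user, pg_password, pg_db,
                       redis_host, redis_password):
    """Build illustrative Airflow settings for a CeleryExecutor setup.

    All argument values are supplied by the caller; nothing here is
    read from a real deployment.
    """
    # Percent-encode credentials so characters like "@" stay valid in a URL.
    user = quote(pg_user, safe="")
    password = quote(pg_password, safe="")
    # sslmode=require asks the PostgreSQL client to use an encrypted connection.
    db_url = (
        f"postgresql://{user}:{password}@{pg_host}:5432/{pg_db}"
        "?sslmode=require"
    )
    # Plain redis:// scheme and port 6379 are used for illustration only;
    # whether TLS is required depends on how the cache is configured.
    broker_url = f"redis://:{quote(redis_password, safe='')}@{redis_host}:6379/0"
    return {
        "AIRFLOW__CORE__EXECUTOR": "CeleryExecutor",
        "AIRFLOW__CORE__SQL_ALCHEMY_CONN": db_url,
        "AIRFLOW__CELERY__BROKER_URL": broker_url,
        # Celery's database result backend uses a "db+" prefixed URL.
        "AIRFLOW__CELERY__RESULT_BACKEND": "db+" + db_url,
    }
```

Calling the function with placeholder hosts and credentials returns a dictionary that could be exported as environment variables on each Airflow node, so the web server, scheduler, and workers all point at the same database and broker.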

To learn more, join Azure, Apache Airflow, and Bitnami for a webinar on Wednesday, May 1st at 11:00 am PST - Register Now.

Get Started with Apache Airflow Multi-Tier Certified by Bitnami Today!