GCP – Apache Airflow ETL in Google Cloud
Are you thinking about running Apache Airflow on Google Cloud? That’s a popular choice for running a complex set of tasks, such as Extract, Transform, and Load (ETL) or data analytics pipelines. Apache Airflow uses a Directed Acyclic Graph (DAG) to order and relate multiple tasks for your workflows, including setting a schedule to run the desired task at a set time, providing a powerful way to perform scheduling and dependency graphing.
So what are the different ways to run Apache Airflow on Google Cloud? The wrong choice could reduce availability or increase costs — the infrastructure could fail, or you may need to create many environments, such as dev, staging, and prod. In this post, we’ll look at three ways to run Apache Airflow on Google Cloud and discuss the pros and cons of each approach. For each approach, we provide Terraform code that you can find on GitHub, so you can try it out for yourself.
Note: The Terraform code used in this article follows a directory structure. The files under modules are no different in format from the default code provided by Terraform. If you’re a developer, think of the modules directory as a kind of library. The main.tf file is where the actual business code goes. Imagine you’re doing development: start with main.tf and put the code we use in common in directories like modules, library, etc.
Let’s look at our three ways to run Apache Airflow.
1: Compute Engine
A common way to run Airflow on Google Cloud is to install and run Airflow directly on a Compute Engine VM instance. The advantages of this approach:
It’s cheaper than the others.
It only requires an understanding of virtual machines.
On the other hand, there are also disadvantages:
You have to maintain the virtual machine.
It’s less available, because a single VM is a single point of failure.
The disadvantages can be substantial, but if you’re thinking about adopting Airflow, you can use Compute Engine to do a quick proof of concept.
First, create a Compute Engine instance with the following Terraform code (for brevity, some of the code has been omitted). The allow block is a firewall setting. Port 8080 is the default port used by the Airflow webserver, so it should be open. Feel free to change the other settings.
```hcl
# main.tf
module "gcp_compute_engine" {
  source       = "./modules/google_compute_engine"
  service_name = local.service_name

  region       = local.region
  zone         = local.zone
  machine_type = "e2-standard-4"
  allow = {
    ...
    2 = {
      protocol = "tcp"
      ports    = ["22", "8080"]
    }
  }
}
```
The google_compute_engine directory, which main.tf above references as its source, contains the following code, which takes the values we passed in earlier and actually creates the instance. Notice how it takes in machine_type.
```hcl
# modules/google_compute_engine/google_compute_instance.tf
resource "google_compute_instance" "default" {
  name         = var.service_name
  machine_type = var.machine_type
  zone         = var.zone
  ...
}
```
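For the module to accept those inputs, each one has to be declared as a variable inside the module. The article does not show that file, so the following is only a minimal sketch of what it might look like; the file name, descriptions, and default value are assumptions, not the author’s code.

```hcl
# modules/google_compute_engine/variables.tf (hypothetical sketch, not from the original repository)

variable "service_name" {
  description = "Name given to the Compute Engine instance"
  type        = string
}

variable "machine_type" {
  description = "Compute Engine machine type, for example e2-standard-4"
  type        = string
  default     = "e2-standard-4" # assumed default; a value passed from main.tf takes precedence
}

variable "zone" {
  description = "Zone in which the instance is created"
  type        = string
}
```

Declaring a type on each variable lets Terraform reject a mistyped input at plan time rather than partway through an apply.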
Run the code you wrote above with Terraform:
```shell
$ terraform apply
```
Wait for a few moments and an instance will be created on Compute Engine. Next, you’ll need to connect to the instance and install Airflow — see the official documentation for instructions. Once installed, run Airflow.
You can now access Airflow through your browser! If you plan to run Airflow on Compute Engine, you’ll need to be extra careful with your firewall settings. Even if the password is compromised, only authorized users should be able to access it. Since this is a demo, we’ve made it accessible with minimal firewall settings.
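One way to tighten access is to restrict the firewall rule to known source addresses. The following is a minimal sketch, assuming a standalone google_compute_firewall resource rather than the allow map used by the article’s module; the rule name, network, network tag, and CIDR range are all placeholders.

```hcl
# Hypothetical sketch: allow the Airflow web UI only from one trusted address range.
resource "google_compute_firewall" "airflow_web" {
  name    = "airflow-web-restricted" # placeholder name
  network = "default"                # assumes the default VPC network

  allow {
    protocol = "tcp"
    ports    = ["8080"] # Airflow webserver port
  }

  # Only traffic from this range is allowed (203.0.113.0/24 is a documentation range).
  source_ranges = ["203.0.113.0/24"]

  # Applies only to instances carrying this network tag (assumed tag).
  target_tags = ["airflow"]
}
```

For a rule like this to have any effect, the instance would need the matching network tag, and any broader rule that opens port 8080 to all addresses would have to be removed.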
After logging in, you should see a screen like the one below. You’ll also see a sample DAG provided by Airflow. Take a look around the screen.
2: GKE Autopilot
The second way to run Apache Airflow on Google Cloud is with Kubernetes, made very easy with Google Kubernetes Engine (GKE), Google’s managed Kubernetes service. You can also use GKE’s Autopilot mode of operation, which will help you avoid running out of compute resources and automatically scale your cluster based on your needs. GKE Autopilot is serverless, so you don’t have to manage your own Kubernetes nodes.
GKE Autopilot offers high availability and scalability. You can also leverage the powerful Kubernetes ecosystem. For example, you can use the kubectl command for fine-grained control of workloads and monitor them alongside other business services in your cluster. However, if you’re not very familiar with Kubernetes, you may end up spending a lot of time managing Kubernetes instead of focusing on Airflow with this approach.
All right, so we’re going to create a GKE Autopilot cluster first. The Terraform module does the minimal setup for us:
```hcl
# main.tf
module "google_kubernetes_engine" {
  source       = "./modules/google_kubernetes_engine"
  project_id   = var.project_id
  service_name = local.service_name
  region       = local.region
  network_id   = module.google_compute_engine.network_id
}
```
The modules/google_kubernetes_engine.tf file is organized as shown below. Note that enable_autopilot is set to true, and there is also code for creating networks. You can check out the full code on GitHub.
```hcl
# modules/google_kubernetes_engine.tf
resource "google_container_cluster" "this" {
  project          = var.project_id
  name             = "${var.service_name}-gke-cluster"
  location         = var.region
  enable_autopilot = true
  network          = var.google_compute_network_id
  ip_allocation_policy {}
}
```
Wow, we’re done already. Next, run the code to create a GKE Autopilot cluster:
```shell
$ terraform apply
```
Next, you’ll need to configure cluster access so that you can check the status of GKE Autopilot using the kubectl command. Refer to the official documentation for the relevant configuration.
Now deploy Airflow via Helm to the created GKE Autopilot cluster:
```hcl
# helm_main.tf
resource "helm_release" "airflow" {
  name             = "airflow"
  repository       = "https://airflow.apache.org"
  chart            = "airflow"
  version          = "1.9.0"
  namespace        = "airflow"
  create_namespace = true
  wait             = false

  depends_on = [
    module.google_kubernetes_engine
  ]
}
```
Deploy it again via Terraform:
```shell
$ terraform apply
```
Now, if you run the kubectl command, you should see something similar to the following:
```shell
$ kubectl get pods -n airflow
NAME                      READY   STATUS    RESTARTS   AGE
airflow-postgresql-0      1/1     Running   0          25m
airflow-redis-0           1/1     Running   0          25m
airflow-scheduler-tvqgq   2/2     Running   0          18m
airflow-statsd-ph5p6      1/1     Running   0          25m
airflow-triggerer-r5q2h   2/2     Running   0          25m
airflow-webserver-lc6gj   1/1     Running   0          25m
airflow-worker-0          2/2     Running   0          25m
```
Once you’ve verified that your pods are up and running, set up port forwarding so you can reach the Airflow webserver:
```shell
$ kubectl port-forward svc/airflow-webserver -n airflow 8080
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
```
Now try connecting to localhost:8080 in your browser.
If you want to customize the Airflow settings, you’ll need to modify the Helm chart. You can do this by downloading and managing the Airflow manifests.yaml file. You can pass the values through the values argument as shown below. Make sure that variables such as repo and branch are set in the YAML file:
```hcl
# helm_main.tf
resource "helm_release" "airflow" {
  name             = "airflow"
  repository       = "https://airflow.apache.org"
  chart            = "airflow"
  version          = "1.9.0"
  namespace        = "airflow"
  create_namespace = true
  wait             = false
  values = [templatefile("../manifests/airflow/values.yaml", {
    repo   = "git@github.com:jybaek/example.git"
    branch = "main"
  })]
}
```
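For a handful of settings, the values can also be built inline instead of through a template file. The sketch below assumes the official chart’s top-level executor key and its config map, which the chart renders into airflow.cfg; the file name, the executor choice, and the parallelism number are placeholders rather than recommendations.

```hcl
# helm_inline_values.tf (hypothetical sketch; a variant of the helm_release resource, not an addition to it)
resource "helm_release" "airflow" {
  name             = "airflow"
  repository       = "https://airflow.apache.org"
  chart            = "airflow"
  version          = "1.9.0"
  namespace        = "airflow"
  create_namespace = true
  wait             = false

  # yamlencode turns the HCL object into the YAML string that values expects.
  values = [yamlencode({
    executor = "KubernetesExecutor" # assumed choice; the chart defaults to CeleryExecutor
    config = {
      core = {
        parallelism = 32 # placeholder value
      }
    }
  })]
}
```

Inline values keep small overrides next to the release definition, while a separate values.yaml tends to scale better once the list of settings grows.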
3: Cloud Composer
The third way is to use Cloud Composer, a fully managed data workflow orchestration service on Google Cloud. As a managed service, Cloud Composer makes it really simple to run Airflow, so you don’t have to worry about the infrastructure on which Airflow runs. It presents fewer options, however. For example, in the uncommon situation where you need to share storage between DAGs, you cannot. You may also need to ensure you balance CPU and memory usage, as you have less ability to customize those options.
Take a look at the code below:
```hcl
# main.tf
module "google_cloud_composer" {
  source           = "./modules/google_cloud_composer"
  environment_size = "ENVIRONMENT_SIZE_SMALL"
  network_id       = module.google_compute_engine.network_id
  subnetwork_id    = module.google_compute_engine.subnetwork_id
  service_account  = module.gcp.service_account_name
  project_id       = var.project_id
  region           = local.region
  service_name     = local.service_name
}
```
If you look at the file stored under the modules directory, you’ll notice that environment_size is passed in and used.
```hcl
# modules/google_cloud_composer/google_composer_environment.tf
resource "google_composer_environment" "this" {
  ...
  config {
    software_config {
      image_version = "composer-2-airflow-2"
    }

    environment_size = var.environment_size

    node_config {
      network         = var.google_compute_network_id
      subnetwork      = var.google_compute_subnetwork_id
      service_account = var.google_service_account_name
    }
  }
}
```
As a side note, you can also restrict a variable to a set of valid values by putting a condition in a validation block, as shown below:
```hcl
# modules/google_cloud_composer/variables.tf
variable "environment_size" {
  description = "environment_size"
  type        = string

  validation {
    condition     = contains(["ENVIRONMENT_SIZE_SMALL", "ENVIRONMENT_SIZE_MEDIUM", "ENVIRONMENT_SIZE_LARGE"], var.environment_size)
    error_message = "Invalid value"
  }
}
```
Note that Cloud Composer also supports Custom mode, which is different from other cloud service providers’ managed Airflow services. In addition to specifying standard environments such as ENVIRONMENT_SIZE_SMALL, ENVIRONMENT_SIZE_MEDIUM, and ENVIRONMENT_SIZE_LARGE, you can also control CPU and memory directly.
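As a rough illustration of Custom mode, the google_composer_environment resource accepts a workloads_config block for Cloud Composer 2 environments. The sketch below is an assumption about how it could be wired up, not the article’s module; the environment name, region, and the CPU, memory, storage, and count values are placeholders to tune for your workload.

```hcl
# Hypothetical sketch: size Composer 2 workloads directly instead of using a preset.
resource "google_composer_environment" "custom" {
  name   = "example-composer-custom" # placeholder name
  region = "us-central1"             # placeholder region

  config {
    software_config {
      image_version = "composer-2-airflow-2"
    }

    workloads_config {
      scheduler {
        cpu        = 0.5
        memory_gb  = 1.875
        storage_gb = 1
        count      = 1
      }
      web_server {
        cpu        = 0.5
        memory_gb  = 1.875
        storage_gb = 1
      }
      worker {
        cpu        = 0.5
        memory_gb  = 1.875
        storage_gb = 1
        min_count  = 1 # lower bound for worker autoscaling
        max_count  = 3 # upper bound for worker autoscaling
      }
    }
  }
}
```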
Now, let’s deploy with Terraform:
```shell
$ terraform apply
```
Now, if you go to the Google Cloud console and look in the Composer menu, you should see the resource you just created:
Finally, let’s connect to Airflow by clicking the link to the Airflow webserver entry above. If you have the correct IAM permissions, you should see something like the screen below:
Wrap up
If you’re going to run Airflow in production, there are three things you need to think about: cost, performance, and availability. In this article, we’ve discussed three different ways to run Apache Airflow on Google Cloud, each with its own personality, pros and cons.
Note that these are the minimum criteria for choosing an Airflow environment. If you’re running a side project on Airflow, coding in Python to create a DAG may be sufficient. However, if you want to run Airflow in production, you’ll also need to properly configure Airflow Core (concurrency, parallelism, SQL pool size, etc.), the Executor (LocalExecutor, CeleryExecutor, KubernetesExecutor, …), and so on. I hope this article will be helpful for those who are thinking about choosing an Airflow environment. Check out the full code on GitHub.