Automate DB Backup with Apache Airflow

If, like me, you have administered a database and are tired of manually running backup scripts, or worse, you rely on a cron job to automate the backup, something goes wrong, the backup silently fails, and you have to log into the server and go through the tedious work of checking the logs (what a frustration... I have been there), then worry no more: Apache Airflow is here to save the day. In this article, we will learn how to use Apache Airflow to automate a database backup workflow.

Who should read this article?

This article is intended for developers working in operations, DevOps engineers, and anyone interested in automation.

Pre-requisites

To make use of this tutorial, you should be familiar with or have the following installed:

• Knowledge of Python

• Knowledge of basic shell programming

• Have Python 3 installed

• Have pip installed

• Postgres database installed

What is Apache Airflow?

Apache Airflow is an open-source workflow management platform used to build and automate data pipelines. It is built on top of Python, which means all workflows are defined in pure Python. It shines for its web GUI, which visualizes all the tasks involved in a workflow, their related logs and much more. This makes Apache Airflow much more convenient than traditional tools such as cron jobs. For more information about Airflow and its architecture check here (airflow.apache.org).

For the sake of this tutorial, we will use Postgres database as our database of choice.

This article is broken down into two parts. The first part shows the setting up of the database (if you already have a database set up, you can skip to part two). Part two walks through configuring Apache Airflow to automate the database backup.

Part one: Setting up Postgres database

1. Create a database

Create a database in Postgres. In this tutorial I am naming the database airflow_demo. Use the create database command to create it:

create database airflow_demo;

2. Create a user and grant access to the database

Next, create a user with a password and then grant the user all privileges on the database. Refer to the commands below.

create user airflow_user with encrypted password 'password';

grant all privileges on database airflow_demo to airflow_user;

3. Populate the database with sample data

Let’s now fill the database with sample data. First, we will connect to the database, then create a table and finally add some data to that table.

\c airflow_demo

CREATE TABLE users (id serial PRIMARY KEY, username VARCHAR(20) NOT NULL, password VARCHAR(15) NOT NULL, email VARCHAR(20) NOT NULL);

INSERT INTO users (username, password, email) VALUES ('user1','12344','example@mail.com'), ('user2', '12345', 'example2@example.com');

To verify that the data was inserted, use the select command as shown below:

select * from users;
 id | username | password |        email         
----+----------+----------+----------------------
  1 | user1    | 12344    | example@mail.com
  2 | user2    | 12345    | example2@example.com
(2 rows)

Part two: Configuring Apache Airflow

Now that we have the database set up and have initial data in it, let's get to the main part, which is automating the backup process.

1. Write a script to dump the database

First, let's write a script that dumps the database we just set up. We will create a bash script named backup.sh and then add the command to dump the database. The code below does the job.

#!/bin/bash

cd ~/Documents/Technical-Writing/airflow # replace this with the path to your working directory
date_today="$(date '+%Y-%m-%d')"
pg_dump -d airflow_demo > "airflow_dump-${date_today}.sql"

For the script to run, you need to enable execution permission. Run chmod +x path_to_backup.sh in the terminal to add the permission, then run the script using ./path_to_working_dir/backup.sh. After running the script you should see a file named airflow_dump-YYYY-MM-DD.sql. For this example, the created file is airflow_dump-2022-12-18.sql.

Now that we have our script ready, let’s go ahead and set up the Apache workflow.

2. Install and setup Apache Airflow

Run the following commands in your terminal to install Apache Airflow

####### create and activate virtual environment #######
# Create a virtual environment
python3 -m venv path_to_env_directory

# activate the virtual environment
source path_to_env_directory/bin/activate
####### Install Airflow #######
# Airflow needs a home. `~/airflow` is the default, but you can put it
# somewhere else if you prefer (optional)
export AIRFLOW_HOME=~/path_to/airflow

# Install Airflow using the constraints file
AIRFLOW_VERSION=2.5.0
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"

pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"

Airflow should now be installed, with some initial files in a structure like the one below:

├── airflow.cfg
├── logs
│   └── scheduler
│       ├── 2022-12-17
│       ├── 2022-12-18
│       └── latest
└── webserver_config.py

airflow.cfg is the Airflow configuration file. This is where Airflow settings are configured, such as the database backend, timezones, executors and much more. For more info about the Apache Airflow configuration file check here.

The logs directory contains all the logs related to tasks, the scheduler, workers and much more.

webserver_config.py contains the default configuration of the Airflow webserver that serves the GUI. This is where you can configure the UI themes, authentication and much more.

We will use a Postgres database as the Airflow storage backend and the LocalExecutor, which enables us to run tasks in parallel. For more information on how to set up database backends check here and Executors here.

To configure the storage backend and executor, modify the following settings in airflow.cfg. You can create a separate database for Airflow using steps 1 & 2 in part one. Setting up the Postgres storage backend requires the psycopg2 driver; to install it, run pip install psycopg2.

sql_alchemy_conn = postgresql+psycopg2://airflow_user:password@localhost/airflow_db

executor = LocalExecutor

Now let's test our Airflow setup:

# initialise the database
airflow db init 
# create airflow admin user
airflow users create \
    --username admin \
    --firstname Test \
    --lastname User \
    --role Admin \
    --email email@example.org
# you will get a prompt to enter the password
# start the webserver
airflow webserver -p 8080 # you can use any open port
# start the scheduler in a separate terminal
airflow scheduler

If you navigate to http://localhost:8080/ you should be directed to the Airflow login page (screenshot below).

airflow login

Now that Airflow is set up, let's create the workflow to back up the database.

3. Create the database backup workflow

We are going to set up the backup workflow. First, we will back up the database using the script we created earlier, then check if the SQL dump file was created and finally upload it to a backup server.

To create the workflow, we must create a folder called dags in the working directory. DAGs (Directed Acyclic Graphs) in Apache Airflow are collections of tasks that you want to run, organised in a manner that reflects their order and dependencies. Essentially, in Airflow, DAGs are created and defined using Python and are placed inside the dags directory. After creating the directory, the new structure should look like this:

├── airflow-webserver.pid
├── airflow.cfg
├── dags
├── logs
│   └── scheduler
│       ├── 2022-12-17
│       ├── 2022-12-18
│       ├── 2022-12-19
│       └── latest -> 2022-12-19
└── webserver_config.py

3.1. Create database dump task

Let's create a file named db_backup.py inside the dags folder. This is where we will define all the tasks for running the database backups. Add the following code inside the new file:

# db_backup.py
import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

dag = DAG(
    'db_backup',
    start_date=datetime.datetime(2015, 6, 1),
    schedule_interval='30 4 * * *',  # schedule to run every day at 4:30 AM
    catchup=False,
)

with dag:
  dump_db = BashOperator(
    task_id="dump_db",
    # "scripts" folder is under "path_to/airflow/dags"
    bash_command="scripts/backup.sh",
    dag=dag,
  )
  dump_db

Note that we have moved backup.sh into a new folder named scripts inside the dags directory.

Airflow has a concept called operators, which are the building blocks that define what a task actually does. There are many different types of operators, including the BashOperator. As the name indicates, the BashOperator runs bash commands and scripts. Each operator is given a task_id that is unique to that specific task. Specific to the BashOperator, we pass bash_command, which tells the operator which command or script to run. Here, we are telling the BashOperator to run the script named backup.sh located in the scripts directory; the path is given relative to the DAG file db_backup.py. For additional information on operators, check the documentation here.
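As a side note, bash_command is templated with Jinja, so you could also hand Airflow's logical date to the script instead of computing the date inside backup.sh. The snippet below is only a minimal sketch of that idea: it assumes backup.sh is adapted to read the date as its first argument (for example date_today="${1:-$(date '+%Y-%m-%d')}") and that the absolute path is replaced with your own dags/scripts location.

# Sketch only (not part of the original DAG): passing Airflow's logical date
# to the backup script via Jinja templating. Assumes backup.sh reads the date
# as its first argument and that the path below is replaced with your own.
dump_db_for_date = BashOperator(
    task_id="dump_db_for_date",
    # {{ ds }} renders as the run's logical date in YYYY-MM-DD format
    bash_command="~/path_to/airflow/dags/scripts/backup.sh {{ ds }}",
    dag=dag,
)

This keeps the dump filename aligned with the Airflow run that produced it, which is handy when re-running a particular day.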

3.2. Running database dump task

Now that we have added the code to run the database backup, let's run the database dump workflow using the Airflow GUI. To do so, we first start the Airflow webserver and scheduler. Run the commands below:

airflow webserver
airflow scheduler # run this in a separate terminal

After starting the webserver, use the credentials created in step 2 to log in to the Airflow dashboard.

airflow dashboard

After logging into Airflow, you should see a dashboard like the one in the screenshot above. To run the database backup workflow, follow the steps indicated in the screenshot:

  1. On the dashboard, a listing of the available DAGs will show. Click on the play button on the far right corner of the db_backup DAG.

  2. When you click the play button, a popup will appear; choose Trigger DAG. This will run our db_backup DAG.

  3. If the DAG runs successfully, you should see a green status circle in the Runs column. This column highlights the number of runs for a particular DAG and their status. Now check your working directory (the one referenced in backup.sh) to confirm that the database dump was generated.

Note that the version of Airflow used at the time of writing is 2.5.0; if you are using a different version, you may see a slightly different interface than the one in this article.

3.3. Create database upload task

Next, we are going to create the task to upload the database dump to our backup server. For this task, we will use the SFTPOperator, which uses the SFTP protocol to upload files to a remote server. Make sure to have a remote server set up and accessible via ssh on port 22. Add the following code to db_backup.py:

# db_backup.py
import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.sftp.operators.sftp import SFTPOperator
from airflow.providers.ssh.hooks.ssh import SSHHook

dag = DAG(
    'db_backup',
    start_date=datetime.datetime(2015, 6, 1),
    schedule_interval='30 4 * * *',  # schedule to run every day at 4:30 AM
    catchup=False,
)

with dag:
  dump_db = BashOperator(
    task_id="dump_db",
    # "scripts" folder is under "path_to/airflow/dags"
    bash_command="scripts/backup.sh",
    dag=dag,
  )

  # add new code here
  date_today = datetime.date.today().strftime("%Y-%m-%d")
  file_name = f'airflow_dump-{date_today}.sql'
  # replace this with the path to your working directory (the one used in backup.sh)
  db_filepath = f'/Users/arnold/Documents/Technical-Writing/airflow/{file_name}'
  remote_dir = "/home/arnold/db_backup"
  db_remote_upload = SFTPOperator(
    task_id=f'{file_name}_upload',
    ssh_hook=SSHHook(ssh_conn_id="backup_server"),
    local_filepath=db_filepath,
    remote_filepath=f'{remote_dir}/{file_name}',
    operation="put",
    dag=dag,
  )

  dump_db >> db_remote_upload

From the code above, we pass a couple of arguments to the SFTPOperator:

  • task_id: unique identifier for this particular task. This id will be used in logs.

  • ssh_hook: uses an SSHHook to connect to the backup server where we want to upload our DB backup.

  • ssh_conn_id: the ID of an Airflow connection that holds the ssh login details such as the server IP, username and password. We will configure this connection in the Airflow GUI in step 3.4.

  • local_filepath: absolute path where our database dump file is stored

  • remote_filepath: absolute path on the remote server where we want to store our database dump.

  • operation: operation method to use to upload the DB file. We use put here.

  • dag: specifies which dag this task instance will run under

dump_db >> db_remote_upload is the line where we define the order in which the tasks run. Here db_remote_upload depends on the dump_db task completing first. The order is defined using the bitshift operator >>.
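Earlier we said we would also check that the SQL dump file was created before uploading it. One way to sketch that check, which is not in the original DAG, is a small PythonOperator placed between the two tasks. The task and function names below are placeholders of my own, the snippet reuses the db_filepath variable defined above, and the dependency line changes accordingly.

# Sketch only: fail fast if the dump file is missing so the upload never
# runs against a non-existent file. Reuses db_filepath from the DAG above;
# the task and function names are placeholders.
import os

from airflow.operators.python import PythonOperator


def _check_dump_exists(path):
    if not os.path.exists(path):
        raise FileNotFoundError(f"Database dump not found at {path}")


check_dump = PythonOperator(
    task_id="check_dump_exists",
    python_callable=_check_dump_exists,
    op_kwargs={"path": db_filepath},
    dag=dag,
)

# the dependency line then becomes:
dump_db >> check_dump >> db_remote_upload

If the check raises, the task fails and the upload task will not run.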

Be sure to install ssh and sftp providers if they are not installed. To do so run the following commands:

pip install apache-airflow-providers-sftp
pip install apache-airflow-providers-ssh

3.4. Configure ssh_conn_id in the Airflow GUI

Before we test our workflow, let's first configure the backup server connection so that Airflow knows where to upload our db dump. Follow the steps in the screen capture below. I have omitted the server host IP address for security reasons.

setting connection

You should now have the backup server ssh info set up. Airflow will resolve the ssh_conn_id argument we passed to the SSHHook using the connection details we just configured. You can pass the private key in the extras field if you prefer to use ssh keys instead of a password. For more information on how to set up the ssh connection check here.

3.5. Run the DB backup workflow

Now that everything is configured, let's run the workflow to test if everything works well.

airflow webserver
airflow scheduler

Follow the same steps described in section 3.2 (Running database dump task) to trigger the workflow from the GUI.

Booom!!!! Congratulations, your database dump should now be uploaded to the backup server.

Final thoughts:

It is usually good practice to store backups on a managed storage service such as Amazon S3 or DigitalOcean Spaces; using a plain backup server was purely for the purposes of this article (see the sketch below for an S3 variant). I hope you enjoyed this article.
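For example, if you move the backups to Amazon S3, the Amazon provider ships a transfer operator that can stand in for the SFTP task. The snippet below is only a sketch: it assumes apache-airflow-providers-amazon is installed, that an aws_default connection is configured in the Airflow UI, and that my-db-backups is a placeholder bucket name; file_name and db_filepath are the variables from the DAG above.

# Sketch only: uploading the dump to S3 instead of an SFTP backup server.
# "my-db-backups" and the destination key are placeholders.
from airflow.providers.amazon.aws.transfers.local_to_s3 import (
    LocalFilesystemToS3Operator,
)

db_s3_upload = LocalFilesystemToS3Operator(
    task_id="db_s3_upload",
    filename=db_filepath,                # local dump file to upload
    dest_key=f"db_backups/{file_name}",  # key (path) inside the bucket
    dest_bucket="my-db-backups",
    aws_conn_id="aws_default",
    replace=True,                        # overwrite if the key already exists
    dag=dag,
)

dump_db >> db_s3_upload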

Happy automation!!!!!!