airflow-scheduled-job
Install this demo on an existing Kubernetes cluster:
$ stackablectl demo install airflow-scheduled-job
This demo should not be run alongside other demos.
System requirements
To run this demo, your system needs at least:
- 2.5 cpu units (core/hyperthread)
- 10GiB memory
- 24GiB disk storage
Overview
This demo will:
- Install the required Stackable operators
- Spin up the following data products:
  - PostgreSQL: an open-source database used for Airflow cluster and job metadata
  - Redis: an in-memory data structure store used for queuing Airflow jobs
  - Airflow: an open-source workflow management platform for data engineering pipelines
- Mount two Airflow jobs (referred to as Directed Acyclic Graphs, or DAGs) for the cluster to use
- Enable and schedule the jobs
- Verify the job status with the Airflow Webserver UI
You can see the deployed products and their relationship in the following diagram:

List deployed Stackable services
To list the installed Stackable services, run the following command:
$ stackablectl stacklet list
┌─────────┬─────────┬───────────┬─────────────────────────────────────────────────┬─────────────────────────────────┐
│ PRODUCT ┆ NAME ┆ NAMESPACE ┆ ENDPOINTS ┆ CONDITIONS │
╞═════════╪═════════╪═══════════╪═════════════════════════════════════════════════╪═════════════════════════════════╡
│ airflow ┆ airflow ┆ default ┆ webserver-default-http http://172.19.0.5:30913 ┆ Available, Reconciling, Running │
└─────────┴─────────┴───────────┴─────────────────────────────────────────────────┴─────────────────────────────────┘
When a product instance has not finished starting yet, the service will have no endpoint. Depending on your internet connectivity, creating all the product instances might take considerable time. A warning might be shown if the product is not ready yet.
Airflow Webserver UI
The Airflow Webserver UI lets you inspect, trigger and monitor DAGs. Open the airflow endpoint webserver-default-http in your browser (http://172.19.0.5:30913 in this case).

Log in with the username admin and password adminadmin.
Click on 'Active DAGs' at the top and you will see an overview showing the DAGs mounted during the demo setup (date_demo and sparkapp_dag).

There are two things to notice here. Firstly, both DAGs have been enabled, as shown by the slider on the far right of the screen for each DAG (DAGs are all paused initially and can be activated manually in the UI or via a REST call, as done in the setup for this demo):

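Such an unpause call might look like the following sketch against Airflow's stable REST API, using the endpoint and credentials shown above. It assumes the API accepts basic authentication; the exact call used by the demo's setup job may differ.

```python
# Unpause (enable) a DAG via Airflow's stable REST API -- a sketch, not the demo's exact setup code.
import requests

AIRFLOW_URL = "http://172.19.0.5:30913"  # webserver endpoint from `stackablectl stacklet list`
AUTH = ("admin", "adminadmin")           # demo credentials

response = requests.patch(
    f"{AIRFLOW_URL}/api/v1/dags/date_demo",
    json={"is_paused": False},  # False enables the DAG, True pauses it again
    auth=AUTH,
)
response.raise_for_status()
print(response.json()["is_paused"])  # should now print: False
```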
Secondly, the date_demo job has been busy, with several runs already logged. The sparkapp_dag job has only run once, because the two DAGs have been defined with different schedules.

Clicking on the DAG name and then on Runs will display the individual job runs:

The date_demo job runs every minute. With Airflow, DAGs can be started manually or scheduled to run when certain conditions are fulfilled - in this case, the DAG has been set up to run using a cron table, which is part of the DAG definition.
date_demo DAG
Let’s drill down a bit deeper into this DAG. At the top under the DAG name there is some scheduling information, which tells us that this job will run every minute continuously:

Click on one of the job runs in the list to display the details for the task instances. In the left-side pane the DAG is displayed either as a graph (this job is so simple that it only has one step, called run_every_minute), or as a "bar chart" showing each run.

Click on the run_every_minute box in the centre of the page to select the logs:
In this demo, the logs are not available when the KubernetesExecutor is deployed. See the Airflow Documentation for more details. If you are interested in persisting the logs, take a look at the logging demo.

To look at the actual DAG code, click on Code. Here we can see the crontab information used to schedule the job as well as the bash command that provides the output:

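The code shown there follows a pattern like the sketch below: a single BashOperator that prints the date, scheduled with a one-minute cron expression. This is illustrative only and not the exact file shipped with the demo; the start_date and catchup values in particular are assumptions.

```python
# A sketch of a date-printing DAG scheduled every minute (illustrative, not the demo's exact code).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="date_demo",
    schedule="* * * * *",          # crontab syntax: run once every minute (Airflow 2.4+ style)
    start_date=datetime(2024, 1, 1),
    catchup=False,                 # do not backfill missed intervals
) as dag:
    run_every_minute = BashOperator(
        task_id="run_every_minute",
        bash_command="date",       # the bash command whose output appears in the task logs
    )
```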
sparkapp_dag DAG
Go back to the DAG overview screen. The sparkapp_dag job has a schedule of None and a last-execution time. This allows a DAG to be executed exactly once, with neither schedule-based runs nor any backfill. The DAG can always be triggered manually again via REST or from within the Webserver UI.
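For example, a manual run could be triggered with a call along these lines, again a sketch against Airflow's stable REST API using the demo endpoint and credentials:

```python
# Trigger a one-off run of sparkapp_dag via the REST API (a sketch using the demo credentials).
import requests

AIRFLOW_URL = "http://172.19.0.5:30913"
AUTH = ("admin", "adminadmin")

response = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/sparkapp_dag/dagRuns",
    json={"conf": {}},  # no extra run configuration is passed here
    auth=AUTH,
)
response.raise_for_status()
print(response.json()["dag_run_id"], response.json()["state"])  # e.g. a generated run id in "queued" state
```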

By navigating to the graphical overview of the job, we can see that the DAG has two steps: one to start the job, which runs asynchronously, and another to poll the running job to report on its status.

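Conceptually, that submit-then-poll pattern could be expressed as in the sketch below. This is not the demo's actual DAG: the task names, namespace, SparkApplication name, custom-resource coordinates and status fields are all assumptions, and the real manifest contents are omitted.

```python
# A sketch of the submit-then-poll pattern behind sparkapp_dag (illustrative; names and CRD
# coordinates are assumptions, and the demo's own operators may differ).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.python import PythonSensor
from kubernetes import client, config

GROUP, VERSION, PLURAL = "spark.stackable.tech", "v1alpha1", "sparkapplications"  # assumed CRD coordinates
NAMESPACE, NAME = "default", "pyspark-pi"                                         # assumed names


def submit_spark_app():
    """Create the SparkApplication custom resource; the Spark job then runs asynchronously."""
    config.load_incluster_config()
    manifest = {
        "apiVersion": f"{GROUP}/{VERSION}",
        "kind": "SparkApplication",
        "metadata": {"name": NAME, "namespace": NAMESPACE},
        "spec": {},  # the real spec (image, mainApplicationFile, resources, ...) is omitted here
    }
    client.CustomObjectsApi().create_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL, manifest)


def spark_app_done():
    """Poll the resource until it reports a terminal phase (status field names are assumptions)."""
    config.load_incluster_config()
    obj = client.CustomObjectsApi().get_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL, NAME)
    return obj.get("status", {}).get("phase") in ("Succeeded", "Failed")


with DAG(dag_id="sparkapp_dag", schedule=None, start_date=datetime(2024, 1, 1), catchup=False) as dag:
    submit = PythonOperator(task_id="spark_pi_submit", python_callable=submit_spark_app)
    monitor = PythonSensor(task_id="spark_pi_monitor", python_callable=spark_app_done, poke_interval=30)
    submit >> monitor
```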