Data quality has become a much-discussed topic in the fields of data engineering and data science, and it has become clear that ensuring data quality is absolutely crucial to avoiding a case of “garbage in – garbage out”. Apache Airflow and dbt (data build tool) are some of the most prominent tools in the open-source data engineering ecosystem, and while dbt offers some data testing capabilities, enhancing the pipeline with data validation through the open-source framework Great Expectations can add additional layers of robustness.

This blog post explains why data validation is crucial for data teams, provides a brief introduction to Great Expectations, and outlines a convenient pattern for using these tools together to build a robust data pipeline in what we’ve been calling the “dAG stack.”

Why you should test your data

“Our stakeholders would notice data issues before we did… which really eroded trust in the data and our team.” (A Great Expectations user)

Despite the growing sophistication of tools in the data engineering ecosystem in recent years, managing data quality in data pipelines is still often a struggle for data teams. On the stakeholder side, poor data quality erodes the trust stakeholders place in a system, which in turn undermines their ability to make decisions based on it. Even worse, data quality issues that go unnoticed can lead to incorrect conclusions and wasted time rectifying those problems. On the engineering side, scrambling to fix data quality problems that were first noticed by downstream consumers is one of the biggest drains on a team’s time, and it slowly erodes productivity and morale. It’s clear that ensuring data quality through data validation is absolutely necessary for the success of any data-driven organization – but with almost limitless possibilities of what can be tested and how, what are some guiding principles for building a robust data pipeline?

An open-source data engineering stack for robust data pipelines

A number of data engineering frameworks have established themselves as leaders in the modern open-source data stack: 

  • dbt (data build tool) is a framework that allows data teams to quickly iterate on building data transformation pipelines using templated SQL. 
  • Apache Airflow is a workflow orchestration tool that enables users to define complex workflows as “DAGs” (directed acyclic graphs) made up of various tasks, as well as schedule and monitor execution.
  • Great Expectations is a Python-based open-source data validation and documentation framework.

What is Great Expectations?

With Great Expectations, data teams can express what they “expect” from their data using simple assertions. Great Expectations supports different data backends, such as flat files, SQL databases, and Pandas and Spark dataframes, and comes with built-in notification and data documentation functionality.

For example, in order to express that a column “passenger_count” should have integer values between 1 and 6, one would write:

expect_column_values_to_be_between(
    column="passenger_count", min_value=1, max_value=6
)

These “Expectations” are typically created in an “interactive” mode in which users connect to an existing data set as a “reference” and receive immediate feedback on whether the statement matches what’s observed in the dataset. The Expectations are then stored and can be loaded every time a new “batch” of data is being validated. In addition to the manual creation of tests, Great Expectations also offers a built-in “scaffolding” mode, which profiles a reference data set and creates Expectations based on what is observed in the data. This allows users to quickly build an entire suite of tests for a dataset without going through an entirely manual process.
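Conceptually, each Expectation is a declarative check that returns a structured result. The toy function below (pure Python, not Great Expectations’ actual implementation) illustrates that contract: it reports whether the check succeeded, along with the values that fall outside the expected range. The record values are made up for illustration.

```python
def expect_column_values_to_be_between(rows, column, min_value, max_value):
    """Toy sketch of an Expectation's contract -- not the real
    Great Expectations implementation."""
    unexpected = [
        row[column]
        for row in rows
        if not (min_value <= row[column] <= max_value)
    ]
    return {
        "success": len(unexpected) == 0,
        "result": {
            "unexpected_count": len(unexpected),
            "unexpected_values": unexpected,
        },
    }

# A "batch" of taxi ride records, one of which violates the expectation
rides = [{"passenger_count": 2}, {"passenger_count": 4}, {"passenger_count": 9}]
result = expect_column_values_to_be_between(rides, "passenger_count", 1, 6)
print(result["success"])                      # False
print(result["result"]["unexpected_values"])  # [9]
```

The real library returns a much richer result object (including, for example, the fraction of unexpected values) and persists Expectations in a suite so they can be re-run against every new batch of data.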

Putting it all together

A typical pipeline using this “dAG” stack may look like the above image:

  • Validate the source data (e.g. a CSV file on a web server, or a table in another database) with a Great Expectations Airflow operator.
  • Load the data using Python tasks in the Airflow DAG.
  • Validate that the data was loaded correctly with dbt or Great Expectations.
  • Execute transformations with a dbt task.
  • Test the correctness of those transformations with dbt or Great Expectations, notifying users and/or halting the pipeline if any data quality issues are detected.
  • Load the data to a production environment with another Python task.

Great Expectations also automatically generates data quality reports (“Data Docs”) at each validation run that can be shared with data stakeholders.

The code snippet below shows a simplified Airflow DAG definition with a task for each of the steps mentioned above. It uses both the Great Expectations and dbt Airflow operators, which provide convenient interfaces to the core functionality of both tools. The Great Expectations tasks make use of pre-configured Checkpoints, which bundle a data asset with an Expectation Suite.

import airflow
from airflow_dbt import DbtRunOperator
from airflow.operators.python import PythonOperator
from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator

default_args = {
   "owner": "Airflow",
   "start_date": airflow.utils.dates.days_ago(1)
}

dag = airflow.DAG(
   dag_id='sample_dag',
   default_args=default_args,
   schedule_interval=None,
)

def load_source_data():
    # Implement load to database
    pass

def publish_to_prod():
    # Implement load to production database
    pass

task_validate_source_data = GreatExpectationsOperator(
   task_id='validate_source_data',
   checkpoint_name='source_data.chk',
   dag=dag
)

task_load_source_data = PythonOperator(
   task_id='load_source_data',
   python_callable=load_source_data,
   dag=dag,
)

task_validate_source_data_load = GreatExpectationsOperator(
   task_id='validate_source_data_load',
   checkpoint_name='source_data_load.chk',
   dag=dag
)

task_run_dbt_dag = DbtRunOperator(
   task_id='run_dbt_dag',
   dag=dag
)

task_validate_analytical_output = GreatExpectationsOperator(
   task_id='validate_analytical_output',
   checkpoint_name='analytical_output.chk',
   dag=dag
)

task_publish = PythonOperator(
   task_id='publish',
   python_callable=publish_to_prod,
   dag=dag
)

task_validate_source_data.set_downstream(task_load_source_data)
task_load_source_data.set_downstream(task_validate_source_data_load)
task_validate_source_data_load.set_downstream(task_run_dbt_dag)
task_run_dbt_dag.set_downstream(task_validate_analytical_output)
task_validate_analytical_output.set_downstream(task_publish)
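The set_downstream calls above are what turn the individual tasks into a DAG: Airflow derives a valid execution order from those edges via a topological sort. The stdlib-only snippet below sketches that idea for the pipeline defined above – it is purely illustrative, not how Airflow’s scheduler is implemented:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each task maps to the set of tasks that must run before it,
# mirroring the set_downstream calls in the DAG above.
upstream = {
    "load_source_data": {"validate_source_data"},
    "validate_source_data_load": {"load_source_data"},
    "run_dbt_dag": {"validate_source_data_load"},
    "validate_analytical_output": {"run_dbt_dag"},
    "publish": {"validate_analytical_output"},
}

order = list(TopologicalSorter(upstream).static_order())
print(order)
# ['validate_source_data', 'load_source_data', 'validate_source_data_load',
#  'run_dbt_dag', 'validate_analytical_output', 'publish']
```

Because every task depends on the one before it, a failed validation task stops everything downstream – which is exactly the “halt the pipeline on data quality issues” behavior described earlier.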

Wrapping Up on Great Expectations

In this post, I provided a brief overview of the components of a go-to stack for building a robust data pipeline, and gave a short introduction to Great Expectations, an open-source data validation framework. I will go into more detail on the types of tests that can be implemented in such a pipeline in my upcoming talk at ODSC East 2021, titled “Building a Robust Data Pipeline with the ‘dAG stack’: dbt, Airflow, and Great Expectations.”


About the author/ODSC East 2021 speaker:

Sam Bail is a data professional with a passion for turning high-quality data into valuable insights. Sam holds a PhD in Computer Science and has worked for several data-focused startups. In her current role as Engineering Director at Superconductive, she works on “Great Expectations”, an open-source Python library for data validation and documentation.