Data Quality with Great Expectations in Python: Effective Code Examples
- Claude Paugh
In data science and analytics, ensuring data quality is extremely important. Poor data quality can result in misleading insights, poor decisions, and a significant loss of trust in data-driven processes. A valuable tool to help data professionals maintain high standards of data accuracy is the Great Expectations framework. In this post, we will explore how to implement Great Expectations in Python (GX requires Python 3.9 to 3.13), along with practical code examples to help you master data quality in your projects.
What is Great Expectations?

Great Expectations is an open-source Python library designed to assist data teams in creating, managing, and maintaining data quality expectations. This framework enables users to define specific expectations about their data and validate datasets against these expectations. One key feature is its ability to generate documentation that communicates data quality metrics effectively. By utilizing Great Expectations, data teams can ensure their data is not only accurate but also complete and reliable.
For instance, you might specify that a column should have unique values or that a numerical column must fall within a defined range. Automating the validation of these expectations allows teams to spot issues early, preventing any harmful impact on analyses.
Setting Up Great Expectations
To begin using Great Expectations in your Python environment, you can easily install the library using pip:
-->bash
pip install great_expectations
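To confirm the installation succeeded, you can typically check the installed version from your terminal:
-->bash
great_expectations --version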
Once installed, create a new Great Expectations project by executing the following command in your terminal:
-->bash
great_expectations init
This command creates a directory named `great_expectations` in your current working directory, complete with the necessary files and folders to get started.
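The exact contents vary by Great Expectations version, but the generated scaffold looks roughly like this:
-->bash
great_expectations/
├── great_expectations.yml   # central project configuration
├── expectations/            # saved Expectation Suites
├── checkpoints/             # Checkpoint configurations
├── plugins/                 # custom extensions
└── uncommitted/             # local-only files such as Data Docs and credentials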
Creating a Data Context
The Data Context serves as the central configuration for your Great Expectations project, encompassing all configurations, expectations, and data sources. The `init` command above creates the Data Context for you; to attach a data source to it, navigate to the directory containing `great_expectations` and run:
-->bash
great_expectations datasource new
This interactive setup walks you through connecting a data source. For example, you can connect to popular SQL databases, read data from CSV files, pull data from cloud storage such as Amazon S3 and Google Cloud Storage, or work directly with Apache Spark and Pandas dataframes.
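If your data already lives in a Pandas dataframe, you can also skip datasource configuration entirely and wrap the dataframe directly. Here is a minimal sketch using the legacy Pandas API; the column names and values are purely illustrative:
-->python
import pandas as pd
import great_expectations as ge

# Wrap an in-memory dataframe as a Great Expectations dataset
df = pd.DataFrame({"user_id": [1, 2, 3], "age": [25, 40, 33]})
data = ge.from_pandas(df)
The resulting object accepts the same expectation methods shown in the next section.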
Defining Expectations
Once your Data Context is established, you can start setting expectations for your data. Suppose you have a CSV file of user data and want to verify that the `email` column contains valid email addresses. Here's how you can define this expectation:
-->python
import great_expectations as ge

# Load the CSV into a Great Expectations dataset (legacy Pandas API)
data = ge.read_csv("path/to/your/user_data.csv")

# Define an expectation for the email column; expectations called on the
# dataset accumulate into its expectation suite
data.expect_column_values_to_be_in_set("email", ["valid_email@example.com", "another_valid@example.com"])

# Save the accumulated expectations as a reusable suite
data.save_expectation_suite("user_data_expectations.json")
In this scenario, we load user data from a CSV file, specify that the `email` column may only contain specific valid addresses, and save the resulting expectation suite so it can be reused for later validation runs.
Validating Data
After you set your expectations, the next step is to validate your data against them. Use the following code for validation:
-->python
# Validate the dataset against the suite saved earlier
results = data.validate(expectation_suite="user_data_expectations.json")
print(results)
The `validate` method generates a results object, revealing which expectations passed and which did not. This enables swift identification of any data quality concerns.
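If you want to drill into individual failures programmatically, the result object behaves like a dictionary. The following is a minimal sketch based on the result structure of the legacy API:
-->python
# Print the expectation type and arguments for each failed check
for result in results["results"]:
    if not result["success"]:
        config = result["expectation_config"]
        print(config["expectation_type"], config["kwargs"])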
Trigger Actions on Validation Results
Great Expectations provides Actions for common workflow triggers, such as sending emails and updating Data Docs. If these don't meet your needs, you can create a custom Action to integrate with other tools or apply custom business logic based on Validation Results.
Example use cases for custom Actions include:
- Creating tickets in an issue tracker when Validation runs fail.
- Triggering different webhooks.
- Running additional ETL jobs to back-fill missing values.
A custom Action can handle anything that can be done with Python code. There is also a concept of "checkpointing", where a specific validation item or set can trigger a series of Actions on either the success or failure of the Validation run.
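As a sketch of the idea, the snippet below implements the webhook use case in plain Python on top of a validation result. The function name and webhook URL are hypothetical, and a production setup would normally wire this logic into a Checkpoint's action list instead:
-->python
import json
import urllib.request

def notify_on_failure(results, webhook_url):
    # Hypothetical custom action: post failed expectation types to a webhook
    if results["success"]:
        return
    failed = [r["expectation_config"]["expectation_type"]
              for r in results["results"] if not r["success"]]
    payload = json.dumps({"text": f"Validation failed: {failed}"}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)

# Example usage with the results object from the previous section
notify_on_failure(results, "https://example.com/hooks/data-quality")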
Generating Documentation
A standout feature of Great Expectations is its capability to create documentation for your established expectations. This documentation is beneficial for sharing data quality metrics with stakeholders. To generate documentation, run:
-->bash
great_expectations suite edit user_data_expectations
This command opens a Jupyter notebook in your browser where you can view and edit the expectations in the suite. You can also produce a static HTML report by executing:
-->bash
great_expectations docs build
This command builds the Data Docs site as static HTML (by default under `great_expectations/uncommitted/data_docs/`), enhancing visibility for all stakeholders.

Advanced Expectations
Great Expectations supports a range of expectations that go beyond simple checks. Here are a few advanced examples you might consider:
Checking Column Values Against a Regular Expression
If you want to verify that all email addresses in the `email` column are valid, you can apply a regular expression:
-->python
# Match a simple email pattern against every value in the column
data.expect_column_values_to_match_regex("email", r"^[\w\.-]+@[\w\.-]+\.\w+$")
Ensuring Column Values Are Unique
To confirm that a column comprises unique values, you can use the following expectation:
-->python
data.expect_column_values_to_be_unique("user_id")
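Relatedly, if uniqueness only holds across a combination of columns, Great Expectations offers a compound-column variant. The column names here are illustrative:
-->python
# Require that each (user_id, email) pair appears at most once
data.expect_compound_columns_to_be_unique(["user_id", "email"])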
Validating Numeric Ranges
To ensure that a numerical column, such as age, stays within a specific range, consider this example:
-->python
data.expect_column_values_to_be_between("age", min_value=18, max_value=100)
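Most column-level expectations also accept a `mostly` parameter, which turns a strict rule into a tolerance threshold. For example, to pass as long as at least 95% of ages fall in range:
-->python
# Pass the expectation when at least 95% of non-null values are in range
data.expect_column_values_to_be_between("age", min_value=18, max_value=100, mostly=0.95)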
Integrating Great Expectations with Data Pipelines
Great Expectations can be seamlessly integrated into your data pipelines. Suppose you use Apache Airflow; you can easily create a task that validates your data with Great Expectations. Here is a simple example:
-->python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path
import great_expectations as ge

def validate_data():
    # Load the latest data and validate it against the saved expectation suite
    data = ge.read_csv("path/to/your/user_data.csv")
    results = data.validate(expectation_suite="user_data_expectations.json")
    if not results["success"]:
        raise ValueError("Data validation failed!")

dag = DAG("data_validation_dag", start_date=datetime(2023, 1, 1))

validate_task = PythonOperator(
    task_id="validate_data",
    python_callable=validate_data,
    dag=dag,
)
This snippet defines an Airflow DAG with a single task that validates the data. If validation fails, an error is raised, which can prompt alerts or trigger other necessary actions in your data pipeline.
Ongoing Data Quality Monitoring
Data quality is not a one-time effort; it requires continuous oversight. Great Expectations offers tools to help you consistently track your data's quality. You can establish a monitoring system that regularly validates your data and alerts you to emerging issues. Using Actions as part of your monitoring can automate notifications, save results, or run backfill jobs.
For example, by scheduling a daily job to run your validation scripts, you can record the results systematically. This helps you track trends in data quality over time and address problems before they escalate.
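A plain cron entry is one minimal way to do this; the script path and log location below are hypothetical:
-->bash
# Run the validation script daily at 06:00 and keep a log of the results
0 6 * * * /usr/bin/python3 /opt/jobs/validate_user_data.py >> /var/log/gx_validation.log 2>&1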
Wrapping Up
By implementing the Great Expectations framework in your Python projects, you can significantly enhance your approach to data quality management. Defining expectations, validating data, and generating documentation ensures your data remains accurate and trustworthy.
The code examples provided in this post lay a solid groundwork for utilizing Great Expectations in your own initiatives. Keep in mind that maintaining data quality is an ongoing journey, and tools like Great Expectations are invaluable in achieving high standards for your data-driven efforts.
As you explore Great Expectations further, think about integrating it into your data pipelines and monitoring systems to safeguard the trustworthiness of your data over time. Happy coding!