Version: 0.16.16

How to use Great Expectations in Deepnote

In progress

Great Expectations is aware that the fluent datasource and yaml config workflows are undocumented. A new method for connecting to data sources will be added to this topic soon. In the meantime, use the Great Expectations Slack channel if you encounter issues and need assistance.

This piece of documentation was authored by Allan Campopiano.

This guide will help you get started with Great Expectations on Deepnote. You will learn how to validate columns in a Pandas DataFrame, host your data docs, and schedule a pipeline job.

All of this will be accomplished from within a single, ready-to-use notebook, with no prerequisites beyond signing up for a free Deepnote account!

Benefits of Great Expectations in Deepnote

Deepnote provides a "click and play" notebook platform that integrates perfectly with Great Expectations. You can read all about it in this blog post!

Here are some of the notable benefits:

These benefits make Deepnote one of the easiest and fastest ways to get started with Great Expectations.

Steps

1. Begin by importing Great Expectations

Since Great Expectations can be listed in Deepnote's requirements.txt, it will be installed automatically. You can read more about package installation here. This lets us import the required libraries right away.

import pandas as pd
import numpy as np
import great_expectations as gx
from great_expectations.data_context.types.base import (
    DataContextConfig,
    DatasourceConfig,
    FilesystemStoreBackendDefaults,
)
from great_expectations.data_context import BaseDataContext
from great_expectations.checkpoint import SimpleCheckpoint
from great_expectations.core.batch import RuntimeBatchRequest

2. Initialize Great Expectations

The following cell creates a Great Expectations folder in the filesystem which will hold all of the forthcoming project configurations. Note that if this folder already exists, Great Expectations gracefully allows us to continue.

!great_expectations --yes init
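If you would rather check from Python whether the project folder is already in place, a minimal sketch follows. It assumes the default layout, a `great_expectations/` directory containing `great_expectations.yml`; the helper name `project_initialized` is hypothetical and not part of the library.

```python
from pathlib import Path


def project_initialized(root: str = ".") -> bool:
    """Return True if a Great Expectations project folder exists under root.

    Assumes the default layout: <root>/great_expectations/great_expectations.yml
    """
    config_file = Path(root) / "great_expectations" / "great_expectations.yml"
    return config_file.is_file()


print(project_initialized("."))
```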

3. Validate a Pandas DataFrame

In practice, this is where you would bring in your own data. As a placeholder, the cell below creates a DataFrame with random values, so the Expectations we set on it later may pass or fail from run to run.

note

Replace the randomly created DataFrame below with your own datasource.

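import numpy as np
import pandas as pd

# Five random product names (duplicates are possible)
products = np.random.choice(
    [
        "camera",
        "phone",
        "computer",
        "speaker",
        "TV",
        "cable",
        "movie",
        "guitar",
        "printer",
    ],
    size=5,
)

# Five random quantities from 0-9, with None as a possible missing value
quantities = np.random.choice(list(range(10)) + [None], size=5)

# Five random dates between 2020-12-30 and 2021-01-08
dates = np.random.choice(pd.date_range(start="2020-12-30", end="2021-01-08"), size=5)

df = pd.DataFrame({"products": products, "quantities": quantities, "dates": dates})

# Display the DataFrame (pandas has no DataFrame.show method)
df.head()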

Example DataFrame

4. Define Expectations

Expectations can be defined directly on a Pandas DataFrame using gx.from_pandas(df). We're defining three Expectations on our DataFrame:

  1. The products column must contain unique values
  2. The quantities column cannot contain null values
  3. The dates column must have dates between January 1st and January 8th

These Expectations together form an Expectation Suite (a collection of verifiable assertions about data) that will be validated against our data.
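To make the three assertions concrete, here is a plain-Python sketch of what each one checks. It does not use the Great Expectations API: the helper names and the sample rows are hypothetical, and the real Expectations report per-row details rather than a single boolean.

```python
from datetime import date

# Hypothetical sample rows standing in for the random DataFrame.
rows = [
    {"products": "camera", "quantities": 3, "dates": date(2021, 1, 2)},
    {"products": "phone", "quantities": None, "dates": date(2021, 1, 5)},
    {"products": "camera", "quantities": 7, "dates": date(2020, 12, 31)},
]


def values_are_unique(values):
    """True when no value appears more than once."""
    return len(set(values)) == len(values)


def values_are_not_null(values):
    """True when no value is None."""
    return all(v is not None for v in values)


def values_are_between(values, low, high):
    """True when every value lies in the inclusive range [low, high]."""
    return all(low <= v <= high for v in values)


checks = {
    "products unique": values_are_unique([r["products"] for r in rows]),
    "quantities not null": values_are_not_null([r["quantities"] for r in rows]),
    "dates in range": values_are_between(
        [r["dates"] for r in rows], date(2021, 1, 1), date(2021, 1, 8)
    ),
}
print(checks)
```

With these sample rows all three checks fail: "camera" appears twice, one quantity is missing, and one date falls before January 1st.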

tip

Replace the sample Expectations below with those that relate to your data.

You can see all the Expectations available in the gallery.

df = gx.from_pandas(df)

# ~30% chance of passing: all five products must be distinct
df.expect_column_values_to_be_unique("products")

# ~60% chance of passing: no quantity may be None
df.expect_column_values_to_not_be_null("quantities")

# ~33% chance of passing: 8 of the 10 possible dates are in range
df.expect_column_values_to_be_between(
    "dates", "2021-01-01", "2021-01-08", parse_strings_as_datetimes=True
);
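The pass rates for the random placeholder data can be worked out exactly. The sketch below assumes what the data-generation cell implies: five rows, each value drawn independently and uniformly with replacement (the `np.random.choice` default), and a between-check that includes both endpoints.

```python
from fractions import Fraction
from math import perm

n_rows = 5

# products: 5 draws (with replacement) from 9 names must all differ.
p_unique = Fraction(perm(9, n_rows), 9**n_rows)

# quantities: each draw picks from 0..9 plus None (11 options); none may be None.
p_not_null = Fraction(10, 11) ** n_rows

# dates: 10 days from 2020-12-30 to 2021-01-08; 8 of them fall in
# the inclusive range 2021-01-01..2021-01-08.
p_in_range = Fraction(8, 10) ** n_rows

for name, p in [("unique", p_unique), ("not null", p_not_null), ("in range", p_in_range)]:
    print(f"{name}: {float(p):.3f}")
```

Under those assumptions the three probabilities come out to roughly 0.256, 0.621, and 0.328.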

5. Set project configurations

Before we can validate our Expectations against our data, we need to tell Great Expectations more about our project's configuration. Great Expectations keeps track of many configurations with a Data Context (the primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components). These configurations are used to manage aspects of your project behind the scenes.

info

There's a lot going on here, but for the sake of this guide we don't need to worry about the full details.

To learn more, visit the Great Expectations docs.

data_context_config = DataContextConfig(
    datasources={
        "my_datasource": DatasourceConfig(
            class_name="Datasource",
            module_name="great_expectations.datasource",
            execution_engine={
                "class_name": "PandasExecutionEngine",
                "module_name": "great_expectations.execution_engine",
            },
            data_connectors={
                "default_runtime_data_connector_name": {
                    "class_name": "RuntimeDataConnector",
                    "batch_identifiers": ["default_identifier_name"],
                }
            },
        )
    },
    # Store Expectations, Validation Results, and Data Docs on the filesystem
    store_backend_defaults=FilesystemStoreBackendDefaults(
        root_directory="/work/great_expectations"
    ),
)

context = gx.get_context(project_config=data_context_config)

# Save the Expectations defined on df as a named suite, keeping any that failed
context.save_expectation_suite(
    expectation_suite_name="my_expectation_suite",
    expectation_suite=df.get_expectation_suite(discard_failed_expectations=False),
);
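To confirm what was saved, you can read the suite back from disk. This is a sketch under two assumptions that are not stated in this guide: that the filesystem store writes the suite to `expectations/my_expectation_suite.json` under the root directory, and that the JSON holds an `expectations` list whose entries carry `expectation_type` and `kwargs`. The helper name `summarize_suite` is hypothetical.

```python
import json
from pathlib import Path


def summarize_suite(path):
    """Return (expectation_type, column) pairs from a saved suite JSON file.

    Assumes the file has a top-level "expectations" list whose entries
    contain "expectation_type" and "kwargs" keys.
    """
    suite = json.loads(Path(path).read_text())
    return [
        (e["expectation_type"], e.get("kwargs", {}).get("column"))
        for e in suite.get("expectations", [])
    ]
```

In this notebook the call would look something like `summarize_suite("/work/great_expectations/expectations/my_expectation_suite.json")`, if the file is laid out as assumed.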

6. Set up a Batch and Checkpoint

To populate the documentation for our tests (Data Docs: human-readable documentation generated from Great Expectations metadata detailing Expectations, Validation Results, etc.), we need to set up at least one Batch (a selection of records from a Data Asset) and a Checkpoint (the primary means for validating data in a production deployment of Great Expectations).

A Batch is a pairing of data and metadata to be validated. A Checkpoint is a bundle of at least:

  • One Batch (the data to be validated)
  • One Expectation Suite (our Expectations for that data)
  • One Action: a Python class with a run method that takes a Validation Result and does something with it (saving our validation results, rebuilding Data Docs, sending a Slack notification, etc.)

In the cell below, one Batch is constructed from our DataFrame with a RuntimeBatchRequest (an object provided to a Datasource in order to create a Batch).

We then create a Checkpoint, and pass in our batch_request.

When we execute this code, our Expectation Suite is run against our data, validating whether that data meets our Expectations or not. The results are then persisted temporarily until we build our Data Docs.

# Describe the in-memory DataFrame as a Batch for the datasource configured above
batch_request = RuntimeBatchRequest(
    datasource_name="my_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="df",
    runtime_parameters={"batch_data": df},
    batch_identifiers={"default_identifier_name": "df"},
)

checkpoint_config = {
    "name": "my_checkpoint",
    "config_version": 1,
    "class_name": "SimpleCheckpoint",
    "expectation_suite_name": "my_expectation_suite",
}

context.add_checkpoint(**checkpoint_config)

# Run the suite against the Batch
results = context.run_checkpoint(
    checkpoint_name="my_checkpoint",
    validations=[{"batch_request": batch_request}],
    run_id="my_run_id",
)
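Before building Data Docs, it can be handy to list which Expectations failed. The sketch below works on a plain dictionary and assumes a particular shape for the serialized checkpoint result (`run_results` → `validation_result` → `results`, each with `success` and `expectation_config`). That shape and the helper name `failed_expectations` are assumptions for illustration, so verify them against your own output.

```python
def failed_expectations(result_dict):
    """List (expectation_type, column) for every expectation that did not pass."""
    failures = []
    for run in result_dict.get("run_results", {}).values():
        for item in run["validation_result"]["results"]:
            if not item["success"]:
                config = item["expectation_config"]
                failures.append(
                    (config["expectation_type"], config["kwargs"].get("column"))
                )
    return failures


# Hand-written stand-in for a serialized result; real output has more keys.
example = {
    "success": False,
    "run_results": {
        "run-1": {
            "validation_result": {
                "success": False,
                "results": [
                    {
                        "success": True,
                        "expectation_config": {
                            "expectation_type": "expect_column_values_to_not_be_null",
                            "kwargs": {"column": "quantities"},
                        },
                    },
                    {
                        "success": False,
                        "expectation_config": {
                            "expectation_type": "expect_column_values_to_be_unique",
                            "kwargs": {"column": "products"},
                        },
                    },
                ],
            }
        }
    },
}
print(failed_expectations(example))
```

If the result object serializes as assumed, something like `failed_expectations(results.to_json_dict())` would apply the same helper to the checkpoint run.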

7. Build the documentation

Our Data Docs can now be generated and served (thanks to Deepnote Tunneling!) by running the next cell.

context.build_data_docs();

# Uncomment this line to serve up the documentation
#!python -m http.server 8080 --directory great_expectations/uncommitted/data_docs/local_site
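The `http.server` command blocks the cell for as long as it runs. As an alternative, here is a sketch that serves a directory from a background thread using only the Python standard library. The helper name `serve_directory` is hypothetical, and whether Deepnote exposes the chosen port depends on your project settings.

```python
import threading
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def serve_directory(directory, port=8080):
    """Serve `directory` over HTTP in a daemon thread and return the server."""
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    server = ThreadingHTTPServer(("0.0.0.0", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

For this guide the call would be `server = serve_directory("great_expectations/uncommitted/data_docs/local_site")`. Call `server.shutdown()` followed by `server.server_close()` to stop serving.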

When served, the Data Docs site provides the details of each Validation (the act of applying an Expectation Suite to a Batch) we've run and each Expectation Suite we've created.

For example, the following image shows a run where three Expectations were validated against our DataFrame and two of them failed.

Data Docs

Congratulations!
🎉 You've successfully deployed Great Expectations on Deepnote! 🎉

Summary

Deepnote integrates perfectly with Great Expectations, allowing documentation to be hosted and notebooks to be scheduled. Please visit Deepnote to learn more about how to bring tools, teams, and workflows together.