How to Validate data with an in-memory Checkpoint
This guide demonstrates how to Validate data using a Checkpoint that is configured and run entirely in memory. This approach suits situations where you cannot or do not want to use a Checkpoint Store, for example in a hosted environment.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- A Data Context
- An Expectation Suite
- A Datasource
- A basic understanding of Checkpoints
We recommend reading our guide on Deploying Great Expectations in a hosted environment without file system or CLI, which covers the setup, data connection, and Expectation creation steps that take place before this process.
Steps
1. Import the necessary modules
The recommended method for creating a Checkpoint is to use the CLI to open a Jupyter Notebook which contains code scaffolding to assist you with the process. This guide assumes that you need an in-memory Checkpoint because you cannot use the CLI or access a filesystem, so you will have to provide that scaffolding yourself.
In the script that you are defining and executing your Checkpoint in, enter the following code:
import great_expectations as gx
from great_expectations.checkpoint import Checkpoint
Importing great_expectations will give you access to your Data Context, while we will configure an instance of the Checkpoint class as our in-memory Checkpoint.
If you are planning to use a YAML string to configure your in-memory Checkpoint, you will also need to import yaml from ruamel:
from ruamel import yaml
You will also need to initialize yaml.YAML(...):
yaml = yaml.YAML(typ="safe")
2. Initialize your Data Context
In the previous section you imported great_expectations in order to get access to your Data Context. The line of code that does this is:
context = gx.get_context()
Checkpoints require a Data Context in order to access the Stores from which they retrieve Expectation Suites and in which they store Validation Results and Metrics, so you will pass context in as a parameter when you initialize your Checkpoint class later.
3. Define your Checkpoint configuration
In addition to a Data Context, you will need a configuration with which to initialize your Checkpoint. This configuration can be in the form of a YAML string or a Python dictionary. The following examples show configurations that are equivalent to the one used by the Getting Started Tutorial.
Normally, a Checkpoint configuration will include the keys class_name and module_name. These are used by Great Expectations to identify the class of Checkpoint that should be initialized with a given configuration. Since we are initializing an instance of the Checkpoint class directly, we don't need the configuration to indicate the class of Checkpoint to be initialized. Therefore, these two keys will be left out of our configuration.
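If you are starting from a configuration that was written for a Checkpoint Store, it may still carry those two keys. The following is a minimal sketch of dropping them before use. The helper name strip_class_keys and the sample values are hypothetical and are not part of Great Expectations.

```python
def strip_class_keys(config: dict) -> dict:
    """Return a shallow copy of config without the class-identifying keys."""
    # Only top-level keys are removed; nested "action" entries keep their
    # own class_name, which the actions still need.
    return {
        key: value
        for key, value in config.items()
        if key not in ("class_name", "module_name")
    }


# Hypothetical configuration as it might appear in a Checkpoint Store.
stored_config = {
    "name": "my_checkpoint",
    "config_version": 1,
    "class_name": "Checkpoint",
    "module_name": "great_expectations.checkpoint",
    "action_list": [
        {
            "name": "store_validation_result",
            "action": {"class_name": "StoreValidationResultAction"},
        }
    ],
}

in_memory_config = strip_class_keys(stored_config)
print(sorted(in_memory_config))
```

The original dictionary is left unchanged, so the same stored configuration can still be used elsewhere.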
Both forms are shown here: the Python dictionary first, followed by the equivalent YAML string.
my_checkpoint_name = "version-0.15.50 in_memory_checkpoint"
python_config = {
    "name": my_checkpoint_name,
    "config_version": 1,
    "run_name_template": "%Y%m%d-%H%M%S-my-run-name-template",
    "action_list": [
        {
            "name": "store_validation_result",
            "action": {"class_name": "StoreValidationResultAction"},
        },
        {
            "name": "store_evaluation_params",
            "action": {"class_name": "StoreEvaluationParametersAction"},
        },
        {
            "name": "update_data_docs",
            "action": {"class_name": "UpdateDataDocsAction", "site_names": []},
        },
    ],
    "validations": [
        {
            "batch_request": {
                "datasource_name": "taxi_datasource",
                "data_connector_name": "default_inferred_data_connector_name",
                "data_asset_name": "yellow_tripdata_sample_2019-01",
                "data_connector_query": {"index": -1},
            },
            "expectation_suite_name": "my_expectation_suite",
        }
    ],
}
my_checkpoint_name = "version-0.15.50 in_memory_checkpoint"
yaml_config = f"""
name: {my_checkpoint_name}
config_version: 1.0
run_name_template: '%Y%m%d-%H%M%S-my-run-name-template'
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: store_evaluation_params
    action:
      class_name: StoreEvaluationParametersAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
      site_names: []
validations:
  - batch_request:
      datasource_name: taxi_datasource
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: yellow_tripdata_sample_2019-01
    expectation_suite_name: my_expectation_suite
"""
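The run_name_template value in both configurations uses strftime-style placeholders. The sketch below uses only the standard library to show how such a template expands for a given time. It assumes the Checkpoint fills in the template from the run time using strftime semantics, and the fixed timestamp is an arbitrary example.

```python
from datetime import datetime, timezone

run_name_template = "%Y%m%d-%H%M%S-my-run-name-template"

# Arbitrary example timestamp; a real run would use the time of the run.
run_time = datetime(2023, 1, 31, 14, 5, 9, tzinfo=timezone.utc)

# Text that is not a strftime directive passes through unchanged.
run_name = run_time.strftime(run_name_template)
print(run_name)  # 20230131-140509-my-run-name-template
```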
When you are tailoring the configuration for your own purposes, you will want to replace the Batch Request and Expectation Suite under the validations key with your own values. You can further edit the configuration to add additional Batch Request and Expectation Suite entries under the validations key. Alternatively, you can replace this configuration entirely and build one from scratch. If you choose to build a configuration from scratch, or to further modify the examples provided above, you may wish to reference our documentation on Checkpoint configurations as you do.
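As one way of adding entries under the validations key, the following sketch appends a second Batch Request and Expectation Suite pair to a copy of a dictionary configuration. The helper name with_extra_validation, the shortened base configuration, the February data asset, and the second suite name are all placeholders for illustration.

```python
import copy


def with_extra_validation(config: dict, batch_request: dict, suite_name: str) -> dict:
    """Return a copy of config with one more entry under "validations"."""
    new_config = copy.deepcopy(config)  # leave the caller's dictionary untouched
    new_config.setdefault("validations", []).append(
        {
            "batch_request": batch_request,
            "expectation_suite_name": suite_name,
        }
    )
    return new_config


# Shortened stand-in for the python_config dictionary shown earlier.
base_config = {
    "name": "in_memory_checkpoint",
    "config_version": 1,
    "validations": [
        {
            "batch_request": {
                "datasource_name": "taxi_datasource",
                "data_connector_name": "default_inferred_data_connector_name",
                "data_asset_name": "yellow_tripdata_sample_2019-01",
            },
            "expectation_suite_name": "my_expectation_suite",
        }
    ],
}

extended_config = with_extra_validation(
    base_config,
    batch_request={
        "datasource_name": "taxi_datasource",
        "data_connector_name": "default_inferred_data_connector_name",
        "data_asset_name": "yellow_tripdata_sample_2019-02",  # placeholder asset
    },
    suite_name="my_other_expectation_suite",  # placeholder suite
)
print(len(extended_config["validations"]))  # 2
```

Working on a deep copy keeps the base configuration reusable, which is convenient when several scripts share one starting point.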
4. Initialize your Checkpoint
Once you have your Data Context and Checkpoint configuration you will be able to initialize a Checkpoint instance in memory. There is a minor variation in how you do so, depending on whether you are using a Python dictionary or a YAML string for your configuration.
If you are using a Python dictionary as your configuration, you will need to unpack it as parameters for the Checkpoint object's initialization. This can be done with the code:
my_checkpoint = Checkpoint(data_context=context, **python_config)
If you are using a YAML string as your configuration, you will need to convert it into a dictionary and unpack it as parameters for the Checkpoint object's initialization. This can be done with the code:
my_checkpoint = Checkpoint(data_context=context, **yaml.load(yaml_config))
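Because the configuration is unpacked straight into keyword arguments, a missing or misspelled key may only surface when the Checkpoint is created or run. A small pre-flight check such as the sketch below can catch that earlier. The function name and the rules it applies are assumptions for illustration: it expects every validation entry to name its own Batch Request and Expectation Suite, as the examples in this guide do, and it is not the validation that Great Expectations itself performs.

```python
def find_config_problems(config: dict) -> list:
    """Return human-readable problems found in a Checkpoint config dict."""
    problems = []
    if not config.get("name"):
        problems.append("missing 'name'")
    validations = config.get("validations", [])
    if not validations:
        problems.append("no entries under 'validations'")
    for index, validation in enumerate(validations):
        for key in ("batch_request", "expectation_suite_name"):
            if key not in validation:
                problems.append(f"validations[{index}] is missing '{key}'")
    return problems


# Minimal stand-in configurations used only to exercise the check.
good = {
    "name": "in_memory_checkpoint",
    "validations": [{"batch_request": {}, "expectation_suite_name": "my_expectation_suite"}],
}
bad = {"validations": [{"batch_request": {}}]}

print(find_config_problems(good))
print(find_config_problems(bad))
```

An empty list means the check found nothing to report; it does not guarantee that the Checkpoint will initialize or run successfully.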
5. Run your Checkpoint
You now have an initialized Checkpoint object in memory. You can use its run(...) method to Validate your data as specified in the configuration. This is done with the line:
results = my_checkpoint.run()
Congratulations! Your script is now ready to be run. Each time you run it, it will initialize and run a Checkpoint in memory, rather than retrieving a Checkpoint configuration from a Checkpoint Store.
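In an automated job you will usually want the script to signal failure when Validation does not pass. The sketch below assumes that the object returned by run() exposes a boolean success attribute; verify this against your installed version. A stand-in object is used so the snippet runs without Great Expectations installed, and the helper name exit_code_for is hypothetical.

```python
from types import SimpleNamespace


def exit_code_for(results) -> int:
    """Map a Checkpoint result to a process exit code (0 means passed)."""
    return 0 if results.success else 1


# Stand-ins for the object returned by my_checkpoint.run().
passed = SimpleNamespace(success=True)
failed = SimpleNamespace(success=False)

print(exit_code_for(passed))  # 0
print(exit_code_for(failed))  # 1
```

In the real script, the value could be passed to sys.exit(...) after my_checkpoint.run() so that a scheduler or CI system can react to failed Validations.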
6. Check your Data Docs
Once you have run your script you can verify that it has worked by checking your Data Docs for new results.
Notes
To view the full example scripts used in this documentation, see: