Checkpoints and Actions
Introduction
As part of the new modular expectations API in Great Expectations, Validation Operators are evolving into Checkpoints. At some point in the future Validation Operators will be fully deprecated.
The batch.validate() method evaluates one Batch of data against one Expectation Suite and returns a dictionary of Validation Results. This is sufficient when you explore your data and get to know Great Expectations. When deploying Great Expectations in a real data pipeline, you will typically discover additional needs:
- Validating a group of Batches that are logically related (for example, a Checkpoint for all staging tables).
- Validating a Batch against several Expectation Suites (for example, running three suites to protect a machine learning model: churn.critical, churn.warning, churn.drift).
- Doing something with the Validation Results (for example, saving them for later review or sending notifications in case of failures).
Checkpoints provide a convenient abstraction for bundling the validation of a Batch (or Batches) of data against an Expectation Suite (or several), as well as the actions that should be taken after the validation. Like Expectation Suites and Validation Results, Checkpoints are managed using a Data Context, and have their own Store which is used to persist their configurations to YAML files. These configurations can be committed to version control and shared with your team.
The classes that implement Checkpoints are in the great_expectations.checkpoint module.
Validation Actions
Actions are Python classes with a run method that takes the result of validating a Batch against an Expectation Suite and does something with it (e.g., save Validation Results to disk, or send a Slack notification). Classes that implement this API can be added to the list of actions used by a particular Checkpoint.
Classes that implement Actions can be found in the great_expectations.checkpoint.actions module.
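The Action contract described above can be sketched without Great Expectations installed. The class below is a hypothetical stand-in, not one of the classes in great_expectations.checkpoint.actions: its name, the exact run signature, and the dictionary shape of the validation result are all assumptions made for illustration.

```python
from typing import Any, Dict, List


class CollectFailuresAction:
    """Hypothetical action: records the names of suites that failed validation."""

    def __init__(self) -> None:
        self.failed_suites: List[str] = []

    def run(self, validation_result: Dict[str, Any]) -> Dict[str, str]:
        # Assumed input shape:
        # {"success": bool, "meta": {"expectation_suite_name": str}}
        if not validation_result["success"]:
            self.failed_suites.append(
                validation_result["meta"]["expectation_suite_name"]
            )
        # Report which class ran, as a minimal action output
        return {"class": type(self).__name__}


action = CollectFailuresAction()
outcome = action.run(
    {"success": False, "meta": {"expectation_suite_name": "churn.critical"}}
)
```

A real Action would typically write to a Store or call an external service instead of appending to an in-memory list.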
Checkpoint configuration
A Checkpoint uses its configuration to determine what data to validate against which Expectation Suite(s), and what actions to perform on the Validation Results. These validations and actions are executed by calling a Checkpoint's run method (analogous to calling validate with a single Batch). Checkpoint configurations are very flexible. At one end of the spectrum, you can specify a complete configuration in a Checkpoint's YAML file and simply call my_checkpoint.run(). At the other end, you can specify a minimal configuration in the YAML file and provide missing keys as kwargs when calling run.
At runtime, a Checkpoint configuration has three required and three optional keys, and is built using a combination of the YAML configuration and any kwargs passed in at runtime:
Required keys
- name: user-selected Checkpoint name (e.g. "staging_tables")
- config_version: version number of the Checkpoint configuration
- validations: a list of dictionaries that describe each validation that is to be executed, including any actions. Each validation dictionary has three required and three optional keys:
  - Required keys
    - batch_request: a dictionary describing the batch of data to validate (to learn more about specifying Batches, see Dividing data assets into Batches)
    - expectation_suite_name: the name of the Expectation Suite to validate the batch of data against
    - action_list: a list of actions to perform after each batch is validated
  - Optional keys
    - name: providing a name will allow referencing the validation inside the run by name (e.g. "user_table_validation")
    - evaluation_parameters: used to define named parameters using Great Expectations Evaluation Parameter syntax
    - runtime_configuration: provided to the Validator's runtime_configuration (e.g. result_format)
Optional keys
- class_name: the class of the Checkpoint to be instantiated, defaults to Checkpoint
- template_name: the name of another Checkpoint to use as a base template
- run_name_template: a template to create run names, using environment variables and datetime-template syntax (e.g. "%Y-%M-staging-$MY_ENV_VAR")
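To make the required keys concrete, here is a minimal sketch that checks a resolved configuration, written as a plain Python dictionary, for missing required keys. The helper missing_keys is hypothetical and not part of Great Expectations. It assumes that a required validation key may instead be supplied once at the top level as a default.

```python
from typing import Any, Dict, List

REQUIRED_TOP_LEVEL = ("name", "config_version", "validations")
REQUIRED_PER_VALIDATION = ("batch_request", "expectation_suite_name", "action_list")


def missing_keys(config: Dict[str, Any]) -> List[str]:
    """Return dotted paths of required keys absent from a resolved config."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in config]
    for index, validation in enumerate(config.get("validations", [])):
        for key in REQUIRED_PER_VALIDATION:
            # A key given at the top level is treated as a default
            # for every validation dictionary
            if key not in validation and key not in config:
                missing.append(f"validations[{index}].{key}")
    return missing


config = {
    "name": "staging_tables",
    "config_version": 1,
    "expectation_suite_name": "my_expectation_suite",  # top-level default
    "validations": [
        {"batch_request": {"data_asset_name": "yellow_tripdata_sample_2019-01"}},
    ],
}
problems = missing_keys(config)
```

Here the only reported problem is the absent action_list, since expectation_suite_name is picked up from the top level.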
Configuration defaults and parameter override behavior
Checkpoint configurations follow a nested pattern, where more general keys provide defaults for more specific ones. For instance, any required validation dictionary keys (e.g. expectation_suite_name) can be specified at the top level (i.e. at the same level as the validations list), serving as runtime defaults. Starting at the earliest reference template, if a configuration key is re-specified, its value can be appended, updated, replaced, or cause an error when redefined.
Replaced
- name
- module_name
- class_name
- run_name_template
- expectation_suite_name
Updated
- batch_request: at runtime, if a key is re-defined, an error will be thrown
- action_list: actions that share the same user-defined name will be updated, otherwise a new action will be appended
- evaluation_parameters
- runtime_configuration
Appended
- action_list: actions that share the same user-defined name will be updated, otherwise a new action will be appended
- validations
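The replace, update, and append rules above can be sketched as a small merge function over plain dictionaries. This is an illustration under stated assumptions, not the library's implementation: merge_configs and merge_action_lists are hypothetical helpers, and the error raised when a batch_request key is re-defined at runtime is deliberately not modelled.

```python
from typing import Any, Dict, List

# Keys whose dictionary values are merged key by key
UPDATED = {"evaluation_parameters", "runtime_configuration"}


def merge_action_lists(
    base: List[Dict[str, Any]], override: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Update actions that share a name; append the rest."""
    merged = [dict(action) for action in base]
    positions = {action["name"]: i for i, action in enumerate(merged)}
    for action in override:
        if action["name"] in positions:
            merged[positions[action["name"]]] = dict(action)
        else:
            positions[action["name"]] = len(merged)
            merged.append(dict(action))
    return merged


def merge_configs(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Apply override on top of base using the rules listed above."""
    merged = dict(base)
    for key, value in override.items():
        if key in UPDATED:
            merged[key] = {**base.get(key, {}), **value}
        elif key == "action_list":
            merged[key] = merge_action_lists(base.get(key, []), value)
        elif key == "validations":
            merged[key] = list(base.get(key, [])) + list(value)
        else:
            # name, class_name, run_name_template, etc. are simply replaced
            merged[key] = value
    return merged


base = {
    "name": "my_base_checkpoint",
    "evaluation_parameters": {"GT_PARAM": 1000, "LT_PARAM": 50000},
    "action_list": [
        {"name": "store_validation_result",
         "action": {"class_name": "StoreValidationResultAction"}},
    ],
}
override = {
    "name": "my_checkpoint",
    "evaluation_parameters": {"GT_PARAM": 2000},
    "action_list": [
        {"name": "update_data_docs", "action": {"class_name": "UpdateDataDocsAction"}},
    ],
    "validations": [{"expectation_suite_name": "my_expectation_suite"}],
}
merged = merge_configs(base, override)
```

In the merged result the name is replaced, GT_PARAM is updated while LT_PARAM is kept, and the new action and validation are appended.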
If the use case calls for instantiating the Checkpoint explicitly, then it is crucial to ensure that only serializable values are passed as arguments to the constructor. Specifically, if batch_request is specified at any level of the hierarchy of the Checkpoint configuration (at the top level and/or as part of the validations list structure), then no runtime batch_request can contain batch_data, only a database query. This is because batch_data is used to specify dataframes (Pandas, Spark), which are not serializable (while database queries are plain text, which is serializable).
The proper mechanism for specifying non-serializable parameters is to pass them dynamically to the Checkpoint run() method. Hence, in a typical scenario, one would instantiate the Checkpoint class with serializable parameters only, while specifying any non-serializable parameters, commonly dataframes, as arguments to the Checkpoint run() method.
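The distinction between serializable and non-serializable batch requests can be demonstrated with the standard library alone. In the sketch below, JSON stands in for the YAML serialization of a Checkpoint configuration, FakeDataFrame stands in for a Pandas or Spark dataframe, and the runtime_parameters key and the query text are assumptions chosen for illustration.

```python
import json


class FakeDataFrame:
    """Stand-in for an in-memory dataframe; it has no plain-text form."""


def is_serializable(batch_request: dict) -> bool:
    """Return True if the batch request can be written out as plain text."""
    try:
        json.dumps(batch_request)
    except TypeError:
        return False
    return True


# A query is plain text, so it can live in a stored configuration
query_request = {"runtime_parameters": {"query": "SELECT * FROM staging.users"}}

# An in-memory object cannot, so it belongs in the arguments to run()
dataframe_request = {"runtime_parameters": {"batch_data": FakeDataFrame()}}

query_ok = is_serializable(query_request)
dataframe_ok = is_serializable(dataframe_request)
```

Only the first request could be persisted with the Checkpoint; the second would have to be supplied when the Checkpoint is run.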
SimpleCheckpoint class
For many use cases, the SimpleCheckpoint class can be used to simplify the process of specifying a Checkpoint configuration. SimpleCheckpoint provides a basic set of actions (store Validation Result, store evaluation parameters, update Data Docs, and optionally, send a Slack notification), allowing you to omit an action_list from your configuration and at runtime.
Configurations using the SimpleCheckpoint class can optionally specify four additional top-level keys that customize and extend the basic set of default actions:
- site_names: a list of Data Docs site names to update as part of the update Data Docs action. Defaults to "all".
- slack_webhook: if provided, an action will be added that sends a Slack notification to the provided webhook.
- notify_on: used to define when a notification is fired, according to Validation Result outcome: all, failure, or success. Defaults to all.
- notify_with: a list of Data Docs site names for which to include a URL in any notifications. Defaults to all.
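As a rough illustration of how these four keys could expand into an action_list, the sketch below builds the default actions as plain dictionaries. The function default_action_list is hypothetical and is not how SimpleCheckpoint is actually implemented. The action names follow the examples on this page, while the webhook URL and the placement of site_names inside the update Data Docs action are assumptions.

```python
from typing import Any, Dict, List, Optional, Union


def default_action_list(
    site_names: Union[str, List[str]] = "all",
    slack_webhook: Optional[str] = None,
    notify_on: str = "all",
    notify_with: Union[str, List[str]] = "all",
) -> List[Dict[str, Any]]:
    """Build a default action list from the four optional top-level keys."""
    update_docs: Dict[str, Any] = {"class_name": "UpdateDataDocsAction"}
    if site_names != "all":
        # Assumption: a restricted list of sites is passed to the action
        update_docs["site_names"] = site_names

    actions: List[Dict[str, Any]] = [
        {"name": "store_validation_result",
         "action": {"class_name": "StoreValidationResultAction"}},
        {"name": "store_evaluation_params",
         "action": {"class_name": "StoreEvaluationParametersAction"}},
        {"name": "update_data_docs", "action": update_docs},
    ]
    if slack_webhook:
        # The Slack action is only added when a webhook is configured
        actions.append({
            "name": "send_slack_notification",
            "action": {
                "class_name": "SlackNotificationAction",
                "slack_webhook": slack_webhook,
                "notify_on": notify_on,
                "notify_with": notify_with,
            },
        })
    return actions


without_slack = default_action_list()
with_slack = default_action_list(
    slack_webhook="https://hooks.example.invalid/hypothetical",
    notify_on="failure",
)
```

Without a webhook the list holds the three storage and documentation actions; with one, a fourth notification action is appended.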
CheckpointResult
The return object of a Checkpoint run is a CheckpointResult object. The run_results attribute forms the backbone of this type and defines the basic contract for what a Checkpoint's run method returns. It is a dictionary where the top-level keys are the ValidationResultIdentifiers of the Validation Results generated in the run. Each value is a dictionary having, at minimum, a validation_result key containing an ExpectationSuiteValidationResult and an actions_results key containing a dictionary where the top-level keys are names of actions performed after that particular validation, with values containing any relevant outputs of that action (at minimum, and in many cases, this would just be a dictionary with the action's class_name).
The run_results dictionary can contain other keys that are relevant for a specific Checkpoint implementation. For example, the run_results dictionary from a WarningAndFailureExpectationSuiteCheckpoint might have an extra key named "expectation_suite_severity_level" to indicate if the suite is at either a "warning" or "failure" level.
CheckpointResult objects include many convenience methods (e.g. list_data_asset_names) that make working with Checkpoint results easier. You can learn more about these methods in the documentation for class: great_expectations.checkpoint.types.checkpoint_result.CheckpointResult.
Below is an example of a CheckpointResult object which itself contains ValidationResult, ExpectationSuiteValidationResult, and CheckpointConfig objects.
Example CheckpointResult:
results = {
    "run_id": RunIdentifier,
    "run_results": {
        ValidationResultIdentifier: {
            "validation_result": ExpectationSuiteValidationResult,
            "actions_results": {
                "<ACTION NAME FOR STORING VALIDATION RESULTS>": {
                    "class": "StoreValidationResultAction"
                }
            },
        }
    },
    "checkpoint_config": CheckpointConfig,
    "success": True,
}
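Given the structure shown above, walking run_results is a matter of iterating over its values. The sketch below uses plain dictionaries as stand-ins for the real result objects. The summarize helper, the identifier string, and the presence of a success field on each validation result are assumptions for illustration.

```python
from typing import Any, Dict, List, Tuple


def summarize(
    run_results: Dict[Any, Dict[str, Any]]
) -> List[Tuple[str, bool, List[str]]]:
    """Return (suite name, success, action names) for each validation."""
    summary = []
    for entry in run_results.values():
        validation_result = entry["validation_result"]
        summary.append(
            (
                validation_result["meta"]["expectation_suite_name"],
                validation_result["success"],
                sorted(entry["actions_results"]),
            )
        )
    return summary


run_results = {
    # A real key would be a ValidationResultIdentifier, not a string
    "hypothetical-identifier-1": {
        "validation_result": {
            "success": True,
            "meta": {"expectation_suite_name": "my_expectation_suite"},
        },
        "actions_results": {
            "store_validation_result": {"class": "StoreValidationResultAction"},
        },
    },
}
rows = summarize(run_results)
```

For routine use, the convenience methods on CheckpointResult are likely a better fit than hand-written traversal.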
Checkpoint configuration default and override behavior
- No nesting
- Nesting with defaults
- Keys passed at runtime
- Using template
- Using SimpleCheckpoint
No nesting
YAML:
name: my_checkpoint
config_version: 1
class_name: Checkpoint
run_name_template: "%Y-%M-foo-bar-template-$VAR"
validations:
  - batch_request:
      datasource_name: taxi_datasource
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: yellow_tripdata_sample_2019-01
    expectation_suite_name: my_expectation_suite
    action_list:
      - name: store_validation_result
        action:
          class_name: StoreValidationResultAction
      - name: store_evaluation_params
        action:
          class_name: StoreEvaluationParametersAction
      - name: update_data_docs
        action:
          class_name: UpdateDataDocsAction
    evaluation_parameters:
      GT_PARAM: 1000
      LT_PARAM: 50000
    runtime_configuration:
      result_format:
        result_format: BASIC
        partial_unexpected_count: 20
Runtime:
results = context.run_checkpoint(checkpoint_name="my_checkpoint")
Nesting with defaults
YAML:
name: my_checkpoint
config_version: 1
class_name: Checkpoint
run_name_template: "%Y-%M-foo-bar-template-$VAR"
validations:
  - batch_request:
      datasource_name: taxi_datasource
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: yellow_tripdata_sample_2019-01
  - batch_request:
      datasource_name: taxi_datasource
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: yellow_tripdata_sample_2019-02
expectation_suite_name: my_expectation_suite
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: store_evaluation_params
    action:
      class_name: StoreEvaluationParametersAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
evaluation_parameters:
  GT_PARAM: 1000
  LT_PARAM: 50000
runtime_configuration:
  result_format:
    result_format: BASIC
    partial_unexpected_count: 20
Runtime:
results = context.run_checkpoint(checkpoint_name="my_checkpoint")
Results:
first_validation_result = list(results.run_results.items())[0][1]["validation_result"]
second_validation_result = list(results.run_results.items())[1][1]["validation_result"]
first_expectation_suite = first_validation_result["meta"]["expectation_suite_name"]
first_data_asset = first_validation_result["meta"]["active_batch_definition"][
"data_asset_name"
]
second_expectation_suite = second_validation_result["meta"]["expectation_suite_name"]
second_data_asset = second_validation_result["meta"]["active_batch_definition"][
"data_asset_name"
]
assert first_expectation_suite == "my_expectation_suite"
assert first_data_asset == "yellow_tripdata_sample_2019-01"
assert second_expectation_suite == "my_expectation_suite"
assert second_data_asset == "yellow_tripdata_sample_2019-02"
print(first_expectation_suite)
my_expectation_suite
print(first_data_asset)
yellow_tripdata_sample_2019-01
print(second_expectation_suite)
my_expectation_suite
print(second_data_asset)
yellow_tripdata_sample_2019-02
Keys passed at runtime
YAML:
name: my_base_checkpoint
config_version: 1
class_name: Checkpoint
run_name_template: "%Y-%M-foo-bar-template-$VAR"
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: store_evaluation_params
    action:
      class_name: StoreEvaluationParametersAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
evaluation_parameters:
  GT_PARAM: 1000
  LT_PARAM: 50000
runtime_configuration:
  result_format:
    result_format: BASIC
    partial_unexpected_count: 20
Runtime:
results = context.run_checkpoint(
    checkpoint_name="my_base_checkpoint",
    validations=[
        {
            "batch_request": {
                "datasource_name": "taxi_datasource",
                "data_connector_name": "default_inferred_data_connector_name",
                "data_asset_name": "yellow_tripdata_sample_2019-01",
            },
            "expectation_suite_name": "my_expectation_suite",
        },
        {
            "batch_request": {
                "datasource_name": "taxi_datasource",
                "data_connector_name": "default_inferred_data_connector_name",
                "data_asset_name": "yellow_tripdata_sample_2019-02",
            },
            "expectation_suite_name": "my_other_expectation_suite",
        },
    ],
)
Results:
first_validation_result = list(results.run_results.items())[0][1]["validation_result"]
second_validation_result = list(results.run_results.items())[1][1]["validation_result"]
first_expectation_suite = first_validation_result["meta"]["expectation_suite_name"]
first_data_asset = first_validation_result["meta"]["active_batch_definition"][
"data_asset_name"
]
second_expectation_suite = second_validation_result["meta"]["expectation_suite_name"]
second_data_asset = second_validation_result["meta"]["active_batch_definition"][
"data_asset_name"
]
assert first_expectation_suite == "my_expectation_suite"
assert first_data_asset == "yellow_tripdata_sample_2019-01"
assert second_expectation_suite == "my_other_expectation_suite"
assert second_data_asset == "yellow_tripdata_sample_2019-02"
print(first_expectation_suite)
my_expectation_suite
print(first_data_asset)
yellow_tripdata_sample_2019-01
print(second_expectation_suite)
my_other_expectation_suite
print(second_data_asset)
yellow_tripdata_sample_2019-02
Using template
YAML:
name: my_checkpoint
config_version: 1
class_name: Checkpoint
template_name: my_base_checkpoint
validations:
  - batch_request:
      datasource_name: taxi_datasource
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: yellow_tripdata_sample_2019-01
    expectation_suite_name: my_expectation_suite
  - batch_request:
      datasource_name: taxi_datasource
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: yellow_tripdata_sample_2019-02
    expectation_suite_name: my_other_expectation_suite
Runtime:
results = context.run_checkpoint(checkpoint_name="my_checkpoint")
Results:
first_validation_result = list(results.run_results.items())[0][1]["validation_result"]
second_validation_result = list(results.run_results.items())[1][1]["validation_result"]
first_expectation_suite = first_validation_result["meta"]["expectation_suite_name"]
first_data_asset = first_validation_result["meta"]["active_batch_definition"][
"data_asset_name"
]
second_expectation_suite = second_validation_result["meta"]["expectation_suite_name"]
second_data_asset = second_validation_result["meta"]["active_batch_definition"][
"data_asset_name"
]
assert first_expectation_suite == "my_expectation_suite"
assert first_data_asset == "yellow_tripdata_sample_2019-01"
assert second_expectation_suite == "my_other_expectation_suite"
assert second_data_asset == "yellow_tripdata_sample_2019-02"
print(first_expectation_suite)
my_expectation_suite
print(first_data_asset)
yellow_tripdata_sample_2019-01
print(second_expectation_suite)
my_other_expectation_suite
print(second_data_asset)
yellow_tripdata_sample_2019-02
YAML, using SimpleCheckpoint:
name: my_checkpoint
config_version: 1
class_name: SimpleCheckpoint
validations:
  - batch_request:
      datasource_name: taxi_datasource
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: yellow_tripdata_sample_2019-01
    expectation_suite_name: my_expectation_suite
site_names: all
slack_webhook: <YOUR SLACK WEBHOOK URL>
notify_on: failure
notify_with: all
Runtime:
results = context.run_checkpoint(checkpoint_name="my_checkpoint")
Results:
validation_result = list(results.run_results.items())[0][1]["validation_result"]
expectation_suite = validation_result["meta"]["expectation_suite_name"]
data_asset = validation_result["meta"]["active_batch_definition"]["data_asset_name"]
assert expectation_suite == "my_expectation_suite"
assert data_asset == "yellow_tripdata_sample_2019-01"
print(expectation_suite)
my_expectation_suite
print(data_asset)
yellow_tripdata_sample_2019-01
Equivalent YAML, using Checkpoint:
name: my_checkpoint
config_version: 1
class_name: Checkpoint
validations:
  - batch_request:
      datasource_name: taxi_datasource
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: yellow_tripdata_sample_2019-01
    expectation_suite_name: my_expectation_suite
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: store_evaluation_params
    action:
      class_name: StoreEvaluationParametersAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
  - name: send_slack_notification
    action:
      class_name: SlackNotificationAction
      slack_webhook: <YOUR SLACK WEBHOOK URL>
      notify_on: failure
      notify_with: all
      renderer:
        module_name: great_expectations.render.renderer.slack_renderer
        class_name: SlackRenderer
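Runtime:
results = context.run_checkpoint(checkpoint_name="my_checkpoint")
Results:
validation_result = list(results.run_results.items())[0][1]["validation_result"]
expectation_suite = validation_result["meta"]["expectation_suite_name"]
data_asset = validation_result["meta"]["active_batch_definition"]["data_asset_name"]
assert expectation_suite == "my_expectation_suite"
assert data_asset == "yellow_tripdata_sample_2019-01"
print(expectation_suite)
my_expectation_suite
print(data_asset)
yellow_tripdata_sample_2019-01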
Additional Notes
To view the full script used in this page, see it on GitHub: