Validate data with Expectations and Checkpoints
This guide will help you pass an in-memory DataFrame to a Checkpoint (the primary means for validating data in a production deployment of Great Expectations) that is defined at runtime. This is especially useful if you already have your data in memory due to an existing process such as a pipeline runner.
The full script used in the following code examples is available on GitHub: how_to_pass_an_in_memory_dataframe_to_a_checkpoint.py.
Set up Great Expectations
Run the following code to import the required libraries and load your Data Context:
import pandas
import great_expectations as gx
context = gx.get_context()
Read a DataFrame and create a Checkpoint
The following example uses the read_* method on the PandasDatasource to directly return a Validator (used to run an Expectation Suite against data). To use Validators to interactively build an Expectation Suite, see How to create Expectations interactively in Python. The Validator can be passed directly to a Checkpoint:
df = pandas.read_csv("./data/yellow_tripdata_sample_2019-01.csv")
validator = context.sources.add_pandas("taxi_datasource").read_dataframe(
    df, asset_name="taxi_frame", batch_metadata={"year": "2019", "month": "01"}
)
# Saving the suite allows the Checkpoint to reference it.
validator.save_expectation_suite()
checkpoint = context.add_or_update_checkpoint(
    name="my_taxi_validator_checkpoint", validator=validator
)
checkpoint_result = checkpoint.run()
Alternatively, you can use add_* methods to add the asset and then retrieve a Batch Request (provided to a Datasource in order to create a Batch). This method is consistent with how other Data Assets work, and can integrate in-memory data with other Batch Request workflows and configurations.
dataframe_asset = context.sources.add_pandas(
    "my_taxi_validator_checkpoint"
).add_dataframe_asset(
    name="taxi_frame", dataframe=df, batch_metadata={"year": "2019", "month": "01"}
)
context.add_or_update_expectation_suite("my_expectation_suite")
batch_request = dataframe_asset.build_batch_request()
checkpoint = context.add_or_update_checkpoint(
    name="my_taxi_dataframe_checkpoint",
    batch_request=batch_request,
    expectation_suite_name="my_expectation_suite",
)
checkpoint_result = checkpoint.run()
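Because the DataFrame usually comes from an existing pipeline, a natural next step is to stop that pipeline when validation fails. The following is a minimal sketch and is not part of the original script: require_success and ValidationFailedError are hypothetical names, and the code assumes only that the object returned by checkpoint.run() exposes a boolean success attribute. Stand-in objects are used so that the sketch runs without a Data Context.

```python
from types import SimpleNamespace


class ValidationFailedError(RuntimeError):
    """Hypothetical error type raised when a Checkpoint run does not succeed."""


def require_success(checkpoint_result, checkpoint_name="checkpoint"):
    """Return the result unchanged if validation passed, otherwise raise.

    Assumption: checkpoint_result has a boolean `success` attribute.
    """
    if not checkpoint_result.success:
        raise ValidationFailedError(f"Validation failed for {checkpoint_name!r}")
    return checkpoint_result


# Stand-ins for real results; in the guide's script you would pass the
# checkpoint_result returned by checkpoint.run() instead.
passing = SimpleNamespace(success=True)
failing = SimpleNamespace(success=False)

require_success(passing, "my_taxi_dataframe_checkpoint")  # passes through
try:
    require_success(failing, "my_taxi_dataframe_checkpoint")
except ValidationFailedError as err:
    print(err)
```

Raising an exception, rather than returning a flag, lets most pipeline runners mark the step as failed without extra handling.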
In both Checkpoint examples, batch_metadata is an optional parameter that can associate metadata with the batch or DataFrame. When you work with DataFrames, this can help you distinguish Validation results.
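When many DataFrames pass through the same Checkpoint, the metadata can be derived rather than typed by hand. The sketch below is an illustration only: metadata_from_filename is a hypothetical helper, and it assumes file names end in _YYYY-MM.csv, as the sample file in this guide does. The dictionary it returns has the same shape as the batch_metadata argument used above.

```python
import re
from pathlib import Path

# Assumed naming convention: "<anything>_YYYY-MM.csv"
_PERIOD_PATTERN = re.compile(r"_(?P<year>\d{4})-(?P<month>\d{2})\.csv$")


def metadata_from_filename(path):
    """Build a batch_metadata-style dict from a file name (hypothetical helper)."""
    match = _PERIOD_PATTERN.search(Path(path).name)
    if match is None:
        raise ValueError(f"Cannot find a YYYY-MM period in {path!r}")
    # Keep the values as strings, matching the metadata used in this guide.
    return {"year": match["year"], "month": match["month"]}


batch_metadata = metadata_from_filename("./data/yellow_tripdata_sample_2019-01.csv")
print(batch_metadata)  # {'year': '2019', 'month': '01'}
```

The result can then be passed as batch_metadata when the DataFrame is read or the asset is added.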