Create an Expectation Suite with the Missingness Data Assistant
Missingness Data Assistant functionality is Experimental.
Use the information provided here to learn how you can use the Missingness Data Assistant to profile your data and automate the creation of an Expectation Suite.
All the code used in the examples is available in GitHub at this location: how_to_create_an_expectation_suite_with_the_missingness_data_assistant.py.
Prerequisites
- A configured Data Context.
- An understanding of how to configure a Datasource.
- An understanding of how to configure a Batch Request.
Prepare your Datasource and Validator
In the following examples, you'll be using existing New York taxi trip data to create a Validator.
This is the Datasource configuration:
import great_expectations as gx

# Load your configured Data Context (see Prerequisites).
context = gx.get_context()

datasource = context.sources.add_pandas_filesystem(
    # Custom name for the new Datasource; use it to retrieve the Datasource later.
    name="taxi_multi_batch_datasource",
    # Replace with your data directory.
    base_directory="./data",
)
This is the Validator configuration:
validator = datasource.read_csv(
    # Custom name for the asset; use it to retrieve the asset later.
    asset_name="all_years",
    # One Batch per file; the named groups capture the year and month.
    batching_regex=r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv",
)
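To see what the batching_regex above selects, the following minimal sketch applies the same pattern to a few file names using only Python's re module. The file names are hypothetical, and re.fullmatch is used here for simplicity; how Great Expectations applies the pattern internally may differ.

```python
import re

# Same pattern as the batching_regex in the Validator configuration.
BATCHING_REGEX = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"

# Hypothetical file names, used only for illustration.
file_names = [
    "yellow_tripdata_sample_2019-01.csv",
    "yellow_tripdata_sample_2020-12.csv",
    "green_tripdata_sample_2019-01.csv",  # different prefix, not matched
    "yellow_tripdata_sample_2019-1.csv",  # one-digit month, not matched
]

pattern = re.compile(BATCHING_REGEX)

# Map each matching file name to its captured year and month.
batches = {}
for name in file_names:
    match = pattern.fullmatch(name)
    if match:
        batches[name] = match.groupdict()

for name, groups in batches.items():
    print(name, groups)
```

Each matching file yields one year and month pair, which is the information the named groups make available for identifying a Batch.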
The Missingness Data Assistant runs multiple queries against your Datasource. Data Assistant performance can vary significantly depending on the number of Batches, the number of records per Batch, and network latency. If Data Assistant runtimes are too long, use a subset of your data when defining your Datasource and Validator.
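One possible way to work with a subset is to narrow the batching pattern so that it matches fewer files. The sketch below compares the all-years pattern with a variant restricted to 2019. The file names and the ONLY_2019 pattern are assumptions made for illustration, and the matching is done with plain Python rather than with Great Expectations itself.

```python
import re

# Pattern from the Validator configuration: matches every year.
ALL_YEARS = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
# Hypothetical narrower pattern: matches files from 2019 only.
ONLY_2019 = r"yellow_tripdata_sample_(?P<year>2019)-(?P<month>\d{2})\.csv"

# Hypothetical directory listing: three years of monthly files.
file_names = [
    f"yellow_tripdata_sample_{year}-{month:02d}.csv"
    for year in (2018, 2019, 2020)
    for month in range(1, 13)
]


def matching(pattern, names):
    """Return the names that the pattern matches in full."""
    return [name for name in names if re.fullmatch(pattern, name)]


all_batches = matching(ALL_YEARS, file_names)
subset_batches = matching(ONLY_2019, file_names)

print(len(all_batches), "files match the all-years pattern")
print(len(subset_batches), "files match the 2019-only pattern")
```

Fewer matching files means fewer Batches for the Data Assistant to query.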
Run the Missingness Data Assistant
To run a Data Assistant, you can call the run(...) method for the assistant. There are numerous parameters available for the run(...) method of the Missingness Data Assistant. For instance, the exclude_column_names parameter allows you to define the columns that should not be Profiled.
- Run the following code to define the columns to exclude:
exclude_column_names = [
    "VendorID",
    "pickup_datetime",
    "dropoff_datetime",
    "RatecodeID",
    "PULocationID",
    "DOLocationID",
    "payment_type",
    "fare_amount",
    "extra",
    "mta_tax",
    "tip_amount",
    "tolls_amount",
    "improvement_surcharge",
    "congestion_surcharge",
]
- Run the following code to run the Missingness Data Assistant:
data_assistant_result = context.assistants.missingness.run(
    validator=validator,
    exclude_column_names=exclude_column_names,
)
In this example, context is your Data Context instance.
Note: The example code uses the default estimation parameter ("exact"). If you consider your data to be valid, and want to produce Expectations with ranges that are identical to the data in the Validator, you don't need to alter the example code. To identify potential outliers in your BatchRequest data, pass estimation="flag_outliers" to the run(...) method.
Note: The Missingness Data Assistant run(...) method can accept other parameters in addition to exclude_column_names, such as include_column_names, include_column_name_suffixes, and cardinality_limit_mode. To view the available parameters, see this information.
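To check which columns are left for Profiling after the exclusions, the following sketch subtracts the excluded names from a full column list. The ALL_COLUMNS list is an assumption about the taxi sample schema and is not taken from this guide; replace it with the columns of your own data. The resulting list is also the kind of value you might pass as include_column_names, assuming that parameter selects columns in the complementary way.

```python
# Assumed full column list for the taxi sample data (hypothetical; verify
# against your own files before relying on it).
ALL_COLUMNS = [
    "VendorID", "pickup_datetime", "dropoff_datetime", "passenger_count",
    "trip_distance", "RatecodeID", "store_and_fwd_flag", "PULocationID",
    "DOLocationID", "payment_type", "fare_amount", "extra", "mta_tax",
    "tip_amount", "tolls_amount", "improvement_surcharge", "total_amount",
    "congestion_surcharge",
]

# Same exclusion list as in the example above.
exclude_column_names = [
    "VendorID", "pickup_datetime", "dropoff_datetime", "RatecodeID",
    "PULocationID", "DOLocationID", "payment_type", "fare_amount", "extra",
    "mta_tax", "tip_amount", "tolls_amount", "improvement_surcharge",
    "congestion_surcharge",
]

# Keep the original column order while removing the excluded names.
excluded = set(exclude_column_names)
remaining_columns = [name for name in ALL_COLUMNS if name not in excluded]

print(remaining_columns)
```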
Save your Expectation Suite
After executing the Missingness Data Assistant's run(...) method and generating Expectations for your data, run the following code to generate an Expectation Suite and save it to your Validator:
validator.expectation_suite = data_assistant_result.get_expectation_suite(
    expectation_suite_name="my_custom_expectation_suite_name"
)
validator.save_expectation_suite(discard_failed_expectations=False)
Test your Expectation Suite
Run the following code to create a Checkpoint that validates your data against the Expectation Suite, using the Validator that you defined:
checkpoint = context.add_or_update_checkpoint(
    name="yellow_tripdata_sample_all_years_checkpoint",
    validator=validator,
)
checkpoint_result = checkpoint.run()
assert checkpoint_result["success"] is True
You can check the "success" key of the Checkpoint's results to verify that your Expectation Suite worked.
Plot and inspect Metrics and Expectations
- Run the following code to view Batch-level visualizations of the Metrics computed by the Missingness Data Assistant:
data_assistant_result.plot_metrics()
Note: Hover over a data point to view more information about the Batch and its calculated Metric value.
- Run the following command to view the Expectations produced and grouped by Expectation type:
data_assistant_result.show_expectations_by_expectation_type()
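To make the idea of a Batch-level missingness Metric more concrete, the sketch below computes the fraction of empty values per column for two small, hypothetical Batches held in memory. It is an illustration in plain Python only; it is not how the Data Assistant computes its Metrics, and the Metric names and values that Great Expectations reports may differ.

```python
import csv
import io

# Hypothetical in-memory stand-ins for two monthly Batches.
BATCHES = {
    "2019-01": "passenger_count,trip_distance\n1,2.5\n,3.1\n2,\n1,0.8\n",
    "2019-02": "passenger_count,trip_distance\n1,1.0\n3,4.2\n",
}


def null_fraction_by_column(csv_text):
    """Return {column: fraction of empty cells} for one CSV Batch."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    return {
        column: sum(1 for row in rows if row[column] == "") / len(rows)
        for column in rows[0]
    }


# One dictionary of per-column fractions for each Batch.
metrics = {
    batch: null_fraction_by_column(text) for batch, text in BATCHES.items()
}

for batch, fractions in metrics.items():
    print(batch, fractions)
```

Comparing such per-Batch values over time is the kind of view that the plots above provide for the Metrics the assistant actually computes.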
Edit your Expectation Suite (Optional)
The Missingness Data Assistant creates as many Expectations as it can for the permitted columns. Although this can help with data analysis, it might be unnecessary. You might have some domain knowledge that is not reflected in the data that was sampled for the Profiling process. In these types of scenarios, you can edit your Expectation Suite to better align with your business requirements.
Run the following command in your terminal to edit an existing Expectation Suite:
great_expectations suite edit <expectation_suite_name>
A Jupyter Notebook will open. You can review, edit, and save changes to the Expectation Suite.