Create an Expectation Suite with the Onboarding Data Assistant
This guide shows you how to use the Onboarding Data Assistant to Profile your data and automatically generate an Expectation Suite.
All the code used in the examples is available on GitHub: how_to_create_an_expectation_suite_with_the_onboarding_data_assistant.py.
Prerequisites
- A configured Data Context.
- An understanding of how to configure a Datasource.
- An understanding of how to configure a Batch Request.
Prepare your Batch Request
The following examples use a Batch Request that contains multiple Batches. The Datasource that the Batch Request queries holds existing New York taxi trip data.
This is the Datasource configuration:
context.sources.add_pandas_filesystem(
    "taxi_multi_batch_datasource",
    base_directory="./data",  # replace with your data directory
).add_csv_asset(
    "all_years",
    batching_regex=r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv",
)
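To see how the batching_regex splits files into Batches, the following sketch applies the same pattern to a few file names with Python's re module. The file names and the batch_options helper are made up for illustration, and the snippet only mimics the matching step; it does not call Great Expectations.

```python
import re
from typing import Optional

# Same pattern as the batching_regex in the Datasource configuration above.
BATCHING_REGEX = re.compile(
    r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
)

# Hypothetical directory listing; replace with the files in your data directory.
file_names = [
    "yellow_tripdata_sample_2019-01.csv",
    "yellow_tripdata_sample_2020-12.csv",
    "README.md",
]


def batch_options(file_name: str) -> Optional[dict]:
    """Return the named groups for a matching file, or None if it does not match."""
    match = BATCHING_REGEX.fullmatch(file_name)
    return match.groupdict() if match else None


# Map every file name to the year/month values captured by the named groups.
matched = {name: batch_options(name) for name in file_names}
for name, options in matched.items():
    print(name, "->", options)
```

Each matching file becomes one Batch, and the named groups (year and month) are the values that identify it. Files that do not match the pattern are ignored.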
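This is the BatchRequest configuration:
# Assumed import paths for the fluent Datasource API; verify them against your installed version.
from great_expectations.datasource.fluent.batch_request import BatchRequest
from great_expectations.datasource.fluent.interfaces import DataAsset

all_years_asset: DataAsset = context.datasources[
    "taxi_multi_batch_datasource"
].get_asset("all_years")
multi_batch_all_years_batch_request: BatchRequest = (
    all_years_asset.build_batch_request()
)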
The Onboarding Data Assistant runs multiple queries against your Datasource. Data Assistant performance can vary significantly depending on the number of Batches, the number of records per Batch, and network latency. If Data Assistant runtimes are too long, use a smaller BatchRequest. You can also run the Onboarding Data Assistant on a single Batch when you expect the number of null records to be similar across Batches.
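One way to judge whether a single Batch is representative is to compare the share of null values per column across Batches before you run the assistant. The sketch below does this with the standard library on two tiny in-memory CSV samples. The sample data, the null_fraction helper, and the 0.05 tolerance are illustrative assumptions, not part of Great Expectations.

```python
import csv
import io

# Hypothetical miniature Batches; real Batches would be read from your CSV files.
batches = {
    "2019-01": "passenger_count,trip_distance\n1,2.5\n,3.1\n2,0.9\n1,\n",
    "2019-02": "passenger_count,trip_distance\n3,1.2\n1,4.0\n,2.2\n2,\n",
}


def null_fraction(csv_text: str) -> dict:
    """Return the fraction of empty cells per column for one CSV sample."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {
        column: sum(1 for row in rows if row[column] == "") / len(rows)
        for column in rows[0].keys()
    }


fractions = {name: null_fraction(text) for name, text in batches.items()}

# Assumed tolerance: treat Batches as similar if null fractions differ by at most 0.05.
TOLERANCE = 0.05
column_names = next(iter(fractions.values())).keys()
similar = all(
    max(f[c] for f in fractions.values()) - min(f[c] for f in fractions.values())
    <= TOLERANCE
    for c in column_names
)
print(fractions)
print("Null fractions similar across Batches:", similar)
```

If the fractions differ noticeably between Batches, a single-Batch run is more likely to produce null-related Expectations that do not hold for the rest of the data.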
Prepare a new Expectation Suite
Run the following code to prepare a new Expectation Suite with the Data Context add_or_update_expectation_suite(...) method:
expectation_suite_name = "my_onboarding_assistant_suite"
expectation_suite = context.add_or_update_expectation_suite(
    expectation_suite_name=expectation_suite_name
)
Run the Onboarding Data Assistant
To run a Data Assistant, call the assistant's run(...) method. The run(...) method of the Onboarding Data Assistant accepts numerous parameters. For instance, the exclude_column_names parameter lets you define the columns that should not be Profiled.
- Run the following code to define the columns to exclude:
exclude_column_names = [
    "VendorID",
    "pickup_datetime",
    "dropoff_datetime",
    "RatecodeID",
    "PULocationID",
    "DOLocationID",
    "payment_type",
    "fare_amount",
    "extra",
    "mta_tax",
    "tip_amount",
    "tolls_amount",
    "improvement_surcharge",
    "congestion_surcharge",
]
- Run the following code to run the Onboarding Data Assistant:
data_assistant_result = context.assistants.onboarding.run(
    batch_request=multi_batch_all_years_batch_request,
    exclude_column_names=exclude_column_names,
)
In this example, context is your Data Context instance.
Note: If you consider your BatchRequest data valid and want to produce Expectations with ranges that are identical to the data in the BatchRequest, you don't need to alter the example code. The example uses the default value of the estimation parameter ("exact"). To identify potential outliers in your BatchRequest data, pass estimation="flag_outliers" to the run(...) method.
Note: The Onboarding Data Assistant run(...) method can accept other parameters in addition to exclude_column_names, such as include_column_names, include_column_name_suffixes, and cardinality_limit_mode. To view the available parameters, see this information.
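Because exclude_column_names removes columns from Profiling, it is worth confirming which columns remain before you start a long run. The sketch below does this with plain Python. The all_columns list is an assumed schema for the taxi sample files and should be replaced with the columns in your own data.

```python
# Assumed column list for the taxi sample data; verify it against your files.
all_columns = [
    "VendorID", "pickup_datetime", "dropoff_datetime", "passenger_count",
    "trip_distance", "RatecodeID", "store_and_fwd_flag", "PULocationID",
    "DOLocationID", "payment_type", "fare_amount", "extra", "mta_tax",
    "tip_amount", "tolls_amount", "improvement_surcharge", "total_amount",
    "congestion_surcharge",
]

# Same exclusions as in the example above.
exclude_column_names = [
    "VendorID", "pickup_datetime", "dropoff_datetime", "RatecodeID",
    "PULocationID", "DOLocationID", "payment_type", "fare_amount", "extra",
    "mta_tax", "tip_amount", "tolls_amount", "improvement_surcharge",
    "congestion_surcharge",
]

excluded = set(exclude_column_names)

# Excluded names that do not exist in the data usually indicate a typo.
unknown = sorted(excluded - set(all_columns))

# Columns that remain after the exclusions, in their original order.
profiled_columns = [c for c in all_columns if c not in excluded]

print("Unknown excluded columns:", unknown)
print("Columns left for Profiling:", profiled_columns)
```

With the assumed schema, four columns are left for Profiling. The same list could presumably be passed as include_column_names instead of the longer exclusion list, but confirm that against the parameter reference for your version.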
Save your Expectation Suite
- After executing the Onboarding Data Assistant run(...) method and generating Expectations for your data, run the following code to load them into your Expectation Suite:
expectation_suite = data_assistant_result.get_expectation_suite(
    expectation_suite_name=expectation_suite_name
)
- Run the following code to save the Expectation Suite:
context.add_or_update_expectation_suite(expectation_suite=expectation_suite)
Test your Expectation Suite
Run the following code to create a Checkpoint that validates the Batch Request you defined against your Expectation Suite:
checkpoint = context.add_or_update_checkpoint(
    name=f"yellow_tripdata_sample_{expectation_suite_name}",
    validations=[
        {
            "batch_request": multi_batch_all_years_batch_request,
            "expectation_suite_name": expectation_suite_name,
        }
    ],
)
checkpoint_result = checkpoint.run()
assert checkpoint_result["success"] is True
You can check the "success" key of the Checkpoint's results to verify that your Expectation Suite worked.
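When "success" is False, it helps to know which Expectations failed. The sketch below walks a hand-written dictionary that stands in for a Checkpoint result. The nested key names (run_results, validation_result, results, expectation_config) are assumptions about the result layout and should be checked against the output of your own Checkpoint run; failed_expectations is a hypothetical helper.

```python
# Hand-written stand-in for a Checkpoint result; the key names are assumptions.
sample_result = {
    "success": False,
    "run_results": {
        "validation-1": {
            "validation_result": {
                "success": False,
                "results": [
                    {
                        "success": True,
                        "expectation_config": {
                            "expectation_type": "expect_column_values_to_not_be_null",
                            "kwargs": {"column": "passenger_count"},
                        },
                    },
                    {
                        "success": False,
                        "expectation_config": {
                            "expectation_type": "expect_column_values_to_be_between",
                            "kwargs": {"column": "trip_distance"},
                        },
                    },
                ],
            }
        }
    },
}


def failed_expectations(result: dict) -> list:
    """Collect (expectation type, column) pairs for every failed Expectation."""
    failures = []
    for run in result["run_results"].values():
        for outcome in run["validation_result"]["results"]:
            if not outcome["success"]:
                config = outcome["expectation_config"]
                failures.append(
                    (config["expectation_type"], config["kwargs"].get("column"))
                )
    return failures


failures = failed_expectations(sample_result)
print(failures)
```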
Plot and inspect Metrics and Expectations
- Run the following code to view Batch-level visualizations of the Metrics computed by the Onboarding Data Assistant:
data_assistant_result.plot_metrics()
Note: Hover over a data point to view more information about the Batch and its calculated Metric value.
- Run the following code to view all Metrics computed by the Onboarding Data Assistant:
data_assistant_result.metrics_by_domain
- Run the following code to plot the Expectations and the associated Metrics calculated by the Onboarding Data Assistant:
data_assistant_result.plot_expectations_and_metrics()
Note: The Expectation and the Metric are not visualized by the plot_expectations_and_metrics() method when an Expectation is not produced by the Onboarding Data Assistant for a given Metric.
- Run the following code to view the Expectations produced and grouped by Domain:
data_assistant_result.show_expectations_by_domain_type(
    expectation_suite_name=expectation_suite_name
)
- Run the following code to view the Expectations produced and grouped by Expectation type:
data_assistant_result.show_expectations_by_expectation_type(
    expectation_suite_name=expectation_suite_name
)
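To make the two groupings concrete, the sketch below groups a handful of Expectation configurations first by Expectation type and then by column. The configurations are hypothetical and written in the JSON-style dictionary form. Grouping by column is a simplified stand-in for grouping by Domain and is not the assistant's own implementation.

```python
from collections import defaultdict

# Hypothetical Expectation configurations in JSON-style dictionary form.
expectations = [
    {
        "expectation_type": "expect_column_values_to_not_be_null",
        "kwargs": {"column": "passenger_count"},
    },
    {
        "expectation_type": "expect_column_min_to_be_between",
        "kwargs": {"column": "passenger_count", "min_value": 0, "max_value": 1},
    },
    {
        "expectation_type": "expect_column_values_to_not_be_null",
        "kwargs": {"column": "trip_distance"},
    },
    {
        "expectation_type": "expect_table_row_count_to_be_between",
        "kwargs": {"min_value": 10000, "max_value": 10000},
    },
]

by_type = defaultdict(list)
by_column = defaultdict(list)
for expectation in expectations:
    # Table-level Expectations have no column, so use a placeholder label.
    column = expectation["kwargs"].get("column", "<table>")
    by_type[expectation["expectation_type"]].append(column)
    by_column[column].append(expectation["expectation_type"])

print(dict(by_type))
print(dict(by_column))
```

Grouping by type shows which columns share a rule. Grouping by column shows everything that is asserted about one part of the data.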
Edit your Expectation Suite (Optional)
The Onboarding Data Assistant creates as many Expectations as it can for the permitted columns. Although this can help with data analysis, some of these Expectations might be unnecessary. You might also have domain knowledge that is not reflected in the data sampled during Profiling. In these scenarios, you can edit your Expectation Suite to better align with your business requirements.
To edit an existing Expectation Suite, see Edit an Expectation Suite.
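As an illustration of applying domain knowledge, the sketch below adjusts the range of one generated Expectation. It works on a hypothetical Expectation in JSON-style dictionary form rather than on a live Expectation Suite. The with_updated_kwargs helper and the business rule are assumptions made for the example.

```python
import copy

# Hypothetical Expectation, as the assistant might generate it from sampled data.
generated = {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {"column": "passenger_count", "min_value": 0, "max_value": 6},
}


def with_updated_kwargs(expectation: dict, **changes) -> dict:
    """Return a copy of the Expectation with the given kwargs replaced."""
    edited = copy.deepcopy(expectation)
    edited["kwargs"].update(changes)
    return edited


# Assumed business rule: vehicles may carry up to 8 passengers,
# even though the sampled data never exceeded 6.
edited = with_updated_kwargs(generated, max_value=8)
print(edited)
```

Working on a copy keeps the generated Expectation available for comparison, which is useful when you review what the edit changed.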