How to create a new Expectation Suite using Rule Based Profilers
In this tutorial, you will develop hands-on experience with configuring a Rule-Based Profiler to create an Expectation Suite. You will Profile several Batches of NYC yellow taxi trip data to come up with reasonable estimates for the ranges of Expectations for several numeric columns.
Please note that the Rule-Based Profiler is currently under development and is considered an experimental feature. While the contents of this document accurately reflect the current state of the feature, they are subject to change.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- A basic understanding of Metrics in Great Expectations
- A basic understanding of Expectation Configurations in Great Expectations
- Read the overview of Profilers and the section on Rule-Based Profilers in particular
Steps
1. Create a new Great Expectations project
- Create a new directory called `taxi_profiling_tutorial`
- Within this directory, create another directory called `data`
- Navigate to the top level of `taxi_profiling_tutorial` in a terminal and run `great_expectations init`
2. Download the data
- Download this directory of yellow taxi trip `csv` files from the Great Expectations GitHub repo. You can use a tool like DownGit to do so.
- Move the unzipped directory of `csv` files into the `data` directory that you created in Step 1.
3. Set up your Datasource
- Follow the steps in the guide How to connect to data on a filesystem using Pandas. For the purpose of this tutorial, we will work from a `yaml` docstring to set up your Datasource config. When you open up your notebook to create, test, and save your Datasource config, replace the config docstring with the following:
```python
example_yaml = f"""
name: taxi_pandas
class_name: Datasource
execution_engine:
  class_name: PandasExecutionEngine
data_connectors:
  monthly:
    base_directory: ../<YOUR BASE DIR>/
    glob_directive: '*.csv'
    class_name: ConfiguredAssetFilesystemDataConnector
    assets:
      my_reports:
        base_directory: ./
        group_names:
          - name
          - year
          - month
        class_name: Asset
        pattern: (.+)_(\d.*)-(\d.*)\.csv
"""
```
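The `pattern` and `group_names` entries work together: the regular expression's capture groups are mapped, in order, onto the configured group names. A quick illustration with Python's `re` module (the filename below is an assumed example in the taxi data's naming style):

```python
import re

# The data connector's "pattern" maps a filename's capture groups onto the
# configured group_names (name, year, month), in order.
pattern = re.compile(r"(.+)_(\d.*)-(\d.*)\.csv")

match = pattern.match("yellow_tripdata_sample_2019-01.csv")
print(match.groups())  # ('yellow_tripdata_sample', '2019', '01')
```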
- Test your YAML config to make sure it works: you should see some of the taxi `csv` filenames listed.
- Save your Datasource config.
4. Configure the Profiler
- Now, we'll create a new script in the same top-level `taxi_profiling_tutorial` directory called `profiler_script.py`. If you prefer, you could open up a Jupyter Notebook and run this there instead.
- At the top of this file, we will create a new YAML docstring assigned to a variable called `profiler_config`. This will look similar to the YAML docstring we used above when creating our Datasource. Over the next several steps, we will slowly add lines to this docstring by typing or pasting in the lines below:

```python
profiler_config = """
"""
```
First, we'll add some relevant top-level keys (`name` and `config_version`) to label our Profiler and associate it with a specific version of the feature.

```yaml
name: My Profiler
config_version: 1.0
```

Note that at the time of writing this document, `1.0` is the only supported config version.
Then, we'll add a `variables` key and some variables that we'll use. Next, we'll add a top-level `rules` key, and then the name of your rule:

```yaml
variables:
  false_positive_rate: 0.01
  mostly: 1.0
rules:
  row_count_rule:
```
After that, we'll add our Domain Builder. In this case, we'll use a `TableDomainBuilder`, which will indicate that any Expectations we build for this Domain will be at the Table level. Each Rule in our Profiler config can only use one Domain Builder.

```yaml
    domain_builder:
      class_name: TableDomainBuilder
```
Next, we'll use a `NumericMetricRangeMultiBatchParameterBuilder` to get an estimate to use for the `min_value` and `max_value` of our `expect_table_row_count_to_be_between` Expectation. This Parameter Builder will take in a Batch Request consisting of the five Batches prior to our current Batch, and use the row counts of each of those months to get a probable range of row counts that you could use in your Expectation Configuration.

```yaml
    parameter_builders:
      - name: row_count_range
        class_name: NumericMetricRangeMultiBatchParameterBuilder
        metric_name: table.row_count
        metric_domain_kwargs: $domain.domain_kwargs
        false_positive_rate: $variables.false_positive_rate
        truncate_values:
          lower_bound: 0
        round_decimals: 0
```
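To build intuition for what this Parameter Builder computes, here is a rough, hypothetical sketch: the real builder uses its own (e.g. bootstrapped) estimator, but conceptually it turns a metric's values across prior Batches into a `[min_value, max_value]` range whose width is governed by the false positive rate. The function name and the normal approximation below are illustrative assumptions, not Great Expectations API.

```python
import math

# Illustrative only: estimate a [lower, upper] range for a metric observed
# across several Batches, using a normal approximation. The real
# NumericMetricRangeMultiBatchParameterBuilder uses its own estimator.
def estimate_range(metric_values, z=2.576, lower_bound=0):
    n = len(metric_values)
    mean = sum(metric_values) / n
    stdev = math.sqrt(sum((v - mean) ** 2 for v in metric_values) / n)
    # z = 2.576 covers ~99% of a normal distribution, matching
    # false_positive_rate: 0.01; truncate at lower_bound as in the config.
    lower = max(lower_bound, round(mean - z * stdev))
    upper = round(mean + z * stdev)
    return lower, upper

# Row counts from five prior monthly Batches (made-up numbers):
row_counts = [9800, 10030, 9950, 10120, 10100]
lower, upper = estimate_range(row_counts)
```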
A Rule can have multiple Parameter Builders if needed, but in our case, we'll only use the one for now.
Finally, you would use an `ExpectationConfigurationBuilder` to actually build your `expect_table_row_count_to_be_between` Expectation, where the Domain is the Domain returned by your `TableDomainBuilder` (your entire table), and the `min_value` and `max_value` are Parameters returned by your `NumericMetricRangeMultiBatchParameterBuilder`.

```yaml
    expectation_configuration_builders:
      - expectation_type: expect_table_row_count_to_be_between
        class_name: DefaultExpectationConfigurationBuilder
        module_name: great_expectations.rule_based_profiler.expectation_configuration_builder
        min_value: $parameter.row_count_range.value[0]
        max_value: $parameter.row_count_range.value[1]
        mostly: $variables.mostly
        meta:
          profiler_details: $parameter.row_count_range.details
```
You can see here that we use a special `$` syntax to reference `variables` and `parameters` that have been previously defined in our config. You can find a more thorough description of this syntax in the docstring for `ParameterContainer`.
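To make the reference syntax concrete, here is a toy resolver (hypothetical and simplified; the real resolution logic lives in `ParameterContainer`) that looks up dotted `$` paths in a nested dictionary:

```python
# Toy illustration of "$" references: "$variables.mostly" walks the nested
# config by its dotted path. (Index suffixes like value[0] are not handled
# in this simplified sketch.)
def resolve(reference, context):
    if not isinstance(reference, str) or not reference.startswith("$"):
        return reference  # literal value, not a reference
    node = context
    for part in reference[1:].split("."):
        node = node[part]
    return node

context = {
    "variables": {"false_positive_rate": 0.01, "mostly": 1.0},
    "parameter": {"row_count_range": {"value": [9700, 10300]}},
}
print(resolve("$variables.mostly", context))  # 1.0
```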
- When we put it all together, here is what our config with our single `row_count_rule` looks like:
```yaml
# This profiler is meant to be used on the NYC taxi data (yellow_tripdata_sample_<year>-<month>.csv)
# located in tests/test_sets/taxi_yellow_tripdata_samples/
name: My Profiler
config_version: 1.0
variables:
  false_positive_rate: 0.01
  mostly: 1.0
rules:
  row_count_rule:
    domain_builder:
      class_name: TableDomainBuilder
    parameter_builders:
      - name: row_count_range
        class_name: NumericMetricRangeMultiBatchParameterBuilder
        metric_name: table.row_count
        metric_domain_kwargs: $domain.domain_kwargs
        false_positive_rate: $variables.false_positive_rate
        truncate_values:
          lower_bound: 0
        round_decimals: 0
    expectation_configuration_builders:
      - expectation_type: expect_table_row_count_to_be_between
        class_name: DefaultExpectationConfigurationBuilder
        module_name: great_expectations.rule_based_profiler.expectation_configuration_builder
        min_value: $parameter.row_count_range.value[0]
        max_value: $parameter.row_count_range.value[1]
        mostly: $variables.mostly
        meta:
          profiler_details: $parameter.row_count_range.details
```
5. Run the Profiler
Now let's use our config to Profile our data and create a simple Expectation Suite!
First, we'll do some basic setup: set up a Data Context and parse our YAML.

```python
from ruamel import yaml

from great_expectations.data_context import DataContext
from great_expectations.rule_based_profiler.rule_based_profiler import RuleBasedProfiler
from great_expectations.rule_based_profiler.rule_based_profiler_result import (
    RuleBasedProfilerResult,
)

data_context = DataContext()
full_profiler_config_dict: dict = yaml.load(profiler_config)
```
Next, we'll instantiate our Profiler, passing in our config and our Data Context:

```python
rule_based_profiler: RuleBasedProfiler = RuleBasedProfiler(
    name=full_profiler_config_dict["name"],
    config_version=full_profiler_config_dict["config_version"],
    rules=full_profiler_config_dict["rules"],
    variables=full_profiler_config_dict["variables"],
    data_context=data_context,
)
```
Finally, we'll run the Profiler and save the result to a variable.

```python
batch_request: dict = {
    "datasource_name": "taxi_pandas",
    "data_connector_name": "monthly",
    "data_asset_name": "my_reports",
    "data_connector_query": {
        "index": "-6:-1",
    },
}

result: RuleBasedProfilerResult = rule_based_profiler.run(batch_request=batch_request)
```

Then, we can print our Expectation Suite so we can see how it looks!

```python
row_count_rule_suite = """
{
    "meta": {"great_expectations_version": "0.13.19+58.gf8a650720.dirty"},
    "data_asset_type": None,
    "expectations": [
        {
            "kwargs": {"min_value": 10000, "max_value": 10000, "mostly": 1.0},
            "expectation_type": "expect_table_row_count_to_be_between",
            "meta": {
                "profiler_details": {
                    "metric_configuration": {
                        "metric_name": "table.row_count",
                        "metric_domain_kwargs": {},
                    }
                }
            },
        }
    ],
    "expectation_suite_name": "tmp_suite_Profiler_e66f7cbb",
}
"""
```
6. Add a Rule for Columns
Let's add one more Rule to our Rule-Based Profiler config. This Rule will use a `DomainBuilder` to populate a list of all of the numeric columns in one Batch of taxi data (in this case, the most recent Batch). It will then use our `NumericMetricRangeMultiBatchParameterBuilder`, looking at the five Batches prior to our most recent Batch, to get probable ranges for the min and max values of each of those columns. Finally, it will use those ranges to add two Expectation Configurations for each of those columns: `expect_column_min_to_be_between` and `expect_column_max_to_be_between`. This Rule will go directly below our previous Rule.
As before, we will first add the name of our Rule, and then specify the `DomainBuilder`.
```yaml
  column_ranges_rule:
    domain_builder:
      class_name: ColumnDomainBuilder
      include_semantic_types:
        - numeric
```
In this case, our `DomainBuilder` configuration is a bit more complex. First, we are using a `ColumnDomainBuilder` with `include_semantic_types`. This will take a table and return a list of all columns that match the semantic type specified: `numeric` in our case.
Then, we need to specify a Batch Request that returns exactly one Batch of data (this is our `data_connector_query` with `index` equal to `-1`). This tells us which Batch to use to get the columns from which we will select our numeric columns. Though we might hope that all our Batches of data have the same columns, in actuality there might be differences between the Batches, and so we explicitly specify the Batch we want to use here.
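Conceptually, what this Domain Builder does to a single Batch can be sketched with plain pandas (illustrative only: the column names below are made up, and the real builder works through Great Expectations' semantic type logic rather than pandas dtypes):

```python
import pandas as pd

# One Batch of data with a mix of numeric and non-numeric columns
# (illustrative column names, not the real taxi schema):
batch_df = pd.DataFrame(
    {
        "vendor_id": [1, 2, 1],
        "fare_amount": [7.5, 12.0, 9.25],
        "store_and_fwd_flag": ["N", "N", "Y"],
    }
)

# Keep only the columns whose values are numeric; each of these becomes a
# Domain for the Rule's Parameter and ExpectationConfiguration Builders.
numeric_columns = list(batch_df.select_dtypes(include="number").columns)
print(numeric_columns)  # ['vendor_id', 'fare_amount']
```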
After this, we specify our Parameter Builders. This is very similar to the specification in our previous Rule, except we will be specifying two `NumericMetricRangeMultiBatchParameterBuilder`s to get a probable range for the `min_value` and `max_value` of each of our numeric columns. Thus, one Parameter Builder will take the `column.min` `metric_name`, and the other will take the `column.max` `metric_name`.
```yaml
    parameter_builders:
      - name: min_range
        class_name: NumericMetricRangeMultiBatchParameterBuilder
        metric_name: column.min
        metric_domain_kwargs: $domain.domain_kwargs
        false_positive_rate: $variables.false_positive_rate
        round_decimals: 2
      - name: max_range
        class_name: NumericMetricRangeMultiBatchParameterBuilder
        metric_name: column.max
        metric_domain_kwargs: $domain.domain_kwargs
        false_positive_rate: $variables.false_positive_rate
        round_decimals: 2
```
Finally, we'll put together our Domains and Parameters in our `ExpectationConfigurationBuilders`:
```yaml
    expectation_configuration_builders:
      - expectation_type: expect_column_min_to_be_between
        class_name: DefaultExpectationConfigurationBuilder
        module_name: great_expectations.rule_based_profiler.expectation_configuration_builder
        column: $domain.domain_kwargs.column
        min_value: $parameter.min_range.value[0]
        max_value: $parameter.min_range.value[1]
        mostly: $variables.mostly
        meta:
          profiler_details: $parameter.min_range.details
      - expectation_type: expect_column_max_to_be_between
        class_name: DefaultExpectationConfigurationBuilder
        module_name: great_expectations.rule_based_profiler.expectation_configuration_builder
        column: $domain.domain_kwargs.column
        min_value: $parameter.max_range.value[0]
        max_value: $parameter.max_range.value[1]
        mostly: $variables.mostly
        meta:
          profiler_details: $parameter.max_range.details
```
Putting together our entire config, with both of our Rules, we get:
```yaml
# This profiler is meant to be used on the NYC taxi data (yellow_tripdata_sample_<year>-<month>.csv)
# located in tests/test_sets/taxi_yellow_tripdata_samples/
name: My Profiler
config_version: 1.0
variables:
  false_positive_rate: 0.01
  mostly: 1.0
rules:
  row_count_rule:
    domain_builder:
      class_name: TableDomainBuilder
    parameter_builders:
      - name: row_count_range
        class_name: NumericMetricRangeMultiBatchParameterBuilder
        metric_name: table.row_count
        metric_domain_kwargs: $domain.domain_kwargs
        false_positive_rate: $variables.false_positive_rate
        truncate_values:
          lower_bound: 0
        round_decimals: 0
    expectation_configuration_builders:
      - expectation_type: expect_table_row_count_to_be_between
        class_name: DefaultExpectationConfigurationBuilder
        module_name: great_expectations.rule_based_profiler.expectation_configuration_builder
        min_value: $parameter.row_count_range.value[0]
        max_value: $parameter.row_count_range.value[1]
        mostly: $variables.mostly
        meta:
          profiler_details: $parameter.row_count_range.details
  column_ranges_rule:
    domain_builder:
      class_name: ColumnDomainBuilder
      include_semantic_types:
        - numeric
    parameter_builders:
      - name: min_range
        class_name: NumericMetricRangeMultiBatchParameterBuilder
        metric_name: column.min
        metric_domain_kwargs: $domain.domain_kwargs
        false_positive_rate: $variables.false_positive_rate
        round_decimals: 2
      - name: max_range
        class_name: NumericMetricRangeMultiBatchParameterBuilder
        metric_name: column.max
        metric_domain_kwargs: $domain.domain_kwargs
        false_positive_rate: $variables.false_positive_rate
        round_decimals: 2
    expectation_configuration_builders:
      - expectation_type: expect_column_min_to_be_between
        class_name: DefaultExpectationConfigurationBuilder
        module_name: great_expectations.rule_based_profiler.expectation_configuration_builder
        column: $domain.domain_kwargs.column
        min_value: $parameter.min_range.value[0]
        max_value: $parameter.min_range.value[1]
        mostly: $variables.mostly
        meta:
          profiler_details: $parameter.min_range.details
      - expectation_type: expect_column_max_to_be_between
        class_name: DefaultExpectationConfigurationBuilder
        module_name: great_expectations.rule_based_profiler.expectation_configuration_builder
        column: $domain.domain_kwargs.column
        min_value: $parameter.max_range.value[0]
        max_value: $parameter.max_range.value[1]
        mostly: $variables.mostly
        meta:
          profiler_details: $parameter.max_range.details
```
And if we re-instantiate our Profiler with our config, which now has two Rules, and then re-run the Profiler, we'll have an updated Expectation Suite with a table row count Expectation for our table, and column min and column max Expectations for each of our numeric columns!
🚀 Congratulations! You have successfully Profiled multi-batch data using a Rule-Based Profiler. Now you can try adding some new Rules, or running your Profiler on some other data (remember to change the `BatchRequest` in your config)! 🚀
Additional Notes
To view the full script used in this page, see it on GitHub: