How to add support for the auto-initializing framework to a custom Expectation
Prerequisites: This how-to guide assumes you have:
Steps
1. Determine if the auto-initializing framework is appropriate to include in your Expectation
Auto-initializing Expectations automate parameter
estimation for Expectations, but not all parameters
require this kind of estimation. If your expectation
only takes in a Domain (such as the name of a column)
then it will not benefit from being configured to work
in the auto-initializing framework. In general, the
auto-initializing Expectation framework benefits those
Expectations that have numeric ranges which are
intended to be descriptive of the data found in a
Batch or Batches. Existing examples of these would be
Expectations such as
ExpectColumnMeanToBeBetween
,
ExpectColumnMaxToBeBetween
, or
ExpectColumnSumToBeBetween
.
2. Build a Rule Based Profiler for your Expectation
In order to automate the estimation of parameters, auto-initializing Expectations utilize a Rule Based Profiler. You will need to create an appropriate configuration for the Rule Based Profiler that your Expectation will use. The easiest way to do this is to modify an existing Rule Based Profiler configuration.
You can find existing Rule Based Profiler
configurations in the source code for any Expectation
that works within the auto-initializing framework. For
this example, we will look at the existing
configuration for the
ExpectColumnMeanToBeBetween
Expectation.
You can
view the source code for this Expectation on our
GitHub.
2a. Modifying variables
Key-value pairs defined in the
variables
portion of a Rule Based
Profiler Configuration will be shared across all of
its Rules
and
Rule
components. This helps you define
and keep track of values without having to input them
multiple times. In our example, the
variables
are:
-
strict_min
: Used byexpect_column_mean_to_be_between
Expectation. Recognized values areTrue
orFalse
. -
strict_max
: Used byexpect_column_mean_to_be_between
Expectation. Recognized values areTrue
orFalse
. -
false_positive_rate
: Used byNumericMetricRangeMultiBatchParameterBuilder
. Typically, this will be a float0 <= 1.0
. -
quantile_statistic_interpolation_method
: Used byNumericMetricRangeMultiBatchParameterBuilder
, which is used when estimating quantile values (not relevant in our case). Recognized values includeauto
,nearest
, andlinear
. -
estimator
: Used byNumericMetricRangeMultiBatchParameterBuilder
. Recognized values includeoneshot
,bootstrap
, andkde
. -
n_resamples
: Used byNumericMetricRangeMultiBatchParameterBuilder
. Integer values are expected. -
include_estimator_samples_histogram_in_details
: Used byNumericMetricRangeMultiBatchParameterBuilder
. Recognized values areTrue
orFalse
. -
truncate_values
: A value used by theNumericMetricRangeMultiBatchParameterBuilder
to specify the[lower_bound, upper_bound]
interval, where either boundary is numeric or None. In our case the value is an empty dictionary, and an equivalent configuration would have beentruncate_values : { lower_bound: None, upper_bound: None }
. -
round_decimals
: Used byNumericMetricRangeMultiBatchParameterBuilder
, and determines how many digits after the decimal point to output (in our case 2).
2b. Modifying the domain_builder
The DomainBuilder
configuration requries
a class_name
and
module_name
. In this example, we will be
using the ColumnDomainBuilder
which
outputs the column of interest (for example:
trip_distance
in the NYC taxi data) which
is then accessed by the
ExpectationConfigurationBuilder
using the
variable $domain.domain_kwargs.column
.
-
class_name
: is the name of the DomainBuilder class that is to be used. Additional Domain Builders are:-
ColumnDomainBuilder
: ThisDomainBuilder
outputs column Domains, which are required byColumnExpectations
like (expect_column_median_to_be_between
). -
MultiColumnDomainBuilder
: This DomainBuilder outputsmulticolumn
Domains by taking in a column list in theinclude_column_names
parameter. -
ColumnPairDomainBuilder
: This DomainBuilder outputs columnpair domains by taking in a column pair list in the include_column_names parameter. -
TableDomainBuilder
: ThisDomainBuilder
outputs tableDomains
, which is required byExpectations
that act on tables, like (expect_table_row_count_to_equal
, orexpect_table_columns_to_match_set
). -
MapMetricColumnDomainBuilder
: ThisDomainBuilder
allows you to choose columns based on Map Metrics, which give a yes/no answer for individual values or rows. -
CategoricalColumnDomainBuilder
: ThisDomainBuilder
allows you to choose columns based on their cardinality (number of unique values).noteCategoricalColumnDomainBuilder
will take in variouscardinality_limit_mode
values for cardinality. For a full listing of valid modes, along with the associated values, please refer to theCardinalityLimitMode
enum in the source code on our GitHub.
-
-
module_name
: isgreat_expectations.rule_based_profiler.domain_builder
, which is common for allDomainBuilders
.
2c. Modifying the ParameterBuilder
Our example contains a configuration for one
ParamterBuilder
, a
NumericMetricRangeMultiBatchParameterBuilder
. You can find the other types of
ParameterBuilder
by browsing
the source code in our GitHub. For the
NumericMetricRangeMultiBatchParameterBuilder
the configuration key-value pairs consist of:
-
name
: an arbitrary name assigned to thisParameterBuilder
configuration. -
class_name
: the name of the class that corresponds to theParameterBuilder
defined by this configuration. -
module_name
:great_expectations.rule_based_profiler.parameter_builder
which is the same for allParameterBuilders
. -
json_serialize
: Boolean value that determines whether to convert computed value to JSON prior to saving result. -
estimator
: choice of the estimation algorithm: "oneshot" (one observation), "bootstrap" (default), or "kde" (kernel density estimation). Value is pulled from$variables.estimator
, which is set to "bootstrap" in our configuration. -
quantile_statistic_interpolation_method
: Applicable for the "bootstrap" sampling method. Determines the value of interpolation "method" tonp.quantile()
statistic, which is used for confidence intervals. Value is pulled from$variables.quantile_statistic_interpolation_method
, which is set to "auto" in our configuration. -
enforce_numeric_metric
: used inMetricConfiguration
to ensure that metric computations return numeric values. Set toTrue
. -
n_resamples
: Applicable for the "bootstrap" and "kde" sampling methods -- if omitted (default), then 9999 is used. Value is pulled from$variables.n_resamples
, which is set to9999
in our configuration. -
round_decimals
: User-configured non-negative integer indicating the number of decimals of the rounding precision of the computed parameter values (i.e.,min_value
,max_value
) prior to packaging them on output. If omitted, then no rounding is performed, unless the computed value is already an integer. Value is pulled from$variables.round_decimals
which is2
in our configuration. -
reduce_scalar_metric
: IfTrue
(default), then reduces computation of 1-dimensional metric to scalar value. This value is set toTrue
. -
include_estimator_samples_histogram_in_details
: For the "bootstrap" sampling method -- if True, then add 10-bin histogram of bootstraps to "details"; otherwise, omit this information (default). Value pulled from$variables.include_estimator_samples_histogram_in_details
, which isFalse
in our configuration. -
truncate_values
: User-configured directive for whether or not to allow the computed parameter values (i.e.,lower_bound
,upper_bound
) to take on values outside the specified bounds when packaged on output. Value pulled from$variables.truncate_values
, which isNone
in our configuration. -
false_positive_rate
: User-configured fraction between 0 and 1 expressing desired false positive rate for identifying unexpected values as judged by the upper- and lower- quantiles of the observed metric data. Value pulled from$variables.false_positive_rate
and is0.05
in our configuration. -
replace_nan_with_zero
: If False, then if the computed metric givesNaN
, then exception is raised; otherwise, if True (default), then if the computed metric gives NaN, then it is converted to the 0.0 (float) value. Set toTrue
in our configuration. -
metric_domain_kwargs
: Domain values forParameteBuilder
. Pulled from$domain.domain_kwargs
, and is empty in our configuration.
2d. Modifing the
expectation_configuration_builders
Our Configuration contains 1
ExpectationConfigurationBuilder
, for the
expect_column_mean_to_be_between
Expectation type.
The
ExpectationConfigurationBuilder
configuration requires a
expectation_type
,
class_name
and module_name
:
-
expectation_type
:expect_column_mean_to_be_between
-
class_name
:DefaultExpectationConfigurationBuilder
-
module_name
:great_expectations.rule_based_profiler.expectation_configuration_builder
which is common for allExpectationConfigurationBuilders
Also included are:
-
validation_parameter_builder_configs
: Which are a list ofValidationParameterBuilder
configurations, and our configuration case contains theParameterBuilder
described in the previous section.
Next are the parameters that are specific to the
expect_column_mean_to_be_between
Expectation
.
-
column
: Pulled fromDomainBuilder
using the parameter$domain.domain_kwargs.column
-
min_value
: Pulled from theParameterBuilder
using$parameter.mean_range_estimator.value[0]
-
max_value
: Pulled from theParameterBuilder
using$parameter.mean_range_estimator.value[1]
-
strict_min
: Pulled from `$variables.strict_min
, which isFalse
. -
strict_max
: Pulled from `$variables.strict_max
, which isFalse
.
Last is meta
which contains
details
from our
parameter_builder
.
3. Assign your configuration to the
default_profiler_config
class attribute
of your Expectation
Once you have modified the necessary parts of the Rule
Based Profiler configuration to suite your purposes
you will need to assign it to the
default_profiler_config
class attribute
of your Expectation. If you initially copied the Rule
Based Profiler configuration that you modified from
another Expectation that was already set up to work
with the auto-initializing framework then you can
refer to that Expectation for an example of this.
4. Test your Expectation with auto=True
After assigning your Rule Based Profiler configuration
to the default_profiler_config
attribute
of your Expectation, your Expectation should be able
to work in the auto-initializing framework.
Test your expectation
with the parameter auto=True
.