How to add support for the auto-initializing framework to a custom Expectation
Prerequisites
Steps
1. Determine if the auto-initializing framework is appropriate to include in your Expectation
Auto-initializing Expectations automate parameter estimation for Expectations, but not all parameters require this kind of estimation. If your Expectation only takes in a Domain (such as the name of a column), then it will not benefit from being configured to work in the auto-initializing framework. In general, the auto-initializing Expectation framework benefits Expectations that have numeric ranges intended to describe the data found in a Batch or Batches. Existing examples include ExpectColumnMeanToBeBetween, ExpectColumnMaxToBeBetween, and ExpectColumnSumToBeBetween.
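The rule of thumb above can be sketched as a small check. This is only an illustrative heuristic and not part of Great Expectations: the helper name has_estimable_range is hypothetical, and it assumes that range bounds are named min_value and max_value, as they are for expect_column_mean_to_be_between.

```python
# Hypothetical helper (not a Great Expectations API): guesses whether an
# Expectation has numeric range arguments that could be estimated from data.
RANGE_ARGS = {"min_value", "max_value"}  # assumed names for range bounds


def has_estimable_range(expectation_args):
    """Return True if any argument name looks like a numeric range bound."""
    return bool(RANGE_ARGS & set(expectation_args))


# expect_column_mean_to_be_between takes a column plus a numeric range.
mean_args = ["column", "min_value", "max_value", "strict_min", "strict_max"]
# An Expectation that only takes a Domain (a column name) has nothing to estimate.
domain_only_args = ["column"]

print(has_estimable_range(mean_args))         # True
print(has_estimable_range(domain_only_args))  # False
```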
2. Build a Custom Profiler for your Expectation
In order to automate the estimation of parameters, auto-initializing Expectations utilize a Custom Profiler. You will need to create an appropriate configuration for the Profiler that your Expectation will use. The easiest way to do this is to modify an existing Profiler configuration.
You can find existing Profiler configurations in the source code for any Expectation that works within the auto-initializing framework. For this example, we will look at the existing configuration for the ExpectColumnMeanToBeBetween Expectation. You can view the source code for this Expectation on our GitHub.
2a. Modifying variables
Key-value pairs defined in the variables portion of a Profiler Configuration will be shared across all of its Rules and Rule components. This helps you define and keep track of values without having to input them multiple times. In our example, the variables are:
- strict_min: Used by the expect_column_mean_to_be_between Expectation. Recognized values are True or False.
- strict_max: Used by the expect_column_mean_to_be_between Expectation. Recognized values are True or False.
- false_positive_rate: Used by NumericMetricRangeMultiBatchParameterBuilder. Typically, this will be a float between 0 and 1.0.
- quantile_statistic_interpolation_method: Used by NumericMetricRangeMultiBatchParameterBuilder when estimating quantile values (not relevant in our case). Recognized values include auto, nearest, and linear.
- estimator: Used by NumericMetricRangeMultiBatchParameterBuilder. Recognized values include oneshot, bootstrap, and kde.
- n_resamples: Used by NumericMetricRangeMultiBatchParameterBuilder. Integer values are expected.
- include_estimator_samples_histogram_in_details: Used by NumericMetricRangeMultiBatchParameterBuilder. Recognized values are True or False.
- truncate_values: Used by NumericMetricRangeMultiBatchParameterBuilder to specify the [lower_bound, upper_bound] interval, where either boundary is numeric or None. In our case the value is an empty dictionary; an equivalent configuration would be truncate_values: { lower_bound: None, upper_bound: None }.
- round_decimals: Used by NumericMetricRangeMultiBatchParameterBuilder. Determines how many digits after the decimal point to output (in our case, 2).
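As a rough illustration, the variables section can be written out as a plain Python dictionary. The values below are the ones this guide reports for the example configuration; the check_variables helper is hypothetical and only mirrors the recognized values listed above, so it should not be read as the library's own validation.

```python
# Sketch of the "variables" portion of the Profiler configuration as a plain
# Python dict. Values are those this guide reports for the example.
variables = {
    "strict_min": False,
    "strict_max": False,
    "false_positive_rate": 0.05,
    "quantile_statistic_interpolation_method": "auto",
    "estimator": "bootstrap",
    "n_resamples": 9999,
    "include_estimator_samples_histogram_in_details": False,
    "truncate_values": {},  # same as {"lower_bound": None, "upper_bound": None}
    "round_decimals": 2,
}


def check_variables(v):
    """Hypothetical sanity check (not part of Great Expectations)."""
    problems = []
    if not 0.0 <= v["false_positive_rate"] <= 1.0:
        problems.append("false_positive_rate must be between 0 and 1.0")
    if v["estimator"] not in {"oneshot", "bootstrap", "kde"}:
        problems.append("unrecognized estimator")
    if not isinstance(v["n_resamples"], int):
        problems.append("n_resamples must be an integer")
    return problems


print(check_variables(variables))  # []
```

Because every Rule component reads from this one dictionary, changing false_positive_rate here changes it everywhere it is referenced.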
2b. Modifying the domain_builder
The DomainBuilder configuration requires a class_name and a module_name. In this example, we will be using the ColumnDomainBuilder, which outputs the column of interest (for example, trip_distance in the NYC taxi data). The ExpectationConfigurationBuilder then accesses that column using the variable $domain.domain_kwargs.column.
- class_name: The name of the DomainBuilder class to use. DomainBuilders you can choose from include:
  - ColumnDomainBuilder: This DomainBuilder outputs column Domains, which are required by ColumnAggregateExpectations such as expect_column_median_to_be_between.
  - MultiColumnDomainBuilder: This DomainBuilder outputs multicolumn Domains by taking in a column list in the include_column_names parameter.
  - ColumnPairDomainBuilder: This DomainBuilder outputs column pair Domains by taking in a column pair list in the include_column_names parameter.
  - TableDomainBuilder: This DomainBuilder outputs table Domains, which are required by Expectations that act on tables, such as expect_table_row_count_to_equal or expect_table_columns_to_match_set.
  - MapMetricColumnDomainBuilder: This DomainBuilder allows you to choose columns based on Map Metrics, which give a yes/no answer for individual values or rows.
  - CategoricalColumnDomainBuilder: This DomainBuilder allows you to choose columns based on their cardinality (number of unique values).
    Note: CategoricalColumnDomainBuilder will take in various cardinality_limit_mode values for cardinality. For a full listing of valid modes, along with the associated values, please refer to the CardinalityLimitMode enum in the source code on our GitHub.
- module_name: great_expectations.rule_based_profiler.domain_builder, which is common to all DomainBuilders.
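A minimal sketch of this part of the configuration, written as a plain Python dictionary, might look like the following. The example_domain structure is an assumption made for illustration; it only shows what the reference $domain.domain_kwargs.column is meant to point at.

```python
# Sketch of the domain_builder portion of a Rule, as a plain Python dict.
domain_builder = {
    "class_name": "ColumnDomainBuilder",
    "module_name": "great_expectations.rule_based_profiler.domain_builder",
}

# Illustration only: the kind of Domain a ColumnDomainBuilder might emit for
# the trip_distance column. The exact structure is an assumption.
example_domain = {"domain_kwargs": {"column": "trip_distance"}}

# $domain.domain_kwargs.column refers to this value.
print(example_domain["domain_kwargs"]["column"])  # trip_distance
```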
2c. Modifying the ParameterBuilder
Our example contains a configuration for one ParameterBuilder, a NumericMetricRangeMultiBatchParameterBuilder. You can find the other types of ParameterBuilder by browsing the source code in our GitHub. For the NumericMetricRangeMultiBatchParameterBuilder, the configuration key-value pairs consist of:
- name: An arbitrary name assigned to this ParameterBuilder configuration.
- class_name: The name of the class that corresponds to the ParameterBuilder defined by this configuration.
- module_name: great_expectations.rule_based_profiler.parameter_builder, which is the same for all ParameterBuilders.
- json_serialize: Boolean value that determines whether to convert the computed value to JSON prior to saving the result.
- estimator: Choice of the estimation algorithm: "oneshot" (one observation), "bootstrap" (default), or "kde" (kernel density estimation). Value is pulled from $variables.estimator, which is set to "bootstrap" in our configuration.
- quantile_statistic_interpolation_method: Applicable for the "bootstrap" sampling method. Determines the interpolation "method" passed to the np.quantile() statistic, which is used for confidence intervals. Value is pulled from $variables.quantile_statistic_interpolation_method, which is set to "auto" in our configuration.
- enforce_numeric_metric: Used in MetricConfiguration to ensure that metric computations return numeric values. Set to True.
- n_resamples: Applicable for the "bootstrap" and "kde" sampling methods. If omitted (default), then 9999 is used. Value is pulled from $variables.n_resamples, which is set to 9999 in our configuration.
- round_decimals: User-configured non-negative integer indicating the number of decimals of the rounding precision of the computed parameter values (i.e., min_value, max_value) prior to packaging them on output. If omitted, then no rounding is performed, unless the computed value is already an integer. Value is pulled from $variables.round_decimals, which is 2 in our configuration.
- reduce_scalar_metric: If True (default), then reduces computation of a 1-dimensional metric to a scalar value. This value is set to True.
- include_estimator_samples_histogram_in_details: For the "bootstrap" sampling method. If True, then a 10-bin histogram of bootstraps is added to "details"; otherwise, this information is omitted (default). Value is pulled from $variables.include_estimator_samples_histogram_in_details, which is False in our configuration.
- truncate_values: User-configured directive for whether or not to allow the computed parameter values (i.e., lower_bound, upper_bound) to take on values outside the specified bounds when packaged on output. Value is pulled from $variables.truncate_values, which is an empty dictionary in our configuration (equivalent to both bounds being None).
- false_positive_rate: User-configured fraction between 0 and 1 expressing the desired false positive rate for identifying unexpected values, as judged by the upper and lower quantiles of the observed metric data. Value is pulled from $variables.false_positive_rate, which is 0.05 in our configuration.
- replace_nan_with_zero: If False, an exception is raised when the computed metric gives NaN; if True (default), a NaN result is converted to the float value 0.0. Set to True in our configuration.
- metric_domain_kwargs: Domain values for the ParameterBuilder. Pulled from $domain.domain_kwargs, and is empty in our configuration.
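To make the $variables references concrete, here is a sketch of the ParameterBuilder configuration as a plain Python dictionary, together with a simplified stand-in for how such references could be resolved. The resolve function is hypothetical and is not the library's implementation. Keys whose values this guide does not state (such as json_serialize) are left out, and the name mean_range_estimator is taken from the $parameter references used in the next section.

```python
# Sketch of the ParameterBuilder configuration as a plain Python dict.
# Strings starting with "$variables." are references into the variables section.
parameter_builder = {
    "name": "mean_range_estimator",
    "class_name": "NumericMetricRangeMultiBatchParameterBuilder",
    "module_name": "great_expectations.rule_based_profiler.parameter_builder",
    "metric_domain_kwargs": "$domain.domain_kwargs",
    "enforce_numeric_metric": True,
    "replace_nan_with_zero": True,
    "reduce_scalar_metric": True,
    "estimator": "$variables.estimator",
    "quantile_statistic_interpolation_method": "$variables.quantile_statistic_interpolation_method",
    "n_resamples": "$variables.n_resamples",
    "round_decimals": "$variables.round_decimals",
    "include_estimator_samples_histogram_in_details": "$variables.include_estimator_samples_histogram_in_details",
    "truncate_values": "$variables.truncate_values",
    "false_positive_rate": "$variables.false_positive_rate",
}

# The variables these references point at (values as reported in this guide).
variables = {
    "estimator": "bootstrap",
    "quantile_statistic_interpolation_method": "auto",
    "n_resamples": 9999,
    "round_decimals": 2,
    "include_estimator_samples_histogram_in_details": False,
    "truncate_values": {},
    "false_positive_rate": 0.05,
}


def resolve(value, variables):
    """Simplified stand-in for reference resolution (not the library's code)."""
    prefix = "$variables."
    if isinstance(value, str) and value.startswith(prefix):
        return variables[value[len(prefix):]]
    return value  # literals and other references are left untouched


resolved = {key: resolve(value, variables) for key, value in parameter_builder.items()}
print(resolved["estimator"], resolved["n_resamples"], resolved["round_decimals"])
# bootstrap 9999 2
```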
2d. Modifying the expectation_configuration_builders
Our configuration contains one ExpectationConfigurationBuilder, for the expect_column_mean_to_be_between Expectation type. The ExpectationConfigurationBuilder configuration requires an expectation_type, a class_name, and a module_name:
- expectation_type: expect_column_mean_to_be_between
- class_name: DefaultExpectationConfigurationBuilder
- module_name: great_expectations.rule_based_profiler.expectation_configuration_builder, which is common to all ExpectationConfigurationBuilders
Also included are:
- validation_parameter_builder_configs: A list of ValidationParameterBuilder configurations. In our case, it contains the ParameterBuilder described in the previous section.
Next are the parameters that are specific to the expect_column_mean_to_be_between Expectation:
- column: Pulled from the DomainBuilder using the parameter $domain.domain_kwargs.column
- min_value: Pulled from the ParameterBuilder using $parameter.mean_range_estimator.value[0]
- max_value: Pulled from the ParameterBuilder using $parameter.mean_range_estimator.value[1]
- strict_min: Pulled from $variables.strict_min, which is False.
- strict_max: Pulled from $variables.strict_max, which is False.
Last is meta, which contains details from our parameter_builder.
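Putting the pieces of this section together, the ExpectationConfigurationBuilder configuration can be sketched as a plain Python dictionary. The key name profiler_details and the reference $parameter.mean_range_estimator.details inside meta are assumptions, as are the estimated numbers used at the end, which are made up purely to show how value[0] and value[1] map onto min_value and max_value.

```python
# Sketch of the expectation_configuration_builders entry as a plain dict.
expectation_configuration_builder = {
    "expectation_type": "expect_column_mean_to_be_between",
    "class_name": "DefaultExpectationConfigurationBuilder",
    "module_name": "great_expectations.rule_based_profiler.expectation_configuration_builder",
    "validation_parameter_builder_configs": [
        # Abbreviated: the ParameterBuilder configuration from section 2c.
        {
            "name": "mean_range_estimator",
            "class_name": "NumericMetricRangeMultiBatchParameterBuilder",
        },
    ],
    "column": "$domain.domain_kwargs.column",
    "min_value": "$parameter.mean_range_estimator.value[0]",
    "max_value": "$parameter.mean_range_estimator.value[1]",
    "strict_min": "$variables.strict_min",
    "strict_max": "$variables.strict_max",
    "meta": {
        # Assumed key name and reference; check the source you copied from.
        "profiler_details": "$parameter.mean_range_estimator.details",
    },
}

# Made-up estimate, only to show how value[0] and value[1] are used.
estimated = {"value": [2.83, 3.06], "details": {}}

expectation_kwargs = {
    "column": "trip_distance",
    "min_value": estimated["value"][0],  # $parameter.mean_range_estimator.value[0]
    "max_value": estimated["value"][1],  # $parameter.mean_range_estimator.value[1]
    "strict_min": False,
    "strict_max": False,
}
print(expectation_kwargs["min_value"], expectation_kwargs["max_value"])  # 2.83 3.06
```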
3. Assign your configuration to the default_profiler_config class attribute of your Expectation
Once you have modified the necessary parts of the Profiler configuration to suit your purposes, you will need to assign it to the default_profiler_config class attribute of your Expectation. If you initially copied the Profiler configuration that you modified from another Expectation that was already set up to work with the auto-initializing framework, then you can refer to that Expectation for an example of this.
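The shape of this assignment can be sketched as follows. To keep the example runnable on its own, it uses a stand-in base class and a plain dictionary; the class name ExpectColumnCustomToBeBetween and the configuration name are hypothetical. A real Expectation would subclass a Great Expectations base class, and the library's own Expectations may wrap the configuration in a dedicated configuration object rather than a dictionary, so treat the Expectation you copied from as the authority.

```python
# Stand-in base class so the sketch runs on its own. In a real project the
# Expectation would subclass a Great Expectations base class instead.
class ExpectationStandIn:
    default_profiler_config = None


# Placeholder for the configuration built in step 2 (abbreviated here).
my_profiler_config = {
    "name": "expect_column_custom_to_be_between",  # hypothetical name
    "variables": {"false_positive_rate": 0.05, "round_decimals": 2},
    "rules": {},  # domain_builder, parameter_builders, and so on go here
}


class ExpectColumnCustomToBeBetween(ExpectationStandIn):  # hypothetical class
    # The class attribute named in this step.
    default_profiler_config = my_profiler_config


print(ExpectColumnCustomToBeBetween.default_profiler_config["name"])
# expect_column_custom_to_be_between
```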
4. Test your Expectation with auto=True
After assigning your Profiler configuration to the default_profiler_config attribute of your Expectation, your Expectation should be able to work in the auto-initializing framework. Test your Expectation with the parameter auto=True.
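A test call might look like the sketch below. It assumes the Expectation is invoked as a method on a Validator and uses the example Expectation from this guide in place of your custom one. FakeValidator is a stub that exists only so the snippet runs without data; with a real Validator, the call would estimate the bounds from your Batches instead of echoing its arguments.

```python
def run_with_auto(validator, column):
    """Call the Expectation with auto=True so its bounds are estimated."""
    return validator.expect_column_mean_to_be_between(column=column, auto=True)


class FakeValidator:
    """Stand-in used only to make this sketch runnable without data."""

    def expect_column_mean_to_be_between(self, column, auto=False, **kwargs):
        # Records the call instead of validating anything.
        return {"column": column, "auto": auto}


result = run_with_auto(FakeValidator(), "trip_distance")
print(result)  # {'column': 'trip_distance', 'auto': True}
```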