How to create a Custom Multicolumn Map Expectation
MulticolumnMapExpectations are a sub-type of Expectation (a verifiable assertion about data). They are evaluated for a set of columns and ask a yes/no question about the row-wise relationship between those columns. Based on the result, they then calculate the percentage of rows that gave a positive answer. If the percentage is high enough, the Expectation considers that data valid.
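To make the row-wise evaluation concrete, here is a minimal pure-Python sketch of the idea. It is not the library's implementation: the function name evaluate_rowwise, the sample rows, and the "strictly increasing" condition are illustrative assumptions, and the mostly argument stands in for the threshold that decides whether the percentage is "high enough".

```python
def evaluate_rowwise(rows, condition, mostly=1.0):
    """Return (success, fraction) for a yes/no condition asked of every row."""
    results = [bool(condition(row)) for row in rows]
    # Assumption for this sketch only: an empty table counts as fully passing.
    fraction = sum(results) / len(results) if results else 1.0
    return fraction >= mostly, fraction


def is_increasing(row):
    """Hypothetical row-wise question: do the values increase left to right?"""
    return all(a < b for a, b in zip(row, row[1:]))


# Each tuple is one row across three columns.
rows = [(1, 2, 3), (4, 5, 6), (7, 7, 2), (0, 1, 9)]
success, fraction = evaluate_rowwise(rows, is_increasing, mostly=0.7)
print(success, fraction)
```

Three of the four rows satisfy the condition, so the fraction is 0.75 and the check succeeds for any mostly value up to 0.75.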
This guide will walk you through the process of creating a custom MulticolumnMapExpectation.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- Set up your dev environment
- Read the overview for creating Custom Expectations.
Steps
1. Choose a name for your Expectation
First, decide on a name for your own Expectation. By convention, MulticolumnMapExpectations always start with expect_multicolumn_values_. You can see other naming conventions in the Expectations section of the code Style Guide.
Your Expectation will have two versions of the same name: a CamelCaseName and a snake_case_name. For example, this tutorial will use:
- ExpectMulticolumnValuesToBeMultiplesOfThree
- expect_multicolumn_values_to_be_multiples_of_three
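The two versions of the name can be derived from each other mechanically. The helpers below are a hypothetical convenience for checking that your two names agree; they are not part of Great Expectations, and they assume names made of plain capitalized words with no acronyms or digits.

```python
import re


def camel_to_snake(name):
    """Insert an underscore before each capital except the first, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def snake_to_camel(name):
    """Capitalize each underscore-separated word and join the words together."""
    return "".join(part.capitalize() for part in name.split("_") if part)


camel = "ExpectMulticolumnValuesToBeMultiplesOfThree"
snake = camel_to_snake(camel)
print(snake)
print(snake_to_camel(snake) == camel)
```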
2. Copy and rename the template file
By convention, each Expectation is kept in its own python file, named with the snake_case version of the Expectation's name.
You can find the template file for a custom MulticolumnMapExpectation here. Download the file, place it in the appropriate directory, and rename it to the appropriate name.
cp multicolumn_map_expectation_template.py /SOME_DIRECTORY/expect_multicolumn_values_to_be_multiples_of_three.py
Where should I put my Expectation file?
Within a development environment, you don't need to put the file in a specific folder. The Expectation itself should be self-contained, and can be executed anywhere as long as great_expectations is installed, which is sufficient for development and testing. However, to use your new Expectation alongside the other components of Great Expectations (as one would for production purposes), you'll need to make sure the file is in the right place. Where the right place is will depend on what you intend to use it for.
- If you're building a Custom Expectation (an extension of the `Expectation` class, developed outside of the Great Expectations library) for personal use, you'll need to put it in the great_expectations/plugins/expectations folder of your Great Expectations deployment, and import your Custom Expectation from that directory whenever it will be used. When you instantiate the corresponding DataContext, it will automatically make all Plugins (extensions of Great Expectations' components and/or functionality) in the directory available for use.
- If you're building a Custom Expectation to contribute to the open source project, you'll need to put it in the repo for the Great Expectations library itself. Most likely, this will be within a package within contrib/: great_expectations/contrib/SOME_PACKAGE/SOME_PACKAGE/expectations/. To use these Expectations, you'll need to install the package.
See our guide on how to use a Custom Expectation for more!
3. Generate a diagnostic checklist for your Expectation
Once you've copied and renamed the template file, you can execute it as follows.
python expect_multicolumn_values_to_be_multiples_of_three.py
The template file is set up so that this will run the Expectation's print_diagnostic_checklist() method. This will run a diagnostic script on your new Expectation, and return a checklist of steps to get it to full production readiness.
Completeness checklist for ExpectMulticolumnValuesToMatchSomeCriteria:
✔ Has a valid library_metadata object
Has a docstring, including a one-line short description
Has at least one positive and negative example case, and all test cases pass
Has core logic and passes tests on at least one Execution Engine
Passes all linting checks
Has basic input validation and type checking
Has both Statement Renderers: prescriptive and diagnostic
Has core logic that passes tests for all applicable Execution Engines and SQL dialects
Has a robust suite of tests, as determined by a code owner
Has passed a manual review by a code owner for code standards and style guides
When in doubt, the next step to implement is the first one that doesn't have a ✔ next to it. This guide covers the first five steps on the checklist.
4. Change the Expectation class name and add a docstring
By convention, your Metric (a computed attribute of data, such as the mean of a column) class is defined first in a Custom Expectation. For now, we're going to skip to the Expectation class and begin laying the groundwork for the functionality of your Custom Expectation.
Let's start by updating your Expectation's name and docstring.
Replace the Expectation class name
class ExpectMulticolumnValuesToMatchSomeCriteria(MulticolumnMapExpectation):
with your real Expectation class name, in upper camel case:
class ExpectMulticolumnValuesToBeMultiplesOfThree(MulticolumnMapExpectation):
You can also go ahead and write a new one-line docstring, replacing
"""TODO: Add a docstring here"""
with something like:
"""Expect a set of columns to contain multiples of three."""
You'll also need to change the class name at the bottom of the file, by replacing this line:
ExpectMulticolumnValuesToMatchSomeCriteria().print_diagnostic_checklist()
with this one:
ExpectMulticolumnValuesToBeMultiplesOfThree().print_diagnostic_checklist()
Later, you can go back and write a more thorough docstring.
At this point you can re-run your diagnostic checklist. You should see something like this:
$ python expect_multicolumn_values_to_be_multiples_of_three.py
Completeness checklist for ExpectMulticolumnValuesToBeMultiplesOfThree:
✔ Has a valid library_metadata object
✔ Has a docstring, including a one-line short description
Has at least one positive and negative example case, and all test cases pass
Has core logic and passes tests on at least one Execution Engine
Passes all linting checks
...
Congratulations! You're one step closer to implementing a Custom Expectation.
5. Add example cases
Next, we're going to search for examples = [] in your file, and replace it with at least two test examples. These examples serve a dual purpose:
-
They provide test fixtures that Great Expectations can execute automatically via pytest.
-
They help users understand the logic of your Expectation by providing tidy examples of paired input and output. If you contribute your Expectation to open source, these examples will appear in the Gallery.
Your examples will look something like this:
examples = [
    {
        "data": {
            "col_a": [3, 6, 9, 12, -3],
            "col_b": [0, -3, 6, 33, -9],
            "col_c": [1, 5, 6, 27, 3],
            "col_d": [-2, 3, 21, 0, 1],
        },
        "tests": [
            {
                "title": "basic_positive_test",
                "exact_match_out": False,
                "include_in_gallery": True,
                "in": {"column_list": ["col_a", "col_b", "col_c"], "mostly": 0.6},
                "out": {
                    "success": True,
                },
            },
            {
                "title": "basic_negative_test",
                "exact_match_out": False,
                "include_in_gallery": True,
                "in": {"column_list": ["col_b", "col_c", "col_d"], "mostly": 0.5},
                "out": {
                    "success": False,
                },
            },
        ],
        "test_backends": [
            {
                "backend": "pandas",
                "dialects": None,
            },
            {
                "backend": "sqlalchemy",
                "dialects": ["sqlite", "postgresql"],
            },
        ],
    }
]
Here's a quick overview of how to create test cases to populate examples. The overall structure is a list of dictionaries. Each dictionary has two main keys:
- data: defines the input data of the example as a table/data frame. In this example the table has four columns, named col_a, col_b, col_c, and col_d. Each column has 5 rows. (Note: when you define multiple columns, make sure that they have the same number of rows.)
- tests: a list of test cases to Validate (the act of applying an Expectation Suite to a Batch) against the data frame defined in the corresponding data.
  - title should be a descriptive name for the test case. Make sure to have no spaces.
  - include_in_gallery: This must be set to True if you want this test case to be visible in the Gallery as an example.
  - in contains exactly the parameters that you want to pass in to the Expectation. "in": {"column_list": ["col_a", "col_b", "col_c"], "mostly": 0.6} in the example above is equivalent to expect_multicolumn_values_to_be_multiples_of_three(column_list=["col_a", "col_b", "col_c"], mostly=0.6)
  - out is based on the Validation Result (generated when data is Validated against an Expectation or Expectation Suite) returned when executing the Expectation.
  - exact_match_out: if you set exact_match_out=False, then you don't need to include all the elements of the Validation Result object - only the ones that are important to test.
The example above also sets test_backends, which lists the backends and SQL dialects to run the tests against.
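Before running the diagnostics, it can help to sanity-check an entry of examples against the rules above. The helper below is a hypothetical convenience, not a Great Expectations API; it only checks three things named in this section: equal column lengths, titles without spaces, and column_list entries that exist in data.

```python
def check_example(example):
    """Return a list of human-readable problems found in one `examples` entry."""
    problems = []
    data = example["data"]
    lengths = {name: len(values) for name, values in data.items()}
    if len(set(lengths.values())) > 1:
        problems.append(f"columns differ in length: {lengths}")
    for test in example["tests"]:
        title = test["title"]
        if " " in title:
            problems.append(f"title contains spaces: {title!r}")
        for column in test["in"].get("column_list", []):
            if column not in data:
                problems.append(f"{title}: unknown column {column!r}")
    return problems


# A deliberately broken entry: uneven columns, a spaced title, a missing column.
broken = {
    "data": {"col_a": [3, 6, 9], "col_b": [0, -3]},
    "tests": [
        {"title": "basic positive test", "in": {"column_list": ["col_a", "col_z"]}},
    ],
}
problems = check_example(broken)
for problem in problems:
    print(problem)
```

An entry shaped like the one in this guide should produce an empty list.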
If you run your Expectation file again, you won't see any new checkmarks, as the logic for your Custom Expectation hasn't been implemented yet. However, you should see that the tests you've written are now being caught and reported in your checklist:
$ python expect_multicolumn_values_to_be_multiples_of_three.py
Completeness checklist for ExpectMulticolumnValuesToBeMultiplesOfThree:
✔ Has a valid library_metadata object
✔ Has a docstring, including a one-line short description
...
Has core logic that passes tests for all applicable Execution Engines and SQL dialects
Only 0 / 2 tests for pandas are passing
Failing: basic_positive_test, basic_negative_test
...
For more information on tests and example cases, see our guide on how to create example cases for a Custom Expectation.
6. Implement your Metric and connect it to your Expectation
This is the stage where you implement the actual business logic for your Expectation.
To do so, you'll need to implement a function within a Metric, and link it to your Expectation. By the time your Expectation is complete, your Metric will have functions for all three Execution Engines (systems capable of processing data to compute Metrics) supported by Great Expectations: Pandas, Spark, and SQLAlchemy. For now, we're only going to define one.
Metrics answer questions about your data posed by your Expectation, and allow your Expectation to judge whether your data meets your expectations.
Your Metric function will have the @multicolumn_condition_partial decorator, with the appropriate engine. Metric functions can be as complex as you like, but they're often very short. For example, here's the definition for a Metric function to calculate whether values across a set of columns are multiples of 3 using the PandasExecutionEngine.
@multicolumn_condition_partial(engine=PandasExecutionEngine)
def _pandas(cls, column_list, **kwargs):
    # Flag each value in the selected columns that is a multiple of 3
    return column_list.apply(lambda x: abs(x) % 3 == 0)
This is all that you need to define for now. The MulticolumnMapMetricProvider and MulticolumnMapExpectation classes have built-in logic to handle all the machinery of data validation, including standard parameters like mostly, generation of Validation Results, etc.
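To see why the two example cases from step 5 are expected to pass and fail, the sketch below works through the same data in plain Python, without Great Expectations or pandas. It assumes that a row counts as passing only when every listed column holds a multiple of three, which is consistent with the expected outcomes in the examples. The helper name fraction_passing is hypothetical.

```python
data = {
    "col_a": [3, 6, 9, 12, -3],
    "col_b": [0, -3, 6, 33, -9],
    "col_c": [1, 5, 6, 27, 3],
    "col_d": [-2, 3, 21, 0, 1],
}


def fraction_passing(data, column_list):
    """Fraction of rows where every listed column is a multiple of three."""
    rows = zip(*(data[name] for name in column_list))
    passed = [all(abs(value) % 3 == 0 for value in row) for row in rows]
    return sum(passed) / len(passed)


positive = fraction_passing(data, ["col_a", "col_b", "col_c"])
negative = fraction_passing(data, ["col_b", "col_c", "col_d"])

# Compare each fraction with the `mostly` value from the matching test case.
print(positive, positive >= 0.6)
print(negative, negative >= 0.5)
```

Three of five rows pass for the first column list, which meets mostly=0.6. Only two of five pass for the second, which falls short of mostly=0.5.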
Other parameters
Expectation Success Keys - A tuple consisting of values that must / could be provided by the user and defines how the Expectation evaluates success.
Expectation Default Kwarg Values (Optional) - Default values for success keys and the defined domain, among other values.
Metric Condition Value Keys (Optional) - Contains any additional arguments passed as parameters to compute the Metric.
Next, choose a Metric Identifier for your Metric. By convention, Metric Identifiers for Multicolumn Map Expectations start with multicolumn_values. (including the trailing period). The remainder of the Metric Identifier simply describes what the Metric computes, in snake case. For this example, we'll use multicolumn_values.multiple_three.
You'll need to substitute this metric into two places in the code. First, in the Metric class, replace
condition_metric_name = "METRIC NAME GOES HERE"
with
condition_metric_name = "multicolumn_values.multiple_three"
Second, in the Expectation class, replace
map_metric = "METRIC NAME GOES HERE"
with
map_metric = "multicolumn_values.multiple_three"
It's essential to make sure to use matching Metric Identifier strings across your Metric class and Expectation class. This is how the Expectation knows which Metric to use for its internal logic.
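One way to keep the two strings from drifting apart is to define the identifier once and reference it from both classes. This is a design suggestion rather than something the template does. The classes below are plain stand-ins with hypothetical names, so the sketch runs without the library; in your file the real classes subclass MulticolumnMapMetricProvider and MulticolumnMapExpectation.

```python
# Shared constant (an assumption of this sketch; the template uses two literals).
METRIC_ID = "multicolumn_values.multiple_three"


class MetricStandIn:
    """Stand-in for the Metric class; only the identifier attribute is shown."""

    condition_metric_name = METRIC_ID


class ExpectationStandIn:
    """Stand-in for the Expectation class; only the identifier attribute is shown."""

    map_metric = METRIC_ID


identifiers_match = MetricStandIn.condition_metric_name == ExpectationStandIn.map_metric
print(identifiers_match)
```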
Finally, rename the Metric class name itself, using the camel case version of the Metric Identifier, minus any periods.
For example, replace:
class MulticolumnValuesMatchSomeCriteria(MulticolumnMapMetricProvider):
with
class MulticolumnValuesMultipleThree(MulticolumnMapMetricProvider):
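The renaming rule (camel case, minus any periods) can be written as a small function. The helper name metric_class_name is hypothetical and only illustrates the convention; Great Expectations does not require you to generate the name this way.

```python
def metric_class_name(metric_identifier):
    """Turn a snake-case Metric Identifier into a camel-case class name."""
    # Treat periods like underscores, then capitalize and join each word.
    words = metric_identifier.replace(".", "_").split("_")
    return "".join(word.capitalize() for word in words if word)


class_name = metric_class_name("multicolumn_values.multiple_three")
print(class_name)
```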
Running your diagnostic checklist at this point should return something like this:
$ python expect_multicolumn_values_to_be_multiples_of_three.py
Completeness checklist for ExpectMulticolumnValuesToBeMultiplesOfThree:
✔ Has a valid library_metadata object
✔ Has a docstring, including a one-line short description
✔ Has at least one positive and negative example case, and all test cases pass
✔ Has core logic and passes tests on at least one Execution Engine
Passes all linting checks
...
7. Linting
Finally, we need to lint our now-functioning Custom Expectation. Our CI system will test your code using black and ruff.
If you've set up your dev environment as recommended in the Prerequisites, these libraries will already be available to you, and can be invoked from your command line to automatically lint your code:
black <PATH/TO/YOUR/EXPECTATION.py>
ruff <PATH/TO/YOUR/EXPECTATION.py> --fix
If desired, you can automate this to happen at commit time. See our guidance on linting for more on this process.
Once this is done, running your diagnostic checklist should now reflect your Custom Expectation as meeting our linting requirements:
$ python expect_multicolumn_values_to_be_multiples_of_three.py
Completeness checklist for ExpectMulticolumnValuesToBeMultiplesOfThree:
✔ Has a valid library_metadata object
✔ Has a docstring, including a one-line short description
✔ Has at least one positive and negative example case, and all test cases pass
✔ Has core logic and passes tests on at least one Execution Engine
✔ Passes all linting checks
...
Congratulations!
🎉 You've just built your first Custom MulticolumnMapExpectation! 🎉
If you've already built a Custom Column Aggregate Expectation, you may notice that we didn't implement a _validate method here. While we have to explicitly create this functionality for Column Aggregate Expectations, Multicolumn Map Expectations come with that functionality built in; no extra _validate needed!
8. Contribution (Optional)
This guide will leave you with a Custom Expectation sufficient for contribution back to Great Expectations at an Experimental level.
If you plan to contribute your Expectation to the public open source project, you should update the library_metadata object before submitting your Pull Request. For example:
library_metadata = {
    "tags": [],  # Tags for this Expectation in the Gallery
    "contributors": [  # Github handles for all contributors to this Expectation.
        "@your_name_here",  # Don't forget to add your github handle here!
    ],
}
would become
library_metadata = {
    "tags": [
        "basic math",
        "multi-column expectation",
    ],
    "contributors": ["@joegargery"],
}
This is particularly important because we want to make sure that you get credit for all your hard work!
For more information on our code standards and contribution, see our guide on Levels of Maturity for Expectations.
To view the full script used in this page, see it on GitHub: