Validation
Once you've constructed and stored Expectations, you can use them to validate new data. Validation generates a report that details any specific deviations from expected values.
We recommend using a Data Context to manage Expectation Suites and coordinate validation across runs.
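For example, a minimal sketch of loading a suite through a Data Context (this assumes a project already initialized with great_expectations init; the suite name is a placeholder, and exact method names may vary by version):

import great_expectations as ge

# Minimal sketch, not a prescribed setup: load a project-managed
# Expectation Suite via a Data Context. Assumes `great_expectations init`
# has been run; "my_titanic_expectations" is a placeholder suite name.
context = ge.data_context.DataContext()
suite = context.get_expectation_suite("my_titanic_expectations")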
Validation Results
The report contains information about:
- the overall success (the success field),
- summary statistics of the expectations (the statistics field), and
- the detailed results of each expectation (the results field).
Take the following example setup:
import json
import great_expectations as ge

# Load a previously saved Expectation Suite and attach it to a new batch of data.
my_expectation_suite = json.load(open("my_titanic_expectations.json"))
my_df = ge.read_csv(
    "./tests/examples/titanic.csv",
    expectation_suite=my_expectation_suite
)
my_df.validate()
The resulting report looks like this:
{
  "results": [
    {
      "expectation_type": "expect_column_to_exist",
      "success": True,
      "kwargs": {
        "column": "Unnamed: 0"
      }
    },
    ...
    {
      "unexpected_list": 30.397989417989415,
      "expectation_type": "expect_column_mean_to_be_between",
      "success": True,
      "kwargs": {
        "column": "Age",
        "max_value": 40,
        "min_value": 20
      }
    },
    {
      "unexpected_list": [],
      "expectation_type": "expect_column_values_to_be_between",
      "success": True,
      "kwargs": {
        "column": "Age",
        "max_value": 80,
        "min_value": 0
      }
    },
    {
      "unexpected_list": [
        "Downton (?Douton), Mr William James",
        "Jacobsohn Mr Samuel",
        "Seman Master Betros"
      ],
      "expectation_type": "expect_column_values_to_match_regex",
      "success": True,
      "kwargs": {
        "regex": "[A-Z][a-z]+(?: \\([A-Z][a-z]+\\))?, ",
        "column": "Name",
        "mostly": 0.95
      }
    },
    {
      "unexpected_list": [
        "*"
      ],
      "expectation_type": "expect_column_values_to_be_in_set",
      "success": False,
      "kwargs": {
        "column": "PClass",
        "value_set": [
          "1st",
          "2nd",
          "3rd"
        ]
      }
    }
  ],
  "success": False,
  "statistics": {
    "evaluated_expectations": 10,
    "successful_expectations": 9,
    "unsuccessful_expectations": 1,
    "success_percent": 90.0
  }
}
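Because the report is a plain dictionary, you can also inspect it programmatically. For example, a small sketch that uses only the success, results, and statistics fields described above to pull out the failed expectations:

# Re-run validation and collect the individual results that failed.
report = my_df.validate()
failed = [r for r in report["results"] if not r["success"]]
for r in failed:
    print(r["expectation_type"], r["kwargs"])

# Overall pass rate, as reported in the statistics field.
print(report["statistics"]["success_percent"])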
Reviewing Validation Results
The easiest way to review Validation Results is to view them from your local Data Docs site, where you can also conveniently view Expectation Suites and, with additional configuration, Profiling Results (see Data Docs site configuration). Out of the box, Great Expectations Data Docs is configured to compile a local data documentation site when you start a new project by running great_expectations init. By default, this local site is saved to the uncommitted/data_docs/local_site/ directory of your project and will contain pages for Expectation Suites and Validation Results.
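If you need to rebuild the site after running new validations, a hedged sketch using the DataContext API (method availability may vary by version):

import great_expectations as ge

# Sketch: recompile the local Data Docs site and open it in a browser.
context = ge.data_context.DataContext()
context.build_data_docs()
context.open_data_docs()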
If you would like to review the raw Validation Results in JSON format, the default Validation Results directory is uncommitted/validations/. Note that by default, Data Docs will only compile Validation Results located in this directory.
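Since these files are ordinary JSON, you can also load them directly. A sketch (the nested layout below is an assumption; actual file paths include run ids and timestamps):

import glob
import json

# Walk the default Validation Results directory and report overall success.
for path in glob.glob("uncommitted/validations/**/*.json", recursive=True):
    with open(path) as f:
        result = json.load(f)
    print(path, result["success"])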
Checkpoints (formerly known as Validation Operators)
The example above demonstrates how to validate one batch of data against one Expectation Suite. The validate method returns a dictionary of Validation Results. This is sufficient when exploring your data and getting to know Great Expectations. When deploying Great Expectations in a real data pipeline, you will typically discover additional needs:
- validating a group of batches that are logically related
- validating a batch against several Expectation Suites
- doing something with the Validation Results (e.g., saving them for a later review, sending notifications in case of failures, etc.).
Checkpoints are mini-applications that can be configured to implement these scenarios.
Read Checkpoints and Actions to learn more.
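As a rough sketch, running a previously configured Checkpoint from Python might look like the following (the Checkpoint name is a placeholder, and the exact API depends on your Great Expectations version):

import great_expectations as ge

# Sketch: run a configured Checkpoint, which validates its batches and
# triggers any configured actions (saving results, notifications, etc.).
context = ge.data_context.DataContext()
result = context.run_checkpoint(checkpoint_name="my_checkpoint")
print(result.success)  # overall success across the Checkpoint's validations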
Reference Architectures
Useful Reference Architectures for validation include:
- Include validation at the end of a complex data transformation, to verify that no cases were lost, duplicated, or improperly merged.
- Include validation at the beginning of a script applying a machine learning model to a new batch of data, to verify that it is distributed similarly to the training and testing sets.
- Automatically trigger table-level validation when new data is dropped to an FTP site or S3 bucket, and send the validation report to the uploader and bucket owner by email.
- Schedule database validation jobs using cron, then capture errors and warnings (if any) and post them to Slack.
- Validate as part of an Airflow task: if Expectations are violated, raise an error and stop DAG propagation until the problem is resolved (see the sketch after this list). Alternatively, you can implement Expectations that raise warnings without halting the DAG.
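A hedged sketch of the Airflow pattern above, written as a plain Python callable (the paths and suite file are placeholders, not a prescribed setup):

import json
import great_expectations as ge

def validate_incoming_batch():
    # Placeholder paths: load a stored suite and validate the new batch.
    suite = json.load(open("my_titanic_expectations.json"))
    batch = ge.read_csv("/data/incoming/titanic.csv", expectation_suite=suite)
    report = batch.validate()
    if not report["success"]:
        # Raising here fails the task and stops DAG propagation.
        raise ValueError("Validation failed: {}".format(report["statistics"]))

# In a DAG, this callable would typically be wrapped in a PythonOperator.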
For certain Reference Architectures, it may be useful to parameterize Expectations and supply Evaluation Parameters at validation time. See Evaluation Parameters for more information.
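For instance, a sketch of a parameterized expectation (the parameter name minimum_age is a placeholder; see Evaluation Parameters for the exact syntax supported by your version):

# Define an expectation whose lower bound is supplied at validation time.
my_df.expect_column_values_to_be_between(
    "Age",
    min_value={"$PARAMETER": "minimum_age"},
    max_value=80
)

# Supply the concrete value when validating.
my_df.validate(evaluation_parameters={"minimum_age": 0})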