How to use Great Expectations with Amazon Web Services using Athena
Great Expectations can work within many frameworks. In this guide you will be shown a workflow for using Great Expectations with AWS and cloud storage. You will configure a local Great Expectations project to store Expectations, Validation Results, and Data Docs in Amazon S3 buckets. You will further configure Great Expectations to access data stored in an Athena database.
This guide will demonstrate each of the steps necessary to go from installing a new instance of Great Expectations to Validating your data for the first time and viewing your Validation Results as Data Docs.
This guide assumes you have:
- Completed the Getting Started Tutorial
- Installed Python 3. (Great Expectations requires Python 3. For details on how to download and install Python on your platform, see python.org).
- Installed the AWS CLI. (For guidance on how to install this, please see Amazon's documentation on how to install the AWS CLI.)
- Configured your AWS credentials. (For guidance in doing this, please see Amazon's documentation on configuring the AWS CLI.)
- The ability to install Python packages (boto3 and great_expectations) with pip.
- Identified the S3 bucket and prefix where Expectations and Validation Results will be stored.
Steps
Part 1: Setup
1.1 Ensure that the AWS CLI is ready for use
1.1.1 Verify that the AWS CLI is installed
You can verify that the AWS CLI has been installed by running the command:
aws --version
If this command does not print the version of the AWS CLI, you may need to install the AWS CLI or otherwise troubleshoot your current installation. For detailed guidance on how to do this, please refer to Amazon's documentation on how to install the AWS CLI.
1.1.2 Verify that your AWS credentials are properly configured
If you have installed the AWS CLI, you can verify that your AWS credentials are properly configured by running the command:
aws sts get-caller-identity
If your credentials are properly configured, this will output your UserId, Account, and Arn. If your credentials are not configured correctly, this will throw an error.
If an error is thrown, or if you were unable to use the AWS CLI to verify your credentials configuration, you can find additional guidance on configuring your AWS credentials by referencing Amazon's documentation on configuring the AWS CLI.
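If you would rather script this check, the JSON printed by aws sts get-caller-identity can be inspected from Python. The following is a minimal sketch only: the helper name missing_identity_fields and the sample values are hypothetical and are not part of the AWS CLI or Great Expectations.

```python
import json

# The three fields this guide says a successful call returns.
EXPECTED_FIELDS = ("UserId", "Account", "Arn")


def missing_identity_fields(cli_output: str) -> list:
    """Return the expected fields that are absent from the CLI's JSON output."""
    identity = json.loads(cli_output)
    return [field for field in EXPECTED_FIELDS if field not in identity]


# Made-up sample output, shaped like a successful response.
sample = (
    '{"UserId": "AIDAEXAMPLE", "Account": "123456789012", '
    '"Arn": "arn:aws:iam::123456789012:user/example"}'
)
print(missing_identity_fields(sample))  # an empty list means every field is present
```

In practice, cli_output could come from subprocess.run(["aws", "sts", "get-caller-identity"], capture_output=True, text=True).stdout.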
1.2 Prepare a local installation of Great Expectations
1.2.1 Verify that your Python version meets requirements
First, check the version of Python that you have installed. As of this writing, Great Expectations supports versions 3.7 through 3.10 of Python.
You can check your version of Python by running:
python --version
If this command returns something other than a Python 3 version number (like Python 3.X.X), you may need to try this:
python3 --version
If you do not have Python 3 installed, please refer to python.org for the necessary downloads and guidance to perform the installation.
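The same check can be made from inside the interpreter you intend to use. This sketch assumes the supported range stated in this guide (3.7 through 3.10); the function name is_supported_python is illustrative, and the range should be adjusted if the release you install documents a different one.

```python
import sys

# Supported range as stated in this guide; treat these bounds as an assumption.
MIN_SUPPORTED = (3, 7)
MAX_SUPPORTED = (3, 10)


def is_supported_python(version=None) -> bool:
    """Return True if the (major, minor) version falls inside the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


# With no argument, the running interpreter is checked.
print(is_supported_python())
```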
1.2.2 Create a virtual environment for your Great Expectations project
Once you have confirmed that Python 3 is installed locally, you can create a virtual environment with venv before installing your packages with pip.
Python Virtual Environments
Depending on whether you found that you needed to run python or python3 in the previous step, you will create your virtual environment by running either:
python -m venv my_venv
or
python3 -m venv my_venv
This command will create a new directory called my_venv where your virtual environment is located. To activate the virtual environment, run:
source my_venv/bin/activate
You can name your virtual environment anything you like. Simply replace my_venv in the examples above with the name that you would like to use.
1.2.3 Ensure you have the latest version of pip
Once your virtual environment is activated, you should ensure that you have the latest version of pip installed.
Pip is a tool that is used to easily install Python packages. If you have Python 3 installed you can ensure that you have the latest version of pip by running either:
python -m ensurepip --upgrade
or
python3 -m ensurepip --upgrade
1.2.4 Install boto3
Python interacts with AWS through the boto3 library. Great Expectations makes use of this library in the background when working with AWS. Therefore, although you will not need to use boto3 directly, you will need to have it installed into your virtual environment.
You can do this with the pip command:
python -m pip install boto3
or
python3 -m pip install boto3
For more detailed instructions on how to set up boto3 with AWS, and information on how you can use boto3 from within Python, please reference boto3's documentation site.
1.2.5 Install Great Expectations
You can use pip to install Great Expectations by running the appropriate pip command below:
python -m pip install great_expectations
or
python3 -m pip install great_expectations
1.2.6 Verify that Great Expectations installed successfully
You can confirm that installation worked by running:
great_expectations --version
This should return something like:
great_expectations, version 0.15.49
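The installed versions can also be read from within Python, which is convenient inside a notebook. This is a sketch that relies on importlib.metadata from the standard library (available in Python 3.8 and later, so it will not work on 3.7); the helper name installed_version is illustrative.

```python
from importlib import metadata  # standard library in Python 3.8+


def installed_version(package_name: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report on the two packages this guide asks you to install.
for name in ("great_expectations", "boto3"):
    print(name, installed_version(name) or "not installed")
```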
1.3 Create your Data Context
The simplest way to create a new Data Context (the primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components) is by using Great Expectations' CLI.
From the directory where you want to deploy Great Expectations run the following command:
great_expectations init
You should be presented with this output and prompt:
Using v3 (Batch Request) API
___ _ ___ _ _ _
/ __|_ _ ___ __ _| |_ | __|_ ___ __ ___ __| |_ __ _| |_(_)___ _ _ ___
| (_ | '_/ -_) _` | _| | _|\ \ / '_ \/ -_) _| _/ _` | _| / _ \ ' \(_-<
\___|_| \___\__,_|\__| |___/_\_\ .__/\___\__|\__\__,_|\__|_\___/_||_/__/
|_|
~ Always know what to expect from your data ~
Let's create a new Data Context to hold your project configuration.
Great Expectations will create a new directory with the following structure:
great_expectations
|-- great_expectations.yml
|-- expectations
|-- checkpoints
|-- plugins
|-- .gitignore
|-- uncommitted
    |-- config_variables.yml
    |-- data_docs
    |-- validations
OK to proceed? [Y/n]:
When you see the prompt to proceed, enter Y or simply press the enter key to continue. Great Expectations will then build out the directory structure and configuration files it needs for you to proceed.
1.4 Configure your Expectations Store on Amazon S3
1.4.1 Identify your Data Context Expectations Store
You can find the configuration for your Expectation Store (a connector to store and retrieve information about collections of verifiable assertions about data) within your Data Context.
In your great_expectations.yml file, look for the following lines:
expectations_store_name: expectations_store
stores:
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/
This configuration tells Great Expectations to look for Expectations in a store called expectations_store. The base_directory for expectations_store is set to expectations/ by default.
1.4.2 Update your configuration file to include a new Store for Expectations on Amazon S3
You can manually add an Expectations Store by adding the configuration shown below into the stores section of your great_expectations.yml file.
stores:
  expectations_S3_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: '<your_s3_bucket_name>'
      prefix: '<your_s3_bucket_folder_name>'
To make the Store work with S3 you will need to make some changes to the default store_backend settings, as has been done in the above example. The class_name should be set to TupleS3StoreBackend, bucket should be set to the name of your S3 bucket, and prefix should be set to the folder in your S3 bucket where Expectation files will be located.
Additional options are available for a more fine-grained customization of the TupleS3StoreBackend.
class_name: ExpectationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'
  boto3_options:
    endpoint_url: ${S3_ENDPOINT} # Uses the S3_ENDPOINT environment variable to determine which endpoint to use.
    region_name: '<your_aws_region_name>'
For the above example, please also note that the new Store's name is set to expectations_S3_store. This value can be any name you like as long as you also update the value of the expectations_store_name key to match the new Store's name.
expectations_store_name: expectations_S3_store
This update to the value of the expectations_store_name key will tell Great Expectations to use the new Store for Expectations.
If you are also storing Validations in S3 or DataDocs in S3, please ensure that the prefix values are disjoint and one is not a substring of the other.
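The substring rule is easy to check mechanically before you edit great_expectations.yml. The sketch below is illustrative: the function name overlapping_prefixes and the example prefixes are hypothetical, and the check is a plain substring comparison rather than anything Great Expectations itself runs.

```python
from itertools import combinations


def overlapping_prefixes(prefixes: dict) -> list:
    """Return pairs of names whose prefixes are equal or contain one another."""
    clashes = []
    for (name_a, prefix_a), (name_b, prefix_b) in combinations(prefixes.items(), 2):
        # A clash is any case where one prefix appears inside the other.
        if prefix_a in prefix_b or prefix_b in prefix_a:
            clashes.append((name_a, name_b))
    return clashes


# Hypothetical prefixes for the three S3-backed locations in this guide.
planned = {
    "expectations": "gx/expectations",
    "validations": "gx/validations",
    "data_docs": "gx/docs",
}
print(overlapping_prefixes(planned))  # an empty list means no prefix contains another
```

A layout such as "gx" for one Store and "gx/validations" for another would be reported as a clash, because the first prefix is a substring of the second.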
1.4.3 Verify that the new Amazon S3 Expectations Store has been added successfully
You can verify that your Stores are properly configured by running the command:
great_expectations store list
This will list the currently configured Stores that Great Expectations has access to. If you added a new S3 Expectations Store, the output should include the following ExpectationsStore entry:
- name: expectations_S3_store
  class_name: ExpectationsStore
  store_backend:
    class_name: TupleS3StoreBackend
    bucket: '<your_s3_bucket_name>'
    prefix: '<your_s3_bucket_folder_name>'
Notice the output contains only one Expectation Store: your configuration contains the original expectations_store on the local filesystem and the expectations_S3_store we just configured, but the great_expectations store list command only lists your active Stores. For your Expectation Store, this is the one that you set as the value of the expectations_store_name variable in the configuration file: expectations_S3_store.
1.4.4 (Optional) Copy existing Expectation JSON files to the Amazon S3 bucket
If you are converting an existing local Great Expectations deployment to one that works in AWS you may already have Expectations saved that you wish to keep and transfer to your S3 bucket.
One way to copy Expectations into Amazon S3 is by using the aws s3 sync command. As mentioned earlier, the base_directory is set to expectations/ by default.
aws s3 sync '<base_directory>' s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'
In the example below, two Expectations, exp1 and exp2, are copied to Amazon S3. This results in the following output:
upload: ./exp1.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/exp1.json
upload: ./exp2.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/exp2.json
If you have Expectations to copy into S3, your output should look similar.
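To preview where files will land before running the sync, you can list the destination for each local file. This sketch only computes the mapping and uploads nothing; it assumes that each file keeps its path relative to base_directory under the chosen prefix, and the function name planned_uploads is hypothetical.

```python
from pathlib import Path, PurePosixPath


def planned_uploads(base_directory: str, bucket: str, prefix: str) -> list:
    """List (local file, S3 URI) pairs, keeping paths relative to base_directory."""
    base = Path(base_directory)
    pairs = []
    for path in sorted(base.rglob("*")):
        if path.is_file():
            # S3 keys always use forward slashes, whatever the local OS uses.
            key = PurePosixPath(prefix) / path.relative_to(base).as_posix()
            pairs.append((str(path), f"s3://{bucket}/{key}"))
    return pairs
```

For example, planned_uploads("expectations", "<your_s3_bucket_name>", "<your_s3_bucket_folder_name>") would return one pair per file found under expectations/.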
1.4.5 (Optional) Verify that copied Expectations can be accessed from Amazon S3
If you followed the optional step to copy your existing Expectations to the S3 bucket, you can confirm that Great Expectations can find them by running the command:
great_expectations suite list
Your output should include the Expectations you copied to Amazon S3. In the example, these Expectations were stored in Expectation Suites named exp1 and exp2. This would result in the following output from the above command:
2 Expectation Suites found:
- exp1
- exp2
Your output should look similar, with the names of your Expectation Suites replacing the names from the example.
If you did not copy Expectations to the new Store, you will see a message saying no Expectations were found.
1.5 Configure your Validation Results Store on Amazon S3
1.5.1 Identify your Data Context's Validation Results Store
You can find the configuration for your Validation Results Store (a connector to store and retrieve information about objects generated when data is Validated against an Expectation Suite) within your Data Context.
Look for the following section in your Data Context's great_expectations.yml file:
validations_store_name: validations_store
stores:
  validations_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/
This configuration tells Great Expectations to look for Validation Results in a Store called validations_store. It also creates a ValidationsStore called validations_store that is backed by a Filesystem and will store Validation Results under the base_directory uncommitted/validations (the default).
1.5.2 Update your configuration file to include a new Store for Validation Results on Amazon S3
You can manually add a Validation Results Store by adding the configuration below to the stores section of your great_expectations.yml file:
stores:
  validations_S3_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: '<your_s3_bucket_name>'
      prefix: '<your_s3_bucket_folder_name>'
To make the Store work with S3, you will need to make some changes from the default store_backend settings, as has been done in the above example. The class_name should be set to TupleS3StoreBackend, bucket should be set to the name of your S3 bucket, and prefix should be set to the folder in your S3 bucket where Validation Results will be located.
For the example above, note that the new Store's name is set to validations_S3_store. This can be any name you like, as long as you also update the value of the validations_store_name key to match the new Store's name.
validations_store_name: validations_S3_store
This update to the value of the validations_store_name key will tell Great Expectations to use the new Store for Validation Results.
If you are also storing Expectations in S3 (How to configure an Expectation store to use Amazon S3), or DataDocs in S3 (How to host and share Data Docs on Amazon S3), then please ensure that the prefix values are disjoint and one is not a substring of the other.
1.5.3 Verify that the new Amazon S3 Validation Results Store has been added successfully
You can verify your active Stores are configured correctly by running the terminal command:
great_expectations store list
This will list the currently configured Stores that Great Expectations has access to. If you added a new S3 Validation Results Store, the output should include the following ValidationsStore entry:
- name: validations_S3_store
  class_name: ValidationsStore
  store_backend:
    class_name: TupleS3StoreBackend
    bucket: '<your_s3_bucket_name>'
    prefix: '<your_s3_bucket_folder_name>'
Please note that the great_expectations store list command will specifically list your active Stores, which are the ones specified by expectations_store_name, validations_store_name, evaluation_parameter_store_name, and checkpoint_store_name in the file great_expectations.yml. These are the Stores that your Data Context will use by default.
To make Great Expectations look for Validation Results on the S3 bucket, you must set the validations_store_name variable to the name of your S3 Validations Store, which in our example is validations_S3_store. The value must match the Store's name exactly, including letter case.
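The relationship between the *_store_name keys and the entries under stores can be sketched in a few lines of Python. This is an illustration working on a plain dictionary, not Great Expectations' own lookup code; the helper name active_stores and the trimmed configuration are hypothetical.

```python
# Top-level keys that name the active Stores, as listed in this guide.
ACTIVE_STORE_KEYS = (
    "expectations_store_name",
    "validations_store_name",
    "evaluation_parameter_store_name",
    "checkpoint_store_name",
)


def active_stores(config: dict) -> dict:
    """Map each *_store_name key to the Store it points at (None if no such Store)."""
    stores = config.get("stores", {})
    return {
        key: stores.get(config[key])
        for key in ACTIVE_STORE_KEYS
        if key in config
    }


# A trimmed, hypothetical configuration as it might look once parsed from YAML.
config = {
    "validations_store_name": "validations_S3_store",
    "stores": {
        "validations_store": {"class_name": "ValidationsStore"},
        "validations_S3_store": {
            "class_name": "ValidationsStore",
            "store_backend": {"class_name": "TupleS3StoreBackend"},
        },
    },
}
print(active_stores(config))
```

A name that differs only in letter case, such as validations_s3_store, would resolve to None in this sketch because the lookup is an exact match.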
Additional options are available for a more fine-grained customization of the TupleS3StoreBackend.
class_name: ValidationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'
  boto3_options:
    endpoint_url: ${S3_ENDPOINT} # Uses the S3_ENDPOINT environment variable to determine which endpoint to use.
    region_name: '<your_aws_region_name>'
1.5.4 (Optional) Copy existing Validation results to the Amazon S3 bucket
If you are converting an existing local Great Expectations deployment to one that works in AWS you may already have Validation Results saved that you wish to keep and transfer to your S3 bucket.
You can copy Validation Results into Amazon S3 by using the aws s3 sync command. As mentioned earlier, the base_directory is set to uncommitted/validations/ by default.
aws s3 sync '<base_directory>' s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'
In the example below, two Validation Results, val1 and val2, are copied to Amazon S3. This results in the following output:
upload: uncommitted/validations/val1/val1.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/val1.json
upload: uncommitted/validations/val2/val2.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/val2.json
If you have Validation Results to copy into S3, your output should look similar.
1.6 Configure Data Docs for hosting and sharing from Amazon S3
1.6.1 Create an Amazon S3 bucket for your Data Docs
You can create an S3 bucket configured for a specific location using the AWS CLI. Make sure you modify the bucket name and region for your situation.
> aws s3api create-bucket --bucket data-docs.my_org --region us-east-1
{
"Location": "/data-docs.my_org"
}
1.6.2 Configure your bucket policy to enable appropriate access
The example policy below enforces IP-based access. Modify the bucket name and IP addresses for your situation. After you have customized the example policy to suit your situation, save it to a file called ip-policy.json in your local directory.
Your policy should provide access only to appropriate users. Data Docs sites can include critical information about raw data and should generally not be publicly accessible.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Allow only based on source IP",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": [
        "arn:aws:s3:::data-docs.my_org",
        "arn:aws:s3:::data-docs.my_org/*"
      ],
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": [
            "192.168.0.1/32",
            "2001:db8:1234:1234::/64"
          ]
        }
      }
    }
  ]
}
Because Data Docs include multiple generated pages, it is important to include the arn:aws:s3:::{your_data_docs_site}/* path in the Resource list along with the arn:aws:s3:::{your_data_docs_site} path that permits access to your Data Docs' front page.
Amazon Web Service's S3 buckets are a third party utility. For more (and the most up to date) information on configuring AWS S3 bucket policies, please refer to Amazon's guide on using bucket policies.
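If you manage several buckets or address ranges, generating the policy avoids hand-editing JSON. The sketch below reproduces the example policy from this section for a given bucket and list of CIDR ranges; the function name build_ip_policy is hypothetical, and the result should still be reviewed against Amazon's bucket policy guidance before it is applied.

```python
import json


def build_ip_policy(bucket: str, allowed_cidrs: list) -> dict:
    """Build a policy allowing s3:GetObject only from the given CIDR ranges."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Allow only based on source IP",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                # Both the bucket itself and every object inside it.
                "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
                "Condition": {"IpAddress": {"aws:SourceIp": list(allowed_cidrs)}},
            }
        ],
    }


policy = build_ip_policy(
    "data-docs.my_org", ["192.168.0.1/32", "2001:db8:1234:1234::/64"]
)
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

Writing policy_json to a file named ip-policy.json gives you the file that the next step passes to the AWS CLI.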
1.6.3 Apply the access policy to your Data Docs' Amazon S3 bucket
Run the following AWS CLI command to apply the policy:
> aws s3api put-bucket-policy --bucket data-docs.my_org --policy file://ip-policy.json
1.6.4 Add a new Amazon S3 site to the data_docs_sites section of your great_expectations.yml
The example below shows the default local_site configuration that you will find in your great_expectations.yml file, followed by the s3_site configuration that you will need to add. You may optionally remove the default local_site configuration completely and replace it with the new s3_site configuration if you would only like to maintain a single S3 Data Docs site.
data_docs_sites:
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
  s3_site: # this is a user-selected name - you may select your own
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: data-docs.my_org # UPDATE the bucket name here to match the bucket you configured above.
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
      show_cta_footer: true
1.6.5 Test that your Data Docs configuration is correct by building the site
Use the following CLI command to build and open your newly configured S3 Data Docs site:
> great_expectations docs build --site-name s3_site
You will be presented with the following prompt:
The following Data Docs sites will be built:
- s3_site: https://s3.amazonaws.com/data-docs.my_org/index.html
Would you like to proceed? [Y/n]:
Signify that you would like to proceed by pressing the return key or entering Y. You will then be presented with the following messages:
Building Data Docs...
Done building Data Docs
If successful, the CLI will also open your newly built S3 Data Docs site and provide the URL, which you can share as desired. Note that the URL will only be viewable by users with IP addresses appearing in the above policy.
You may want to use the -y/--yes/--assume-yes flag with the great_expectations docs build --site-name s3_site command. This flag causes the CLI to skip the confirmation dialog, which can be useful for non-interactive environments.
Additional notes on hosting Data Docs from an Amazon S3 bucket
- Optionally, you may wish to update static hosting settings for your bucket to enable AWS to automatically serve your index.html file or a custom error file:
> aws s3 website s3://data-docs.my_org/ --index-document index.html
- If you wish to host a Data Docs site in a subfolder of an S3 bucket, add the prefix property to the configuration snippet in step 1.6.4, immediately after the bucket property.
- If you wish to host a Data Docs site through a private DNS, you can configure a base_public_path for the Data Docs Store. The following example will configure an S3 site with the base_public_path set to www.mydns.com. Data Docs will still be written to the configured location on S3 (for example https://s3.amazonaws.com/data-docs.my_org/docs/index.html), but you will be able to access the pages from your DNS (http://www.mydns.com/index.html in our example).
data_docs_sites:
  s3_site: # this is a user-selected name - you may select your own
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: data-docs.my_org # UPDATE the bucket name here to match the bucket you configured above.
      base_public_path: http://www.mydns.com
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
      show_cta_footer: true
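The S3 URLs quoted in this section follow a single pattern: the bucket name, then an optional prefix, then index.html. The sketch below simply rebuilds that pattern so you can predict the address to share; the function name data_docs_index_url is hypothetical, and the URL that Great Expectations actually prints for your site is the authoritative one.

```python
def data_docs_index_url(bucket: str, prefix: str = "") -> str:
    """Return the index.html URL, following the URL pattern shown in this guide."""
    # Drop empty segments so a prefix like "docs/" does not produce "//".
    parts = [bucket] + [part for part in prefix.split("/") if part] + ["index.html"]
    return "https://s3.amazonaws.com/" + "/".join(parts)


print(data_docs_index_url("data-docs.my_org"))
# https://s3.amazonaws.com/data-docs.my_org/index.html
print(data_docs_index_url("data-docs.my_org", prefix="docs"))
# https://s3.amazonaws.com/data-docs.my_org/docs/index.html
```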
Part 2: Connect to data
2.1 Choose how to run the code for creating a new Datasource
The previous sections of this guide involved manually editing configuration files to add configurations for Amazon S3 buckets. When setting up your Datasource configurations, it is simpler to use Great Expectations' Python API. We recommend doing this from a Jupyter Notebook, as you will then receive immediate feedback on the results of your code blocks. However, you can alternatively use a Python script in the IDE of your choice.
If you would like, you may use the Great Expectations CLI to automatically generate a pre-configured Jupyter Notebook. To do so, run the following console command from the root directory of your Great Expectations project:
great_expectations datasource new
Once you have your pre-configured Jupyter Notebook, you should follow along in the YAML-based workflow in the following steps.
If you choose to work from a blank Jupyter Notebook or a Python script, you may find it easier to use the following Python dictionary workflow over the YAML workflow. Great Expectations supports either configuration method.
2.2 Instantiate your project's DataContext
Import these necessary packages and modules.
from ruamel import yaml
import great_expectations as gx
from great_expectations.core.batch import Batch, BatchRequest, RuntimeBatchRequest
from great_expectations.exceptions import DataContextError  # caught in step 2.5
Load your DataContext into memory using the get_context() method.
context = gx.get_context()
2.3 Configure your Datasource
2.3.1 Determine your connection string
In order for Great Expectations to connect to Athena, you will need to provide a connection string. To determine your connection string, reference the examples below and the PyAthena documentation.
The following urls don't include credentials as it is recommended to use either the instance profile or the boto3 configuration file.
If you want Great Expectations to connect to your Athena instance (without specifying a particular database), the URL should be:
awsathena+rest://@athena.{region}.amazonaws.com/?s3_staging_dir={s3_path}
Note the url parameter "s3_staging_dir" needed for storing query results in S3.
If you want Great Expectations to connect to a particular database inside your Athena, the URL should be:
awsathena+rest://@athena.{region}.amazonaws.com/{database}?s3_staging_dir={s3_path}
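Assembling the URL in code keeps the region, database, and staging location in one place. The following is a sketch under two assumptions: the helper name athena_connection_string is hypothetical, and the s3_staging_dir value is URL-encoded with quote_plus, which should be confirmed against the PyAthena documentation for the version you install. The region, database, and bucket shown are made-up values.

```python
from urllib.parse import quote_plus


def athena_connection_string(region: str, s3_staging_dir: str, database: str = "") -> str:
    """Build a credential-free Athena URL in the form shown in this guide."""
    return (
        f"awsathena+rest://@athena.{region}.amazonaws.com/{database}"
        f"?s3_staging_dir={quote_plus(s3_staging_dir)}"
    )


connection_string = athena_connection_string(
    region="us-east-1",
    s3_staging_dir="s3://my-athena-results/staging/",
    database="my_database",
)
print(connection_string)
```

Leaving database empty produces the first form above, which connects without selecting a particular database.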
credentials instead of connection_string
The credentials key uses a dictionary to provide the elements of your connection string as separate, individual values. For information on how to populate the credentials dictionary and how to configure your great_expectations.yml project config file to populate credentials from either a YAML file or a secret manager, please see our guide on How to configure credentials.
2.3.2 Create your Datasource configuration
Using one of the following example configurations, set connection_string to the connection string for your database. The YAML version is shown first, followed by the equivalent Python dictionary:
datasource_yaml = f"""
name: my_awsathena_datasource
class_name: Datasource
execution_engine:
  class_name: SqlAlchemyExecutionEngine
  module_name: great_expectations.execution_engine
  connection_string: {connection_string}
data_connectors:
  default_runtime_data_connector_name:
    class_name: RuntimeDataConnector
    batch_identifiers:
      - default_identifier_name
    module_name: great_expectations.datasource.data_connector
  default_inferred_data_connector_name:
    class_name: InferredAssetSqlDataConnector
    module_name: great_expectations.datasource.data_connector
    include_schema_name: true
"""
datasource_dict = {
    "name": "my_awsathena_datasource",
    "class_name": "Datasource",
    "execution_engine": {
        "class_name": "SqlAlchemyExecutionEngine",
        "module_name": "great_expectations.execution_engine",
        "connection_string": connection_string,
    },
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
            "module_name": "great_expectations.datasource.data_connector",
        },
        "default_inferred_data_connector_name": {
            "class_name": "InferredAssetSqlDataConnector",
            "module_name": "great_expectations.datasource.data_connector",
            # A boolean, matching the unquoted true in the YAML version.
            "include_schema_name": True,
        },
    },
}
Datasources can be configured in many customized ways. For additional information on how to configure a SQL Datasource such as the one used to connect to your Athena data, please see our guide on how to configure a SQL Datasource.
2.4 Save the Datasource configuration to your DataContext
Save the configuration into your DataContext by using the add_datasource() function.
If you used the YAML configuration:
context.add_datasource(**yaml.load(datasource_yaml))
If you used the Python dictionary:
context.add_datasource(**datasource_dict)
2.5 Test your new Datasource
First, create a Batch Request.
batch_request = {
    "datasource_name": "my_awsathena_datasource",
    "data_connector_name": "default_inferred_data_connector_name",
    "data_asset_name": f"{ATHENA_DB_NAME}.taxitable",
    "limit": 1000,
}
Next, prepare an empty Expectation suite.
expectation_suite_name = "my_awsathena_expectation_suite"
try:
    # Reuse the suite if it already exists in the Expectations Store.
    suite = context.get_expectation_suite(expectation_suite_name=expectation_suite_name)
    print(
        f'Loaded ExpectationSuite "{suite.expectation_suite_name}" containing {len(suite.expectations)} expectations.'
    )
except DataContextError:
    # Otherwise create a new, empty suite with that name.
    suite = context.add_expectation_suite(expectation_suite_name=expectation_suite_name)
    print(f'Created ExpectationSuite "{suite.expectation_suite_name}".')
Now you can load data into a Validator. If this is successful then you will have verified that your Datasource is working properly.
validator = context.get_validator(
    batch_request=BatchRequest(**batch_request),
    expectation_suite_name=expectation_suite_name,
)
validator.head(n_rows=5, fetch_all=False)
Part 3: Create Expectations
3.1: Prepare a Batch Request, empty Expectation Suite, and Validator
When we tested our Datasource in step 2.5: Test your new Datasource we also created all of the components we need to begin creating Expectations: A Batch Request to provide sample data we can test our new Expectations against, an empty Expectation Suite to contain our new Expectations, and a Validator to create those Expectations with.
We can reuse those components now. Alternatively, you may follow the same process that we did before and define a new Batch Request, Expectation Suite, and Validator if you wish to use a different Batch of data as the reference sample when you are creating Expectations or if you wish to use a different name than my_awsathena_expectation_suite for your Expectation Suite.
3.2: Use a Validator to add Expectations to the Expectation Suite
There are many Expectations available for you to use. To demonstrate creating an Expectation through the use of the Validator we defined earlier, here are examples of the process for two of them:
validator.expect_column_values_to_not_be_null(column="passenger_count")
validator.expect_column_values_to_be_between(
    column="congestion_surcharge", min_value=0, max_value=1000
)
Each time you evaluate an Expectation (e.g. via validator.expect_*) two things will happen. First, the Expectation will immediately be Validated against your provided Batch of data. This instant feedback helps to zero in on unexpected data very quickly, taking a lot of the guesswork out of data exploration. Second, the Expectation configuration will be stored in the Expectation Suite you provided when the Validator was initialized.
This is the same method of interactive Expectation Suite editing used in the CLI interactive mode notebook accessed via great_expectations suite new --interactive. For more information, see our documentation on How to create and edit Expectations with instant feedback from a sample Batch of data.
You can also create Expectation Suites using a Data Assistant to automatically create expectations based on your data or manually using domain knowledge and without inspecting data directly.
To find out more about the available Expectations, please see our Expectations Gallery.
3.3: Save the Expectation Suite
When you have run all of the Expectations you want for this dataset, you can call validator.save_expectation_suite() to save the Expectation Suite (all of the unique Expectation Configurations from each run of validator.expect_*) for later use in a Checkpoint.
validator.save_expectation_suite(discard_failed_expectations=False)
Part 4: Validate Data
4.1: Create and run a Checkpoint
Here we will create and store a Checkpoint (the primary means for validating data in a production deployment of Great Expectations) for our Batch, which we can use to validate and run post-validation Actions (Python classes with a run method that takes a Validation Result and does something with it).
Checkpoints are a robust resource that can be preconfigured with a Batch Request and Expectation Suite or take them in as parameters at runtime. They can also execute numerous Actions based on the Validation Results that are returned when the Checkpoint is run.
This guide will demonstrate using a SimpleCheckpoint that takes in a Batch Request and Expectation Suite as parameters for the context.run_checkpoint(...) command.
For more information on pre-configuring a Checkpoint with a Batch Request and Expectation Suite, please see our guides on Checkpoints.
4.1.1 Create a Checkpoint
First we create the Checkpoint configuration, either as a YAML string (shown first) or as a Python dictionary (shown second):
my_checkpoint_name = "insert_your_checkpoint_name_here"
my_checkpoint_config = f"""
name: {my_checkpoint_name}
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: "%Y%m%d-%H%M%S-my-run-name-template"
"""
my_checkpoint_name = "insert_your_checkpoint_name_here"
checkpoint_config = {
    "name": my_checkpoint_name,
    "config_version": 1.0,
    "class_name": "SimpleCheckpoint",
    "run_name_template": "%Y%m%d-%H%M%S-my-run-name-template",
}
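The run_name_template value uses date and time format codes. Assuming they are filled in with standard strftime rules when the Checkpoint runs, you can preview the resulting run name with the standard library alone. The timestamp below is made up so that the result is reproducible.

```python
from datetime import datetime

run_name_template = "%Y%m%d-%H%M%S-my-run-name-template"

# A fixed, made-up timestamp; a real run would use the time the Checkpoint starts.
example_time = datetime(2023, 3, 1, 14, 30, 5)
run_name = example_time.strftime(run_name_template)
print(run_name)  # 20230301-143005-my-run-name-template
```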
Once we have defined our Checkpoint configuration, we can test our syntax using context.test_yaml_config(...). Use the first line below if you wrote the YAML string, or the second if you wrote the Python dictionary:
my_checkpoint = context.test_yaml_config(my_checkpoint_config)
my_checkpoint = context.test_yaml_config(yaml.dump(checkpoint_config))
Note that we get a message that the Checkpoint contains no validations. This is OK because we will pass them in at runtime, as we can see below when we call context.run_checkpoint(...).
4.1.2 Save the Checkpoint
After using context.test_yaml_config(...) to verify that all is well, we can add the Checkpoint to our Data Context. Use the first line below if you wrote the YAML string, or the second if you wrote the Python dictionary:
context.add_or_update_checkpoint(**yaml.load(my_checkpoint_config))
context.add_or_update_checkpoint(**checkpoint_config)
4.1.3 Run the Checkpoint
Finally, having added our Checkpoint to our Data Context, we will run the Checkpoint. Since we did not pre-configure the Checkpoint with a Batch Request and Expectation Suite, we will pass those in as a list item in the parameter validations:
checkpoint_result = context.run_checkpoint(
    checkpoint_name=my_checkpoint_name,
    validations=[
        {
            "batch_request": batch_request,
            "expectation_suite_name": expectation_suite_name,
        }
    ],
)
4.2: Build and view Data Docs
Since we used a SimpleCheckpoint, our Checkpoint already contained an UpdateDataDocsAction which rendered our Data Docs from the Validation Results we just generated. That means our Data Docs store will contain a new entry for the rendered Validation Result.
For more information on Actions that Checkpoints can perform and how to add them, please see our guides on Actions.
Viewing this new entry is as simple as running:
context.open_data_docs()
Congratulations!
🚀🚀 Congratulations! 🚀🚀 You have successfully navigated the entire workflow for using Great Expectations with Amazon Web Services and Athena, from installing Great Expectations through Validating your Data.