Use Great Expectations with Amazon Web Services using S3 and Pandas
Great Expectations can work within many frameworks. In this guide you will be shown a workflow for using Great Expectations with AWS and cloud storage. You will configure a local Great Expectations project to store Expectations, Validation Results, and Data Docs in Amazon S3 buckets. You will further configure Great Expectations to use Pandas and access data stored in another Amazon S3 bucket.
This guide will demonstrate each of the steps necessary to go from installing a new instance of Great Expectations to Validating your data for the first time and viewing your Validation Results as Data Docs.
Prerequisites
- An installation of Python, version 3.8 to 3.11. To download and install Python, see Python downloads.
- The AWS CLI. To download and install the AWS CLI, see Installing or updating the latest version of the AWS CLI.
- AWS credentials. See Configuring the AWS CLI.
- Permissions to install the Python packages (boto3 and great_expectations) with pip.
- An S3 bucket and prefix to store Expectations and Validation Results.
Steps
Part 1: Setup
1.1 Ensure that the AWS CLI is ready for use
1.1.1 Verify that the AWS CLI is installed
Run the following code to verify that the AWS CLI is installed:
aws --version
If this code does not return the AWS CLI version information, you may need to install the AWS CLI or troubleshoot your current installation. See Install or update the latest version of the AWS CLI.
1.1.2 Verify that your AWS credentials are properly configured
Run the following command in the AWS CLI to verify that your AWS credentials are properly configured:
aws sts get-caller-identity
When your credentials are properly configured, your UserId, Account, and Arn are returned. If your credentials are not configured correctly, an error message appears. If you received an error message, or you couldn't verify your credentials, see Configuring the AWS CLI.
1.2 Prepare a local installation of Great Expectations
1.2.1 Verify that your Python version meets requirements
Run the following code to check what version of Python is currently installed:
python --version
Great Expectations supports Python versions 3.8 to 3.11. If a Python 3 version number is not returned, run the following code:
python3 --version
If you do not have Python 3 installed, go to python.org for the current downloads and installation guidance.
1.2.2 Create a virtual environment for your Great Expectations project
After you have confirmed that Python 3 is installed locally, you can create a virtual environment with venv before installing your packages with pip. The following examples use venv for virtual environments because it is included with Python 3. You can use alternate tools such as virtualenv and pyenv to install GX in virtual environments.
Run one of the following code blocks to create your virtual environment:
python -m venv my_venv
or
python3 -m venv my_venv
A new directory named my_venv is created; it contains your virtual environment.
Run the following code to activate the virtual environment:
source my_venv/bin/activate
To change the name of your virtual environment, replace my_venv in the example code.
1.2.3 Ensure you have the latest version of pip
After you've activated your virtual environment, ensure that you have the latest version of pip installed. Pip is the tool used to install Python packages.
Run the following code to ensure that you have the latest version of pip installed:
python -m ensurepip --upgrade
or
python3 -m ensurepip --upgrade
1.2.4 Install boto3
Python interacts with AWS through the boto3 library. Great Expectations uses this library in the background when working with AWS. Although you won't use boto3 directly, you'll need to install it in your virtual environment.
Run one of the following pip commands to install boto3 in your virtual environment:
python -m pip install boto3
or
python3 -m pip install boto3
To set up boto3 with AWS and use boto3 within Python, see the Boto3 documentation.
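If you want a quick, optional sanity check that boto3 can see the AWS credentials you configured earlier, the following minimal Python snippet (not required for the rest of this guide) lists the S3 buckets your credentials can access:
import boto3

# Optional check: confirm boto3 picks up the credentials you configured for the AWS CLI.
s3 = boto3.client("s3")
response = s3.list_buckets()
print([bucket["Name"] for bucket in response["Buckets"]])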
1.2.5 Install Great Expectations
Run one of the following code blocks to use pip to install Great Expectations:
python -m pip install great_expectations
or
python3 -m pip install great_expectations
1.2.6 Verify that Great Expectations installed successfully
Run the following code to confirm the GX installation is working:
great_expectations --version
Version information similar to the following is returned:
great_expectations, version 0.17.23
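You can also confirm the installed version from Python; the printed value should match the CLI output above:
import great_expectations as gx

# Prints the installed Great Expectations version, for example 0.17.23.
print(gx.__version__)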
1.3 Create your Data Context
The simplest way to create a new Data Context (the primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components) is by using the create() method.
From a Notebook or script where you want to deploy Great Expectations, run the following command. Here, full_path_to_project_directory can be an empty directory where you intend to build your Great Expectations configuration:
import great_expectations as gx
context = gx.data_context.FileDataContext.create(full_path_to_project_directory)
1.4 Configure your Expectations Store on Amazon S3
1.4.1 Identify your Data Context Expectations Store
Your Expectations Store (a connector to store and retrieve information about collections of verifiable assertions about data) configuration is in your Data Context.
The following section in your Data Context's great_expectations.yml file tells Great Expectations to look for Expectations in a Store named expectations_store:
stores:
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/

expectations_store_name: expectations_store
The default base_directory for expectations_store is expectations/.
1.4.2 Update your configuration file to include a new Store for Expectations on Amazon S3
To manually add an Expectations Store to your configuration, add the following configuration to the stores section of your great_expectations.yml file:
stores:
  expectations_S3_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: '<your_s3_bucket_name>'
      prefix: '<your_s3_bucket_folder_name>'  # Bucket and prefix in combination must be unique across all stores

expectations_store_name: expectations_S3_store
Change the default store_backend settings to make the Store work with S3. The class_name is set to TupleS3StoreBackend, bucket is the address of your S3 bucket, and prefix is the folder in your S3 bucket where Expectations are located.
The following example shows the additional options that are available to customize TupleS3StoreBackend:
class_name: ExpectationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'  # Bucket and prefix in combination must be unique across all stores
  boto3_options:
    endpoint_url: ${S3_ENDPOINT}  # Uses the S3_ENDPOINT environment variable to determine which endpoint to use.
    region_name: '<your_aws_region_name>'
In the previous example, the Store name is expectations_S3_store. If you use a personalized Store name, you must also update the value of the expectations_store_name key to match the Store name. For example:
expectations_store_name: expectations_S3_store
When you update the expectations_store_name key value, Great Expectations uses the new Store for Expectations.
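If you want to confirm that the S3-backed Store is being picked up, a minimal check from Python is shown below (the project path is a placeholder you must fill in):
import great_expectations as gx

# Reload the project and confirm Expectations are now resolved through the S3-backed Store.
context = gx.get_context(context_root_dir="<full_path_to_project_directory>")
print(context.expectations_store_name)  # expected: expectations_S3_store
print(list(context.stores.keys()))      # the list should include expectations_S3_store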
Add the following code to great_expectations.yml to configure the IAM user:
class_name: ExpectationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'
  boto3_options:
    aws_access_key_id: ${AWS_ACCESS_KEY_ID}  # Uses the AWS_ACCESS_KEY_ID environment variable to get aws_access_key_id.
    aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
    aws_session_token: ${AWS_SESSION_TOKEN}
Add the following code to great_expectations.yml to configure the IAM Assume Role:
class_name: ExpectationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'  # Bucket and prefix in combination must be unique across all stores
  boto3_options:
    assume_role_arn: '<your_role_to_assume>'
    region_name: '<your_aws_region_name>'
    assume_role_duration: session_duration_in_seconds
If you're also storing Validations in S3 or Data Docs in S3, make sure that the prefix values are disjoint and one is not a substring of the other.
1.4.3 (Optional) Copy existing Expectation JSON files to the Amazon S3 bucket
If you are converting an existing local Great Expectations deployment to one that works in AWS, you might have Expectations saved that you want to transfer to your S3 bucket.
Run the following aws s3 sync command to copy Expectations into Amazon S3:
aws s3 sync '<base_directory>' s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'
The base_directory is set to expectations/ by default.
In the following example, the Expectations exp1 and exp2 are copied to Amazon S3 and a confirmation message is returned:
upload: ./exp1.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/exp1.json
upload: ./exp2.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/exp2.json
1.5 Configure your Validation Results Store on Amazon S3
1.5.1 Identify your Data Context's Validation Results Store
Your Validation Results Store (a connector to store and retrieve information about objects generated when data is Validated against an Expectation Suite) configuration is in your Data Context.
The following section in your Data Context's great_expectations.yml file tells Great Expectations to look for Validation Results in a Store named validations_store. It also creates a ValidationsStore named validations_store that is backed by a Filesystem and stores Validation Results under the base_directory uncommitted/validations (the default).
stores:
  validations_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/

validations_store_name: validations_store
1.5.2 Update your configuration file to include a new Store for Validation Results on Amazon S3
To manually add a Validation Results Store, add the following configuration to the stores section of your great_expectations.yml file:
stores:
  validations_S3_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: '<your_s3_bucket_name>'
      prefix: '<your_s3_bucket_folder_name>'  # Bucket and prefix in combination must be unique across all stores
As shown in the previous example, you need to change the default store_backend settings to make the Store work with S3. The class_name is set to TupleS3StoreBackend, bucket is the address of your S3 bucket, and prefix is the folder in your S3 bucket where Validation Results are located.
The following example shows the additional options that are available to customize TupleS3StoreBackend:
class_name: ValidationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'  # Bucket and prefix in combination must be unique across all stores
  boto3_options:
    endpoint_url: ${S3_ENDPOINT}  # Uses the S3_ENDPOINT environment variable to determine which endpoint to use.
    region_name: '<your_aws_region_name>'
In the previous example, the Store name is validations_S3_store. If you use a personalized Store name, you must also update the value of the validations_store_name key to match the Store name. For example:
validations_store_name: validations_S3_store
When you update the validations_store_name key value, Great Expectations uses the new Store for Validation Results.
Add the following code to great_expectations.yml to configure the IAM user:
class_name: ValidationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'  # Bucket and prefix in combination must be unique across all stores
  boto3_options:
    aws_access_key_id: ${AWS_ACCESS_KEY_ID}  # Uses the AWS_ACCESS_KEY_ID environment variable to get aws_access_key_id.
    aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
    aws_session_token: ${AWS_SESSION_TOKEN}
Add the following code to great_expectations.yml to configure the IAM Assume Role:
class_name: ValidationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'  # Bucket and prefix in combination must be unique across all stores
  boto3_options:
    assume_role_arn: '<your_role_to_assume>'
    region_name: '<your_aws_region_name>'
    assume_role_duration: session_duration_in_seconds
If you are also storing Expectations in S3 or Data Docs in S3, make sure that the prefix values are disjoint and one is not a substring of the other.
1.5.3 (Optional) Copy existing Validation results to the Amazon S3 bucket
If you are converting an existing local Great Expectations deployment to one that works in AWS, you might have Validation Results saved that you want to transfer to your S3 bucket.
To copy Validation Results into Amazon S3, use the aws s3 sync command as shown in the following example:
aws s3 sync '<base_directory>' s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'
The base_directory is set to uncommitted/validations/ by default.
In the following example, the Validation Results val1 and val2 are copied to Amazon S3 and a confirmation message is returned:
upload: uncommitted/validations/val1/val1.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/val1.json
upload: uncommitted/validations/val2/val2.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/val2.json
1.6 Configure Data Docs for hosting and sharing from Amazon S3
1.6.1 Create an Amazon S3 bucket for your Data Docs
In the AWS CLI, run the following command to create an S3 bucket configured for a specific location. Modify the bucket name and region for your environment.
> aws s3api create-bucket --bucket data-docs.my_org --region us-east-1
{
"Location": "/data-docs.my_org"
}
1.6.2 Configure your bucket policy to enable appropriate access
The example policy below enforces IP-based access. Modify the bucket name and IP addresses for your environment. After you have customized the example policy to suit your situation, name the file ip-policy.json and save it in your local directory.
Your policy should limit access to authorized users. Data Docs sites can include sensitive information and should not be publicly accessible.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Allow only based on source IP",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": [
        "arn:aws:s3:::data-docs.my_org",
        "arn:aws:s3:::data-docs.my_org/*"
      ],
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": [
            "192.168.0.1/32",
            "2001:db8:1234:1234::/64"
          ]
        }
      }
    }
  ]
}
Because Data Docs include multiple generated pages, it is important to include the arn:aws:s3:::{your_data_docs_site}/* path in the Resource list along with the arn:aws:s3:::{your_data_docs_site} path that permits access to your Data Docs' front page.
Amazon Web Services' S3 buckets are a third-party utility. For more information about configuring AWS S3 bucket policies, see Using bucket policies.
1.6.3 Apply the access policy to your Data Docs' Amazon S3 bucket
Run the following AWS CLI command to apply the policy:
> aws s3api put-bucket-policy --bucket data-docs.my_org --policy file://ip-policy.json
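If you prefer to verify the policy from Python rather than the AWS console, a small boto3 sketch (assuming the bucket name used above) reads the policy back:
import boto3

# Read back the bucket policy that was just applied.
s3 = boto3.client("s3")
print(s3.get_bucket_policy(Bucket="data-docs.my_org")["Policy"])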
1.6.4 Add a new Amazon S3 site to the data_docs_sites section of your great_expectations.yml
The following example shows the default local_site configuration that you will find in your great_expectations.yml file, followed by the s3_site configuration that you will need to add. To maintain a single S3 Data Docs site, remove the default local_site configuration and replace it with the new s3_site configuration.
data_docs_sites:
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
  s3_site:  # this is a user-selected name - you may select your own
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: '<your_s3_bucket_name>'
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
1.6.5 Test that your Data Docs configuration is correct by building the site
Run the following code to build your newly configured S3 Data Docs site:
context.build_data_docs()
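As an optional check that the site was published where you expect, you can also print the URLs of every configured Data Docs site (this assumes the context object created earlier):
# Lists each configured Data Docs site and the URL it was published to.
for site in context.get_docs_sites_urls():
    print(site["site_name"], site["site_url"])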
Additional notes on hosting Data Docs from an Amazon S3 bucket
- Run the following code to update static hosting settings for your bucket to enable AWS to automatically serve your index.html file or a custom error file:
> aws s3 website s3://data-docs.my_org/ --index-document index.html
- To host a Data Docs site in a subfolder of an S3 bucket, add the prefix property to the configuration snippet immediately after the bucket property.
- To host a Data Docs site through a private DNS, you can configure a base_public_path for the Data Docs Store (a connector to store and retrieve the human-readable documentation generated from Great Expectations metadata). The following example configures an S3 site with the base_public_path set to www.mydns.com. Data Docs will still be written to the configured location on S3 (for example, https://s3.amazonaws.com/data-docs.my_org/docs/index.html), but you can access the pages from your DNS (http://www.mydns.com/index.html in our example):
data_docs_sites:
  s3_site:  # this is a user-selected name - you may select your own
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: data-docs.my_org  # UPDATE the bucket name here to match the bucket you configured above.
      base_public_path: http://www.mydns.com
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
      show_cta_footer: true
Part 2: Connect to data
2.1 Instantiate your project's DataContext
The simplest way to create a new
Data ContextThe primary entry point for a Great Expectations
deployment, with configurations and methods for
all supporting components.
is by using the create()
method.
From a Notebook or script where you want to deploy
Great Expectations run the following command. Here the
full_path_to_project_directory
can be an
empty directory where you intend to build your Great
Expectations configuration.:
import great_expectations as gx
context = gx.data_context.FileDataContext.create(full_path_to_project_directory)
If you have already instantiated your
DataContext
in a previous step, this step
can be skipped.
2.2 Add Data Source to your DataContext
Using this example configuration, add in your S3 bucket and path to a directory that contains some of your data:
datasource = context.sources.add_or_update_pandas_s3(
name="s3_datasource", bucket="taxi-data-sample-test"
)
In the example, we have added a Data Source that connects to data in S3 using a Pandas dataframe. The name of the new Data Source is s3_datasource and it refers to an S3 bucket named taxi-data-sample-test.
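As an optional check, you can fetch the Data Source back from the Data Context to confirm it was registered (the name s3_datasource comes from the example above):
# Retrieve the newly added Data Source by name; this raises an error if it was not registered.
print(context.get_datasource("s3_datasource"))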
2.3 Add a CSV Asset to your Data Source
Add a CSV Asset to your Data Source by using the add_csv_asset function.
asset = datasource.add_csv_asset(
name="csv_taxi_s3_asset",
batching_regex=r".*_(?P<year>\d{4})\.csv",
)
Here we have added an Asset named csv_taxi_s3_asset using the add_csv_asset function. The batching_regex is a regular expression that indicates which files to treat as Batches in the Asset, and how to identify them.
In our example, the pattern r".*_(?P<year>\d{4})\.csv" is intended to build a Batch for each file in the S3 bucket, which are:
yellow_tripdata_sample_2020.csv
yellow_tripdata_sample_2021.csv
yellow_tripdata_sample_2022.csv
The batching_regex pattern will match the 4 digits of the year portion of each file name and assign them to the year group.
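Because year is a named group, it also becomes a Batch option, so you can request a single year's file. A small, hedged sketch (the exact option names depend on your Asset):
# Show the available Batch options for the Asset, then request only the 2021 file.
print(asset.batch_request_options)
request_2021 = asset.build_batch_request(options={"year": "2021"})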
2.4 Test your new Data Source
Verify your new Data Source (provides a standard API for accessing and interacting with data from a wide variety of source systems) by loading data from it into a Validator (used to run an Expectation Suite against data) using a Batch Request (provided to a Datasource in order to create a Batch).
First, build a Batch Request from the Asset you created in the previous step, then load a Batch into a Validator:
request = asset.build_batch_request()  # covers every Batch matched by the batching_regex
context.add_or_update_expectation_suite(expectation_suite_name="test_suite")
validator = context.get_validator(
    batch_request=request, expectation_suite_name="test_suite"
)
print(validator.head())
Part 3: Create Expectations
3.1: Prepare a Batch Request, empty Expectation Suite, and Validator
When we tested our Data Source in step 2.4 (Test your new Data Source), we also created all of the components we need to begin creating Expectations: a Batch Request to provide sample data to test our new Expectations against, an empty Expectation Suite to contain our new Expectations, and a Validator to create those Expectations with.
We can reuse those components now. Alternatively, you may follow the same process that we did before and define a new Batch Request, Expectation Suite, and Validator if you wish to use a different Batch of data as the reference sample when you are creating Expectations, or if you wish to use a different name than test_suite for your Expectation Suite.
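If you do choose to define new components rather than reuse the existing ones, a minimal sketch (using the hypothetical suite name my_suite) looks like this:
# Build a fresh Batch Request, Expectation Suite, and Validator.
request = asset.build_batch_request()
context.add_or_update_expectation_suite(expectation_suite_name="my_suite")
validator = context.get_validator(
    batch_request=request, expectation_suite_name="my_suite"
)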
3.2: Use a Validator to add Expectations to the Expectation Suite
There are many Expectations available for you to use. To demonstrate the creation of an Expectation through the use of the Validator you defined earlier, here are examples of the process for two of them:
validator.expect_column_values_to_not_be_null(column="passenger_count")
validator.expect_column_values_to_be_between(
column="congestion_surcharge", min_value=0, max_value=1000
)
Each time you evaluate an Expectation with validator.expect_*, the Expectation is immediately Validated against your provided Batch of data. This instant feedback helps you identify unexpected data quickly. The Expectation configuration is stored in the Expectation Suite you provided when the Validator was initialized.
You can also create Expectation Suites using a Data Assistant to automatically create expectations based on your data or manually using domain knowledge and without inspecting data directly.
To find out more about the available Expectations, see the Expectations Gallery.
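As a sketch of the Data Assistant approach mentioned above (assuming the request Batch Request from Part 2 and a hypothetical suite name), the Onboarding Assistant can profile a Batch and propose Expectations for you:
# Profile the Batch and collect the Expectations the assistant proposes.
assistant_result = context.assistants.onboarding.run(batch_request=request)
suite = assistant_result.get_expectation_suite(expectation_suite_name="assistant_suite")
print(len(suite.expectations))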
3.3: Save the Expectation Suite
When you have run all of the Expectations you want for this dataset, you can call validator.save_expectation_suite() to save the Expectation Suite (all of the unique Expectation Configurations from each run of validator.expect_*) for later use in a Checkpoint.
validator.save_expectation_suite(discard_failed_expectations=False)
Part 4: Validate Data
4.1: Create and run a Checkpoint
To validate data and run post-validation Actions (a Python class with a run method that takes a Validation Result and does something with it), you create and store a Checkpoint (the primary means for validating data in a production deployment of Great Expectations) for your Batch.
Checkpoints can be preconfigured with a Batch Request and Expectation Suite, or they can take them in as parameters at runtime. They can also execute numerous Actions based on the Validation Results that are returned when the Checkpoint runs.
To preconfigure a Checkpoint with a Batch Request and Expectation Suite, see Manage Checkpoints.
4.1.1 Create a Checkpoint
Run the following code to create the Checkpoint:
checkpoint = context.add_or_update_checkpoint(
name="my_checkpoint",
validations=[{"batch_request": request, "expectation_suite_name": "test_suite"}],
)
The Checkpoint you created is named my_checkpoint. It includes a Validation using the BatchRequest you created earlier and the Expectation Suite test_suite, which contains two Expectations.
4.1.2 Run the Checkpoint
Run the following code to run the Checkpoint:
checkpoint_result = checkpoint.run()
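Optionally, you can inspect the result object before building Data Docs; success is True only when every Expectation in the Suite passed:
# Overall pass/fail for the Checkpoint run.
print(checkpoint_result.success)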
4.2: Build and view Data Docs
The Checkpoint contains an UpdateDataDocsAction, which renders the Data Docs (human-readable documentation generated from Great Expectations metadata detailing Expectations, Validation Results, and more) from the generated Validation Results. The Data Docs store contains a new entry for the rendered Validation Result.
For more information on Actions that Checkpoints can perform and how to add them, see Configure Actions.
Run the following code to view the new entry for the rendered Validation Result:
context.open_data_docs()
Congratulations!
🚀🚀 Congratulations! 🚀🚀 You have successfully navigated the entire workflow for using Great Expectations with Amazon Web Services S3 and Pandas, from installing Great Expectations through Validating your Data.