How to configure an Expectation Store to use Amazon S3
By default, newly Profiled Expectations are stored as Expectation Suites in JSON format in the expectations/ subdirectory of your great_expectations/ folder. This guide will help you configure Great Expectations to store them in an Amazon S3 bucket.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial.
- A working installation of Great Expectations.
- Configured a Data Context.
- Configured an Expectation Suite.
- The ability to install boto3 in your local environment.
- Identified the S3 bucket and prefix where Expectations will be stored.
Steps
1. Install boto3 with pip
Python interacts with AWS through the boto3 library. Great Expectations uses this library in the background when working with AWS. Therefore, although you will not need to use boto3 directly, you will need to have it installed in your virtual environment.
You can do this with the pip command:
python -m pip install boto3
or
python3 -m pip install boto3
For more detailed instructions on how to set up boto3 with AWS, and information on how you can use boto3 from within Python, please refer to boto3's documentation site.
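As a quick sanity check after installing, the following minimal sketch reports whether boto3 is importable in the current environment. The helper name is_installed is hypothetical and not part of Great Expectations or boto3; it relies only on the standard library.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    # find_spec returns None when the module cannot be located.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("boto3"):
        print("boto3 is installed")
    else:
        print("boto3 is missing; run: python -m pip install boto3")
```

Run it with the same interpreter that runs Great Expectations, so that the check applies to the right virtual environment.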
2. Verify your AWS credentials are properly configured
If you have installed the AWS CLI, you can verify that your AWS credentials are properly configured by running the command:
aws sts get-caller-identity
If your credentials are properly configured, this will output your UserId, Account and Arn. If your credentials are not configured correctly, this will throw an error.
If an error is thrown, or if you were unable to use the AWS CLI to verify your credentials configuration, you can find additional guidance on configuring your AWS credentials by referencing Amazon's documentation on configuring the AWS CLI.
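If the AWS CLI is not available, a similar check can be sketched in Python with boto3's STS client, whose get_caller_identity call returns the same three fields. The function names describe_identity and check_credentials are hypothetical helpers written for this illustration, and boto3 is imported lazily so the formatting helper can be used without it.

```python
def describe_identity(sts_client) -> str:
    """Summarize the identity reported by an STS client."""
    identity = sts_client.get_caller_identity()
    return "UserId={} Account={} Arn={}".format(
        identity["UserId"], identity["Account"], identity["Arn"]
    )


def check_credentials() -> None:
    """Print the caller identity, or the reason credentials are unusable."""
    # Imported here so the helper above works even where boto3 is absent.
    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    try:
        print(describe_identity(boto3.client("sts")))
    except (BotoCoreError, ClientError) as error:
        print(f"AWS credentials are not usable: {error}")
```

Calling check_credentials() prints one summary line when credentials resolve, and an error message otherwise.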
3. Identify your Data Context Expectations Store
You can find your Expectation Store's configuration within your Data Context.
In your great_expectations.yml file, look for the following lines:
expectations_store_name: expectations_store
stores:
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/
This configuration tells Great Expectations to look for Expectations in a store called expectations_store. The base_directory for expectations_store is set to expectations/ by default.
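The relationship between expectations_store_name and the stores section can be illustrated with a plain dictionary that mirrors the YAML above. This is only a sketch of the lookup the guide describes, not Great Expectations' internal code; DEFAULT_CONFIG and active_expectations_store are hypothetical names.

```python
# A dictionary with the same shape as the default YAML configuration.
DEFAULT_CONFIG = {
    "expectations_store_name": "expectations_store",
    "stores": {
        "expectations_store": {
            "class_name": "ExpectationsStore",
            "store_backend": {
                "class_name": "TupleFilesystemStoreBackend",
                "base_directory": "expectations/",
            },
        }
    },
}


def active_expectations_store(config: dict) -> dict:
    """Return the store entry named by expectations_store_name."""
    name = config["expectations_store_name"]
    return config["stores"][name]


if __name__ == "__main__":
    backend = active_expectations_store(DEFAULT_CONFIG)["store_backend"]
    print(backend["class_name"], backend["base_directory"])
```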
4. Update your configuration file to include a new Store for Expectations on S3
You can manually add an Expectations Store by adding the configuration shown below into the stores section of your great_expectations.yml file.
stores:
  expectations_S3_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: '<your_s3_bucket_name>'
      prefix: '<your_s3_bucket_folder_name>'
To make the store work with S3, you will need to change the default store_backend settings, as has been done in the above example. Set class_name to TupleS3StoreBackend, bucket to the name of your S3 bucket, and prefix to the folder in your S3 bucket where Expectation files will be located.
Additional options are available for a more fine-grained customization of the TupleS3StoreBackend.
class_name: ExpectationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'
  boto3_options:
    endpoint_url: ${S3_ENDPOINT} # Uses the S3_ENDPOINT environment variable to determine which endpoint to use.
    region_name: '<your_aws_region_name>'
For the above example, please also note that the new Store's name is set to expectations_S3_store. This value can be any name you like, as long as you also update the value of the expectations_store_name key to match the new Store's name.
expectations_store_name: expectations_S3_store
This update to the value of the expectations_store_name key will tell Great Expectations to use the new Store for Expectations.
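The two changes in this step (adding the S3 store entry and pointing expectations_store_name at it) can be sketched as one operation on a dictionary shaped like great_expectations.yml. The function use_s3_expectations_store and the bucket and prefix values in the usage example are hypothetical; the keys and class names are the ones shown in the guide.

```python
import copy


def use_s3_expectations_store(config, bucket, prefix,
                              store_name="expectations_S3_store"):
    """Return a copy of config with an S3 Expectations Store made active."""
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    updated.setdefault("stores", {})[store_name] = {
        "class_name": "ExpectationsStore",
        "store_backend": {
            "class_name": "TupleS3StoreBackend",
            "bucket": bucket,
            "prefix": prefix,
        },
    }
    # Both edits must agree: the active name has to match the new store's key.
    updated["expectations_store_name"] = store_name
    return updated


if __name__ == "__main__":
    local = {"expectations_store_name": "expectations_store", "stores": {}}
    print(use_s3_expectations_store(local, "my-bucket", "expectations"))
```

Keeping both edits in one function makes it harder to add the store and forget to activate it, which would leave Great Expectations reading from the local filesystem.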
If you are also storing Validations in S3 or Data Docs in S3, please ensure that the prefix values are disjoint and one is not a substring of the other.
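The substring rule above is easy to check mechanically. The sketch below compares every pair of prefixes and reports the pairs where one contains the other; conflicting_prefixes is a hypothetical helper, and it assumes all the prefixes belong to the same bucket.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def conflicting_prefixes(prefixes: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return pairs of store names whose prefixes overlap as substrings."""
    conflicts = []
    for (name_a, prefix_a), (name_b, prefix_b) in combinations(prefixes.items(), 2):
        # A conflict exists if either prefix appears anywhere inside the other.
        if prefix_a in prefix_b or prefix_b in prefix_a:
            conflicts.append((name_a, name_b))
    return conflicts


if __name__ == "__main__":
    print(conflicting_prefixes({
        "expectations": "ge/expectations",
        "validations": "ge/validations",
    }))
```

For example, the prefixes ge/expectations and ge/validations pass the check, while ge and ge/validations do not, because the first is a substring of the second.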
5. Confirm that the new Expectations Store has been added
You can verify that your Stores are properly configured by running the command:
great_expectations store list
This will list the currently configured Stores that Great Expectations has access to. If you added a new S3 Expectations Store, the output should include the following ExpectationsStore entry:
- name: expectations_S3_store
  class_name: ExpectationsStore
  store_backend:
    class_name: TupleS3StoreBackend
    bucket: '<your_s3_bucket_name>'
    prefix: '<your_s3_bucket_folder_name>'
Notice the output contains only one Expectation Store: your configuration contains the original expectations_store on the local filesystem and the expectations_S3_store we just configured, but the great_expectations store list command only lists your active stores. For your Expectation Store, this is the one that you set as the value of the expectations_store_name variable in the configuration file: expectations_S3_store.
6. Copy existing Expectation JSON files to the S3 bucket (This step is optional)
If you are converting an existing local Great Expectations deployment to one that works in AWS you may already have Expectations saved that you wish to keep and transfer to your S3 bucket.
One way to copy Expectations into Amazon S3 is by using the aws s3 sync command. As mentioned earlier, the base_directory is set to expectations/ by default.
aws s3 sync '<base_directory>' s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'
In the example below, two Expectations, exp1 and exp2, are copied to Amazon S3. This results in the following output:
upload: ./exp1.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/exp1.json
upload: ./exp2.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/exp2.json
If you have Expectations to copy into S3, your output should look similar.
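Before running the sync, it can help to preview which files would be uploaded and where they would land. The sketch below lists local JSON files and the S3 destinations that match the layout in the example output; plan_uploads is a hypothetical helper, it does not contact AWS, and it only considers .json files (an assumption, since aws s3 sync copies every file in the directory).

```python
from pathlib import Path
from typing import List, Tuple


def plan_uploads(base_directory: str, bucket: str, prefix: str) -> List[Tuple[str, str]]:
    """Pair each local Expectation JSON file with its S3 destination."""
    base = Path(base_directory)
    plan = []
    for path in sorted(base.rglob("*.json")):
        # Keep the path relative to base_directory so subfolders are preserved.
        relative = path.relative_to(base).as_posix()
        destination = f"s3://{bucket}/{prefix.strip('/')}/{relative}"
        plan.append((str(path), destination))
    return plan


if __name__ == "__main__":
    for source, destination in plan_uploads("expectations/", "my-bucket", "expectations"):
        print(f"upload: {source} to {destination}")
```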
7. Confirm that Expectations can be accessed from Amazon S3 by running great_expectations suite list
If you followed the optional step to copy your existing Expectations to the S3 bucket, you can confirm that Great Expectations can find them by running the command:
great_expectations suite list
Your output should include the Expectations you copied to Amazon S3. In the example, these Expectations were stored in Expectation Suites named exp1 and exp2. This would result in the following output from the above command:
2 Expectation Suites found:
- exp1
- exp2
Your output should look similar, with the names of your Expectation Suites replacing the names from the example.
If you did not copy Expectations to the new Store, you will see a message saying no Expectations were found.
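If the suite list is empty when you expected results, comparing it against the object keys in the bucket can help locate the problem. The sketch below derives suite names from a list of keys, assuming each suite is stored as <prefix>/<suite_name>.json as in the upload example above. The helper suite_names_from_keys is hypothetical; the keys themselves could come from, for instance, the "Key" fields returned by boto3's list_objects_v2 call.

```python
from typing import Iterable, List


def suite_names_from_keys(keys: Iterable[str], prefix: str) -> List[str]:
    """Extract suite names from S3 object keys found under prefix."""
    root = prefix.strip("/") + "/"
    suffix = ".json"
    names = []
    for key in keys:
        # Ignore objects outside the prefix and objects that are not JSON.
        if key.startswith(root) and key.endswith(suffix):
            names.append(key[len(root):-len(suffix)])
    return sorted(names)


if __name__ == "__main__":
    keys = ["expectations/exp1.json", "expectations/exp2.json"]
    for name in suite_names_from_keys(keys, "expectations"):
        print(f" - {name}")
```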