How to configure a Validation Result Store in Amazon S3
By default, Validation Results (generated when data is Validated against an Expectation or Expectation Suite) are stored in JSON format in the uncommitted/validations/ subdirectory of your great_expectations/ folder. Since Validation Results may include examples of data (which could be sensitive or regulated), they should not be committed to a source control system. The following steps will help you configure a new storage location for Validation Results in Amazon S3.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- Configured a Data Context.
- Configured an Expectation Suite.
- Configured a Checkpoint.
- The ability to install boto3 in your local environment.
- Identified the S3 bucket and prefix where Validation Results will be stored.
Steps
1. Install boto3 to your local environment
Python interacts with AWS through the boto3 library. Great Expectations makes use of this library in the background when working with AWS. Therefore, although you will not need to use boto3 directly, you will need to have it installed in your virtual environment.
You can do this with the pip command:
python -m pip install boto3
or
python3 -m pip install boto3
For more detailed instructions on how to set up boto3 with AWS, and information on how you can use boto3 from within Python, please refer to boto3's documentation site.
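To confirm that the installation is visible to the interpreter you will run Great Expectations with, a quick check along the following lines may help. This is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and not part of Great Expectations or boto3.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("boto3"):
        print("boto3 is available in this environment")
    else:
        print("boto3 is missing; install it with: python -m pip install boto3")
```

Run the check with the same python (or python3) executable that you used for the pip command, so that both refer to the same virtual environment.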
2. Verify that your AWS credentials are properly configured
If you have installed the AWS CLI, you can verify that your AWS credentials are properly configured by running the command:
aws sts get-caller-identity
If your credentials are properly configured, this will output your UserId, Account, and Arn. If your credentials are not configured correctly, this will throw an error.
If an error is thrown, or if you were unable to use the AWS CLI to verify your credentials configuration, you can find additional guidance on configuring your AWS credentials by referencing Amazon's documentation on configuring the AWS CLI.
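If the AWS CLI is not available, the same check can be approximated from Python through boto3's STS client, whose get_caller_identity call returns the same three fields. The sketch below is illustrative only: the helper name describe_caller is hypothetical, and the client is passed in as a parameter so that the function can be tried without network access.

```python
def describe_caller(sts_client) -> dict:
    """Return the UserId, Account, and Arn that STS reports for the active credentials."""
    identity = sts_client.get_caller_identity()
    # The response also carries ResponseMetadata; keep only the identifying fields.
    return {key: identity[key] for key in ("UserId", "Account", "Arn")}


# Usage with real credentials (requires boto3 and network access):
#   import boto3
#   print(describe_caller(boto3.client("sts")))
```

If the credentials are missing or invalid, the call to get_caller_identity raises an exception instead of returning a response.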
3. Identify your Data Context Validation Results Store
The configuration for your Validation Results Store (a connector that stores and retrieves information about objects generated when data is Validated against an Expectation Suite) is located in your Data Context (the primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components).
Look for the following section in your Data Context's great_expectations.yml file:
validations_store_name: validations_store

stores:
  validations_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/
This configuration tells Great Expectations to look for Validation Results in a Store called validations_store. It also creates a ValidationsStore called validations_store that is backed by a filesystem and will store Validation Results under the base_directory uncommitted/validations (the default).
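The relationship between the validations_store_name key and the entries under stores can be illustrated with a short lookup. This is a sketch, not Great Expectations' own code: the function name active_validations_store is hypothetical, and the configuration is written as the plain dictionary that a YAML parser would be expected to produce from the section shown above.

```python
def active_validations_store(config: dict) -> dict:
    """Return the Store entry that validations_store_name points to."""
    store_name = config["validations_store_name"]
    try:
        return config["stores"][store_name]
    except KeyError:
        raise KeyError(
            f"validations_store_name is {store_name!r}, "
            "but no entry with that name exists under stores"
        ) from None


# The default configuration, expressed as a dictionary.
default_config = {
    "validations_store_name": "validations_store",
    "stores": {
        "validations_store": {
            "class_name": "ValidationsStore",
            "store_backend": {
                "class_name": "TupleFilesystemStoreBackend",
                "base_directory": "uncommitted/validations/",
            },
        }
    },
}

backend = active_validations_store(default_config)["store_backend"]
print(backend["class_name"])  # TupleFilesystemStoreBackend
```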
4. Update your configuration file to include a new Store for Validation Results on S3
You can manually add a Validation Results Store by adding the configuration below to the stores section of your great_expectations.yml file:
stores:
  validations_S3_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: '<your_s3_bucket_name>'
      prefix: '<your_s3_bucket_folder_name>'
To make the Store work with S3, you will need to make some changes from the default store_backend settings, as has been done in the above example. The class_name will be set to TupleS3StoreBackend, bucket will be set to the name of your S3 bucket, and prefix will be set to the folder in your S3 bucket where Validation Results will be located.
For the example above, note that the new Store's name is set to validations_S3_store. This can be any name you like, as long as you also update the value of the validations_store_name key to match the new Store's name.
validations_store_name: validations_S3_store
This update to the value of the validations_store_name key will tell Great Expectations to use the new Store for Validation Results.
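The two changes described in this step (adding the Store entry and pointing validations_store_name at it) can be summarized as a single transformation of the parsed configuration. The following is a sketch under the assumption that great_expectations.yml has already been loaded into a dictionary; the helper name with_s3_validations_store and the example bucket and prefix values are hypothetical.

```python
import copy


def with_s3_validations_store(config, bucket, prefix, store_name="validations_S3_store"):
    """Return a copy of config with an S3-backed Validation Results Store made active."""
    updated = copy.deepcopy(config)  # leave the caller's dictionary untouched
    updated.setdefault("stores", {})[store_name] = {
        "class_name": "ValidationsStore",
        "store_backend": {
            "class_name": "TupleS3StoreBackend",
            "bucket": bucket,
            "prefix": prefix,
        },
    }
    # Without this line the new Store exists but is never used.
    updated["validations_store_name"] = store_name
    return updated


original = {"validations_store_name": "validations_store", "stores": {}}
updated = with_s3_validations_store(original, "example-bucket", "validations")
print(updated["validations_store_name"])  # validations_S3_store
```

Keeping the two changes in one function makes it harder to add the Store while forgetting to update validations_store_name, which is the mismatch this step warns about.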
If you are also storing Expectations (verifiable assertions about data) in S3 (see How to configure an Expectation store to use Amazon S3), or Data Docs in S3 (see How to host and share Data Docs on Amazon S3), then please ensure that the prefix values are disjoint and one is not a substring of the other.
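The substring rule for prefixes can be checked mechanically before you commit a configuration. The sketch below compares every pair of prefixes; the function name conflicting_prefixes and the example prefix values are hypothetical.

```python
from itertools import combinations


def conflicting_prefixes(prefixes: dict) -> list:
    """Return pairs of Store names where one prefix is a substring of the other."""
    conflicts = []
    for (name_a, prefix_a), (name_b, prefix_b) in combinations(prefixes.items(), 2):
        if prefix_a in prefix_b or prefix_b in prefix_a:
            conflicts.append((name_a, name_b))
    return conflicts


example = {
    "expectations": "ge/expectations",
    "validations": "ge/validations",
    "data_docs": "ge",  # overlaps with both prefixes above
}
print(conflicting_prefixes(example))
# [('expectations', 'data_docs'), ('validations', 'data_docs')]
```

An empty list means that no prefix contains another. Note that an empty prefix is a substring of every other prefix, so it is always reported as a conflict.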
5. Confirm that the new Validation Results Store has been properly added
You can verify your active Stores are configured correctly by running the terminal command:
great_expectations store list
This will list the currently configured Stores that Great Expectations has access to. If you added a new S3 Validation Results Store, the output should include the following ValidationsStore entry:
- name: validations_S3_store
  class_name: ValidationsStore
  store_backend:
    class_name: TupleS3StoreBackend
    bucket: '<your_s3_bucket_name>'
    prefix: '<your_s3_bucket_folder_name>'
Please note that the great_expectations store list command will specifically list your active Stores, which are the ones specified by expectations_store_name, validations_store_name, evaluation_parameter_store_name, and checkpoint_store_name in the file great_expectations.yml. These are the Stores that your Data Context will use by default.
To make Great Expectations look for Validation Results on the S3 bucket, you must set the validations_store_name variable to the name of your S3 Validations Store, which in our example is validations_S3_store.
Additional options are available for a more fine-grained customization of the TupleS3StoreBackend.
class_name: ValidationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'
  boto3_options:
    endpoint_url: ${S3_ENDPOINT} # Uses the S3_ENDPOINT environment variable to determine which endpoint to use.
    region_name: '<your_aws_region_name>'
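To illustrate what the ${S3_ENDPOINT} placeholder means, the sketch below substitutes environment-style variables into a dictionary of options using the standard library's string.Template. This only demonstrates the idea of placeholder substitution and is not Great Expectations' own implementation; the function name resolve_placeholders, the endpoint URL, and the region value are hypothetical.

```python
from string import Template


def resolve_placeholders(value, env):
    """Recursively replace ${NAME} placeholders in strings using the mapping env."""
    if isinstance(value, dict):
        return {key: resolve_placeholders(item, env) for key, item in value.items()}
    if isinstance(value, str):
        # substitute() raises KeyError when a referenced variable is not defined.
        return Template(value).substitute(env)
    return value


boto3_options = {"endpoint_url": "${S3_ENDPOINT}", "region_name": "us-east-1"}
env = {"S3_ENDPOINT": "http://localhost:9000"}  # stand-in for os.environ
print(resolve_placeholders(boto3_options, env))
# {'endpoint_url': 'http://localhost:9000', 'region_name': 'us-east-1'}
```

In practice you would pass os.environ as the mapping. A missing variable then surfaces as a KeyError rather than as an unresolved placeholder.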
6. Copy existing Validation Results to the S3 bucket (This step is optional)
If you are converting an existing local Great Expectations deployment to one that works in AWS, you may already have Validation Results saved that you wish to keep and transfer to your S3 bucket.
One way to copy Validation Results into Amazon S3 is by using the aws s3 sync command. As mentioned earlier, the base_directory is set to uncommitted/validations/ by default.
aws s3 sync '<base_directory>' s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'
In the example below, two Validation Results, Validation1 and Validation2, are copied to Amazon S3. This results in the following output:
upload: uncommitted/validations/val1/val1.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/val1.json
upload: uncommitted/validations/val2/val2.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/val2.json
If you have Validation Results to copy into S3, your output should look similar.
7. Confirm that the Validation Results Store has been correctly configured
Run a Checkpoint to store results in the new Validation Results Store on S3, then visualize the results by rebuilding Data Docs.
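As an additional check, you can list the objects under your prefix and confirm that new Validation Results appear after the Checkpoint runs. The sketch below uses boto3's list_objects_v2 paginator; the helper name list_validation_keys is hypothetical, and the client is passed in as a parameter so that the function can be tried without AWS access.

```python
def list_validation_keys(s3_client, bucket: str, prefix: str) -> list:
    """Return the keys of all objects stored under prefix in bucket."""
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page has no "Contents" entry when nothing matches the prefix.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


# Usage with real credentials (requires boto3 and network access):
#   import boto3
#   keys = list_validation_keys(
#       boto3.client("s3"), "<your_s3_bucket_name>", "<your_s3_bucket_folder_name>"
#   )
#   print(len(keys), "objects found")
```

An empty result after a successful Checkpoint run suggests that validations_store_name still points at the filesystem Store, or that the bucket or prefix differs from the one configured.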
🚀🚀 Congratulations! 🚀🚀
You have configured your Validation Results Store to exist in your S3 bucket!