How to host and share Data Docs on Amazon S3
This guide will explain how to host and share Data Docs (human-readable documentation generated from Great Expectations metadata detailing Expectations, Validation Results, etc.) on AWS S3.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- Set up a working deployment of Great Expectations
- Set up the AWS Command Line Interface
Steps
1. Create an S3 bucket
You can create an S3 bucket configured for a specific location using the AWS CLI. Make sure you modify the bucket name and region for your situation.
> aws s3api create-bucket --bucket data-docs.my_org --region us-east-1
{
"Location": "/data-docs.my_org"
}
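The command above targets us-east-1. As a minimal sketch, the Python helper below assembles the same AWS CLI arguments for an arbitrary region. The function name create_bucket_command is hypothetical and only for illustration. The extra --create-bucket-configuration argument reflects the AWS CLI requirement that regions other than us-east-1 specify a LocationConstraint.

```python
from __future__ import annotations

import shlex


def create_bucket_command(bucket: str, region: str) -> list[str]:
    """Build the argument list for `aws s3api create-bucket` (illustrative helper)."""
    command = [
        "aws", "s3api", "create-bucket",
        "--bucket", bucket,
        "--region", region,
    ]
    # Outside us-east-1, S3 expects an explicit location constraint.
    if region != "us-east-1":
        command += [
            "--create-bucket-configuration",
            f"LocationConstraint={region}",
        ]
    return command


if __name__ == "__main__":
    # Print a shell-ready command line for review before running it.
    print(shlex.join(create_bucket_command("data-docs.my_org", "eu-west-1")))
```

Printing the command rather than executing it lets you review the bucket name and region before anything is created.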
2. Configure your bucket policy to enable appropriate access
The example policy below enforces IP-based access. Modify the bucket name and IP addresses for your situation. After you have customized the example policy to suit your situation, save it to a file called ip-policy.json in your local directory.
Your policy should provide access only to appropriate users. Data Docs sites can include critical information about raw data and should generally not be publicly accessible.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Allow only based on source IP",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": [
        "arn:aws:s3:::data-docs.my_org",
        "arn:aws:s3:::data-docs.my_org/*"
      ],
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": [
            "192.168.0.1/32",
            "2001:db8:1234:1234::/64"
          ]
        }
      }
    }
  ]
}
Because Data Docs include multiple generated pages, it is important to include the arn:aws:s3:::{your_data_docs_site}/* path in the Resource list along with the arn:aws:s3:::{your_data_docs_site} path that permits access to your Data Docs' front page.
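If you maintain several buckets or IP ranges, generating the policy can be less error-prone than editing JSON by hand. The sketch below is one possible approach using only the Python standard library. The function name build_ip_policy is hypothetical, and the CIDR check is an added safeguard rather than anything AWS or Great Expectations requires.

```python
import ipaddress
import json


def build_ip_policy(bucket, cidr_blocks):
    """Return an IP-restricted bucket policy as a dict (illustrative helper)."""
    # Reject malformed CIDR blocks before they reach the policy file.
    for block in cidr_blocks:
        ipaddress.ip_network(block)  # raises ValueError if invalid
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Allow only based on source IP",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                # Both the bucket path and the /* path, as described above.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"IpAddress": {"aws:SourceIp": list(cidr_blocks)}},
            }
        ],
    }


if __name__ == "__main__":
    policy = build_ip_policy(
        "data-docs.my_org",
        ["192.168.0.1/32", "2001:db8:1234:1234::/64"],
    )
    print(json.dumps(policy, indent=2))
```

Redirecting the printed output to ip-policy.json produces a file equivalent to the example policy.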
Amazon Web Services' S3 buckets are a third-party utility. For more (and the most up-to-date) information on configuring AWS S3 bucket policies, please refer to Amazon's guide on using bucket policies.
3. Apply the policy
Run the following AWS CLI command to apply the policy:
> aws s3api put-bucket-policy --bucket data-docs.my_org --policy file://ip-policy.json
4. Add a new S3 site to the data_docs_sites section of your great_expectations.yml
The example below shows the default local_site configuration that you will find in your great_expectations.yml file, followed by the s3_site configuration that you will need to add. You may optionally remove the default local_site configuration completely and replace it with the new s3_site configuration if you would only like to maintain a single S3 Data Docs site.
data_docs_sites:
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
  s3_site:  # this is a user-selected name - you may select your own
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: data-docs.my_org  # UPDATE the bucket name here to match the bucket you configured above.
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
      show_cta_footer: true
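To make the nesting of the new site explicit, the following sketch expresses the same configuration as a plain Python dictionary. The helper add_s3_site and its keep_existing parameter are hypothetical names used for illustration; they are not part of the Great Expectations API, and the guide itself only requires editing great_expectations.yml.

```python
import copy


def add_s3_site(data_docs_sites, bucket, site_name="s3_site", keep_existing=True):
    """Return a copy of data_docs_sites with an S3 site added (illustrative helper)."""
    # Keep sites such as local_site, or start fresh for a single S3 site.
    sites = copy.deepcopy(data_docs_sites) if keep_existing else {}
    sites[site_name] = {
        "class_name": "SiteBuilder",
        "store_backend": {
            "class_name": "TupleS3StoreBackend",
            "bucket": bucket,
        },
        "site_index_builder": {
            "class_name": "DefaultSiteIndexBuilder",
            "show_cta_footer": True,
        },
    }
    return sites


if __name__ == "__main__":
    existing = {"local_site": {"class_name": "SiteBuilder"}}
    print(sorted(add_s3_site(existing, "data-docs.my_org")))
```

Passing keep_existing=False corresponds to removing local_site and maintaining only the S3 site.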
5. Test that your configuration is correct by building the site
Use the following CLI command to build and open your newly configured S3 Data Docs site:
> great_expectations docs build --site-name s3_site
You will be presented with the following prompt:
The following Data Docs sites will be built:
- s3_site: https://s3.amazonaws.com/data-docs.my_org/index.html
Would you like to proceed? [Y/n]:
Signify that you would like to proceed by pressing the return key or entering Y. Once you have done so, you will be presented with the following messages:
Building Data Docs...
Done building Data Docs
If successful, the CLI will also open your newly built S3 Data Docs site and provide the URL, which you can share as desired. Note that the URL will only be viewable by users with IP addresses appearing in the above policy.
You may want to use the -y/--yes/--assume-yes flag with the great_expectations docs build --site-name s3_site command. This flag causes the CLI to skip the confirmation dialog, which can be useful for non-interactive environments.
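In a scheduled job or CI pipeline, the build could be triggered from Python. The sketch below is one way to do that, assuming the great_expectations CLI is on the PATH; the function names docs_build_command and build_docs are hypothetical.

```python
import subprocess


def docs_build_command(site_name, assume_yes=False):
    """Build the argument list for `great_expectations docs build`."""
    command = ["great_expectations", "docs", "build", "--site-name", site_name]
    if assume_yes:
        # Skip the confirmation prompt in non-interactive environments.
        command.append("--assume-yes")
    return command


def build_docs(site_name):
    """Run the build without prompting; requires the CLI to be installed."""
    # check=True raises CalledProcessError if the CLI exits with an error.
    subprocess.run(docs_build_command(site_name, assume_yes=True), check=True)


if __name__ == "__main__":
    print(" ".join(docs_build_command("s3_site", assume_yes=True)))
```

Calling build_docs("s3_site") would run the same command shown in this step with the confirmation dialog skipped.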
Additional notes
- Optionally, you may wish to update static hosting settings for your bucket to enable AWS to automatically serve your index.html file or a custom error file:
> aws s3 website s3://data-docs.my_org/ --index-document index.html
- If you wish to host a Data Docs site in a subfolder of an S3 bucket, add the prefix property to the configuration snippet in step 4, immediately after the bucket property.
- If you wish to host a Data Docs site through a private DNS, you can configure a base_public_path for the Data Docs Store (a connector to store and retrieve information pertaining to Data Docs). The following example will configure an S3 site with the base_public_path set to www.mydns.com. Data Docs will still be written to the configured location on S3 (for example https://s3.amazonaws.com/data-docs.my_org/docs/index.html), but you will be able to access the pages from your DNS (http://www.mydns.com/index.html in our example).
data_docs_sites:
  s3_site:  # this is a user-selected name - you may select your own
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: data-docs.my_org  # UPDATE the bucket name here to match the bucket you configured above.
      base_public_path: http://www.mydns.com
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
      show_cta_footer: true
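The notes on prefix and base_public_path change where the index page is reachable. The sketch below simply mirrors the example URLs given in this guide; it is not Great Expectations' own URL logic, and the function name index_url is hypothetical.

```python
def index_url(bucket, prefix=None, base_public_path=None):
    """Return the expected index.html URL for the examples in this guide."""
    if base_public_path:
        # Pages are served from your own DNS, as in the example above.
        return base_public_path.rstrip("/") + "/index.html"
    # Otherwise the page lives in the bucket, optionally under a prefix.
    key = "index.html" if not prefix else prefix.strip("/") + "/index.html"
    return f"https://s3.amazonaws.com/{bucket}/{key}"


if __name__ == "__main__":
    print(index_url("data-docs.my_org"))
    print(index_url("data-docs.my_org", prefix="docs"))
    print(index_url("data-docs.my_org", base_public_path="http://www.mydns.com"))
```

Treat the results as a quick sanity check against the URL that the CLI reports after a build, which remains the authoritative value.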