How to use Great Expectations with Amazon Web Services using S3 and Spark
Great Expectations can work within many frameworks. In this guide you will be shown a workflow for using Great Expectations with AWS and cloud storage. You will configure a local Great Expectations project to store Expectations, Validation Results, and Data Docs in Amazon S3 buckets. You will further configure Great Expectations to use Spark and access data stored in another Amazon S3 bucket.
This guide will demonstrate each of the steps necessary to go from installing a new instance of Great Expectations to Validating your data for the first time and viewing your Validation Results as Data Docs.
This guide assumes you have:
- Completed the Getting Started Tutorial
- Installed Python 3. (Great Expectations requires Python 3. For details on how to download and install Python on your platform, see python.org).
- Installed the AWS CLI. (For guidance on how install this, please see Amazon's documentation on how to install the AWS CLI)
- Configured your AWS credentials. (For guidance on doing this, please see Amazon's documentation on configuring the AWS CLI.)
- The ability to install Python packages (boto3 and great_expectations) with pip.
- Identified the S3 bucket and prefix where Expectations and Validation Results will be stored.
Steps
Part 1: Setup
1.1 Ensure that the AWS CLI is ready for use
1.1.1 Verify that the AWS CLI is installed
You can verify that the AWS CLI has been installed by running the command:
aws --version
If this command does not respond by informing you of the version information of the AWS CLI, you may need to install the AWS CLI or otherwise troubleshoot your current installation. For detailed guidance on how to do this, please refer to Amazon's documentation on how to install the AWS CLI.
1.1.2 Verify that your AWS credentials are properly configured
If you have installed the AWS CLI, you can verify that your AWS credentials are properly configured by running the command:
aws sts get-caller-identity
If your credentials are properly configured, this will output your UserId, Account, and Arn. If your credentials are not configured correctly, this will throw an error.
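As a rough illustration, successful output looks something like the following (the identifiers here are placeholders; yours will differ):
{
    "UserId": "AIDASAMPLEUSERID",
    "Account": "123456789012",
    "Arn": "arn:aws:iam::123456789012:user/your-user-name"
}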
If an error is thrown, or if you were unable to use the AWS CLI to verify your credentials configuration, you can find additional guidance on configuring your AWS credentials by referencing Amazon's documentation on configuring the AWS CLI.
1.2 Prepare a local installation of Great Expectations and necessary dependencies
1.2.1 Verify that your Python version meets requirements
First, check the version of Python that you have installed. As of this writing, Great Expectations supports versions 3.7 through 3.10 of Python.
You can check your version of Python by running:
python --version
If this command returns something other than a Python 3 version number (like Python 3.X.X), you may need to try this:
python3 --version
If you do not have Python 3 installed, please refer to python.org for the necessary downloads and guidance to perform the installation.
1.2.2 Create a virtual environment for your Great Expectations project
Once you have confirmed that Python 3 is installed locally, you can create a virtual environment with venv before installing your packages with pip.
Python Virtual Environments
Depending on whether you found that you needed to run python or python3 in the previous step, you will create your virtual environment by running either:
python -m venv my_venv
or
python3 -m venv my_venv
This command will create a new directory called my_venv where your virtual environment is located. To activate the virtual environment, run:
source my_venv/bin/activate
You can name your virtual environment anything you like; simply replace my_venv in the examples above with the name that you would like to use.
1.2.3 Ensure you have the latest version of pip
Once your virtual environment is activated, you should ensure that you have the latest version of pip installed.
Pip is a tool that is used to easily install Python packages. If you have Python 3 installed you can ensure that you have the latest version of pip by running either:
python -m ensurepip --upgrade
or
python3 -m ensurepip --upgrade
1.2.4 Install boto3
Python interacts with AWS through the boto3 library. Great Expectations makes use of this library in the background when working with AWS. Therefore, although you will not need to use boto3 directly, you will need to have it installed into your virtual environment.
You can do this with the pip command:
python -m pip install boto3
or
python3 -m pip install boto3
For more detailed instructions on how to set up boto3 with AWS, and information on how you can use boto3 from within Python, please reference boto3's documentation site.
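As a quick, optional sanity check, you can confirm that boto3 can reach your bucket. This is a minimal sketch, assuming your credentials are already configured and using a placeholder bucket name:
import boto3

# Create an S3 client using the credentials configured for the AWS CLI.
s3 = boto3.client("s3")

# List a few objects to confirm the bucket is reachable (placeholder bucket name).
response = s3.list_objects_v2(Bucket="<your_s3_bucket_name>", MaxKeys=5)
print([obj["Key"] for obj in response.get("Contents", [])])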
1.2.5 Install Spark dependencies for S3
Spark possesses a few dependencies that need to be installed before it can be used with AWS. You will need to install the aws-java-sdk-bundle and hadoop-aws files corresponding to your version of pySpark, and update your Spark configuration accordingly. You can find the .jar files you need to install in the following MVN repositories:
- hadoop-aws jar that matches your Spark version
- aws-java-sdk-bundle jar that matches your Spark version
Once the dependencies are installed, you will need to update your Spark configuration from within Python. First, import these necessary modules:
import pyspark
from pyspark import SparkContext
Next, update the pyspark.SparkConf to match the dependency packages you downloaded. In this example, we are using version 3.3.1 of hadoop-aws, but you will want to enter the version that corresponds to your installed dependency.
conf = pyspark.SparkConf()
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.1')
Finally, you will need to add your AWS credentials to the SparkContext.
sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', '<your AWS access key>')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', '<your AWS secret key>')
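At this point you can optionally verify that Spark can read from S3. This is a minimal sketch, assuming a CSV file exists at a placeholder s3a:// path in your bucket:
from pyspark.sql import SparkSession

# Build a SparkSession on top of the SparkContext configured above.
spark = SparkSession(sc)

# Read a CSV from S3 via the s3a filesystem (placeholder bucket and key).
df = spark.read.csv("s3a://<your_s3_bucket_name>/<path_to_a_csv>", header=True)
df.show(5)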
1.2.6 Install Great Expectations
You can use pip to install Great Expectations by running the appropriate pip command below:
python -m pip install great_expectations
or
python3 -m pip install great_expectations
1.2.7 Verify that Great Expectations installed successfully
You can confirm that installation worked by running:
great_expectations --version
This should return something like:
great_expectations, version 0.15.49
1.3 Create your Data Context
The simplest way to create a new Data Context (the primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components) is by using Great Expectations' CLI.
From the directory where you want to deploy Great Expectations run the following command:
great_expectations init
You should be presented with this output and prompt:
Using v3 (Batch Request) API
___ _ ___ _ _ _
/ __|_ _ ___ __ _| |_ | __|_ ___ __ ___ __| |_ __ _| |_(_)___ _ _ ___
| (_ | '_/ -_) _` | _| | _|\ \ / '_ \/ -_) _| _/ _` | _| / _ \ ' \(_-<
\___|_| \___\__,_|\__| |___/_\_\ .__/\___\__|\__\__,_|\__|_\___/_||_/__/
|_|
~ Always know what to expect from your data ~
Let's create a new Data Context to hold your project configuration.
Great Expectations will create a new directory with the following structure:
great_expectations
|-- great_expectations.yml
|-- expectations
|-- checkpoints
|-- plugins
|-- .gitignore
|-- uncommitted
    |-- config_variables.yml
    |-- data_docs
    |-- validations
OK to proceed? [Y/n]:
When you see the prompt to proceed, enter Y or simply press the enter key to continue. Great Expectations will then build out the directory structure and configuration files it needs for you to proceed.
1.4 Configure your Expectations Store on Amazon S3
1.4.1 Identify your Data Context Expectations Store
You can find your Expectation Store's configuration (an Expectation Store is a connector to store and retrieve information about collections of verifiable assertions about data) within your Data Context.
In your great_expectations.yml file, look for the following lines:
expectations_store_name: expectations_store
stores:
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/
This configuration tells Great Expectations to look for Expectations in a Store called expectations_store. The base_directory for expectations_store is set to expectations/ by default.
1.4.2 Update your configuration file to include a new Store for Expectations on Amazon S3
You can manually add an Expectations Store by adding the configuration shown below into the stores section of your great_expectations.yml file.
stores:
  expectations_S3_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: '<your_s3_bucket_name>'
      prefix: '<your_s3_bucket_folder_name>'
To make the Store work with S3, you will need to make some changes from the default store_backend settings, as has been done in the above example. The class_name should be set to TupleS3StoreBackend, bucket should be set to the address of your S3 bucket, and prefix should be set to the folder in your S3 bucket where Expectation files will be located.
Additional options are available for a more fine-grained customization of the TupleS3StoreBackend.
class_name: ExpectationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'
  boto3_options:
    endpoint_url: ${S3_ENDPOINT} # Uses the S3_ENDPOINT environment variable to determine which endpoint to use.
    region_name: '<your_aws_region_name>'
For the above example, please also note that the new Store's name is set to expectations_S3_store. This value can be any name you like, as long as you also update the value of the expectations_store_name key to match the new Store's name.
expectations_store_name: expectations_S3_store
This update to the value of the expectations_store_name key will tell Great Expectations to use the new Store for Expectations.
If you are also storing Validations in S3 or Data Docs in S3, please ensure that the prefix values are disjoint and that one is not a substring of the other.
1.4.3 Verify that the new Amazon S3 Expectations Store has been added successfully
You can verify that your Stores are properly configured by running the command:
great_expectations store list
This will list the currently configured Stores that Great Expectations has access to. If you added a new S3 Expectations Store, the output should include the following ExpectationsStore entry:
- name: expectations_S3_store
  class_name: ExpectationsStore
  store_backend:
    class_name: TupleS3StoreBackend
    bucket: '<your_s3_bucket_name>'
    prefix: '<your_s3_bucket_folder_name>'
Notice that the output contains only one Expectation Store: your configuration contains both the original expectations_store on the local filesystem and the expectations_S3_store we just configured, but the great_expectations store list command only lists your active Stores. For your Expectation Store, this is the one that you set as the value of the expectations_store_name variable in the configuration file: expectations_S3_store.
1.4.4 (Optional) Copy existing Expectation JSON files to the Amazon S3 bucket
If you are converting an existing local Great Expectations deployment to one that works in AWS you may already have Expectations saved that you wish to keep and transfer to your S3 bucket.
One way to copy Expectations into Amazon S3 is by using the aws s3 sync command. As mentioned earlier, the base_directory is set to expectations/ by default.
aws s3 sync '<base_directory>' s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'
In the example below, two Expectations, exp1 and exp2, are copied to Amazon S3. This results in the following output:
upload: ./exp1.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/exp1.json
upload: ./exp2.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/exp2.json
If you have Expectations to copy into S3, your output should look similar.
1.4.5 (Optional) Verify that copied Expectations can be accessed from Amazon S3
If you followed the optional step to copy your existing Expectations to the S3 bucket, you can confirm that Great Expectations can find them by running the command:
great_expectations suite list
Your output should include the Expectations you copied to Amazon S3. In the example, these Expectations were stored in Expectation Suites named exp1 and exp2, which would result in the following output from the above command:
2 Expectation Suites found:
- exp1
- exp2
Your output should look similar, with the names of your Expectation Suites replacing the names from the example.
If you did not copy Expectations to the new Store, you will see a message saying no Expectations were found.
1.5 Configure your Validation Results Store on Amazon S3
1.5.1 Identify your Data Context's Validation Results Store
You can find your Validation Results Store's configuration (a Validation Results Store is a connector to store and retrieve information about objects generated when data is Validated against an Expectation Suite) within your Data Context.
Look for the following section in your Data Context's great_expectations.yml file:
validations_store_name: validations_store
stores:
  validations_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/
This configuration tells Great Expectations to look for Validation Results in a Store called validations_store. It also creates a ValidationsStore called validations_store that is backed by the filesystem and will store Validation Results under the base_directory uncommitted/validations/ (the default).
1.5.2 Update your configuration file to include a new Store for Validation Results on Amazon S3
You can manually add a Validation Results Store by adding the configuration below to the stores section of your great_expectations.yml file:
stores:
  validations_S3_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: '<your_s3_bucket_name>'
      prefix: '<your_s3_bucket_folder_name>'
To make the Store work with S3, you will need to make some changes from the default store_backend settings, as has been done in the above example. The class_name should be set to TupleS3StoreBackend, bucket should be set to the address of your S3 bucket, and prefix should be set to the folder in your S3 bucket where Validation Results will be located.
For the example above, note that the new Store's name is set to validations_S3_store. This can be any name you like, as long as you also update the value of the validations_store_name key to match the new Store's name.
validations_store_name: validations_S3_store
This update to the value of the validations_store_name key will tell Great Expectations to use the new Store for Validation Results.
If you are also storing Expectations in S3 (see How to configure an Expectation store to use Amazon S3) or Data Docs in S3 (see How to host and share Data Docs on Amazon S3), please ensure that the prefix values are disjoint and that one is not a substring of the other.
1.5.3 Verify that the new Amazon S3 Validation Results Store has been added successfully
You can verify your active Stores are configured correctly by running the terminal command:
great_expectations store list
This will list the currently configured Stores that Great Expectations has access to. If you added a new S3 Validation Results Store, the output should include the following ValidationsStore entry:
- name: validations_S3_store
  class_name: ValidationsStore
  store_backend:
    class_name: TupleS3StoreBackend
    bucket: '<your_s3_bucket_name>'
    prefix: '<your_s3_bucket_folder_name>'
Please note that the great_expectations store list command will specifically list your active Stores, which are the ones specified by expectations_store_name, validations_store_name, evaluation_parameter_store_name, and checkpoint_store_name in the great_expectations.yml file. These are the Stores that your Data Context will use by default.
To make Great Expectations look for Validation Results in the S3 bucket, you must set the validations_store_name variable to the name of your S3 Validations Store, which in our example is validations_S3_store.
Additional options are available for a more fine-grained customization of the TupleS3StoreBackend.
class_name: ValidationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'
  boto3_options:
    endpoint_url: ${S3_ENDPOINT} # Uses the S3_ENDPOINT environment variable to determine which endpoint to use.
    region_name: '<your_aws_region_name>'
1.5.4 (Optional) Copy existing Validation results to the Amazon S3 bucket
If you are converting an existing local Great Expectations deployment to one that works in AWS you may already have Validation Results saved that you wish to keep and transfer to your S3 bucket.
You can copy Validation Results into Amazon S3 by using the aws s3 sync command. As mentioned earlier, the base_directory is set to uncommitted/validations/ by default.
aws s3 sync '<base_directory>' s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'
In the example below, two Validation Results, val1 and val2, are copied to Amazon S3. This results in the following output:
upload: uncommitted/validations/val1/val1.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/val1.json
upload: uncommitted/validations/val2/val2.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/val2.json
If you have Validation Results to copy into S3, your output should look similar.
1.6 Configure Data Docs for hosting and sharing from Amazon S3
1.6.1 Create an Amazon S3 bucket for your Data Docs
You can create an S3 bucket configured for a specific location using the AWS CLI. Make sure you modify the bucket name and region for your situation.
> aws s3api create-bucket --bucket data-docs.my_org --region us-east-1
{
"Location": "/data-docs.my_org"
}
1.6.2 Configure your bucket policy to enable appropriate access
The example policy below enforces IP-based access. Modify the bucket name and IP addresses for your situation. After you have customized the example policy to suit your situation, save it to a file called ip-policy.json in your local directory.
Your policy should provide access only to appropriate users. Data Docs sites can include critical information about raw data and should generally not be publicly accessible.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Allow only based on source IP",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": [
        "arn:aws:s3:::data-docs.my_org",
        "arn:aws:s3:::data-docs.my_org/*"
      ],
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": [
            "192.168.0.1/32",
            "2001:db8:1234:1234::/64"
          ]
        }
      }
    }
  ]
}
Because Data Docs include multiple generated pages, it is important to include the arn:aws:s3:::{your_data_docs_site}/* path in the Resource list along with the arn:aws:s3:::{your_data_docs_site} path that permits access to your Data Docs' front page.
Amazon Web Services' S3 buckets are a third-party utility. For more (and the most up-to-date) information on configuring AWS S3 bucket policies, please refer to Amazon's guide on using bucket policies.
1.6.3 Apply the access policy to your Data Docs' Amazon S3 bucket
Run the following AWS CLI command to apply the policy:
> aws s3api put-bucket-policy --bucket data-docs.my_org --policy file://ip-policy.json
1.6.4 Add a new Amazon S3 site to the data_docs_sites section of your great_expectations.yml
The example below shows the default local_site configuration that you will find in your great_expectations.yml file, followed by the s3_site configuration that you will need to add. You may optionally remove the default local_site configuration completely and replace it with the new s3_site configuration if you would prefer to maintain only a single S3 Data Docs site.
data_docs_sites:
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
  s3_site: # this is a user-selected name - you may select your own
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: data-docs.my_org # UPDATE the bucket name here to match the bucket you configured above.
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
      show_cta_footer: true
1.6.5 Test that your Data Docs configuration is correct by building the site
Use the following CLI command to build and open your newly configured S3 Data Docs site:
> great_expectations docs build --site-name s3_site
You will be presented with the following prompt:
The following Data Docs sites will be built:
- s3_site: https://s3.amazonaws.com/data-docs.my_org/index.html
Would you like to proceed? [Y/n]:
Signify that you would like to proceed by pressing the return key or entering Y. Once you have, you will be presented with the following messages:
Building Data Docs...
Done building Data Docs
If successful, the CLI will also open your newly built S3 Data Docs site and provide the URL, which you can share as desired. Note that the URL will only be viewable by users with IP addresses appearing in the above policy.
You may want to use the -y/--yes/--assume-yes flag with the great_expectations docs build --site-name s3_site command. This flag causes the CLI to skip the confirmation prompt, which can be useful in non-interactive environments.
Additional notes on hosting Data Docs from an Amazon S3 bucket
- Optionally, you may wish to update the static hosting settings for your bucket to enable AWS to automatically serve your index.html file or a custom error file:
> aws s3 website s3://data-docs.my_org/ --index-document index.html
- If you wish to host a Data Docs site in a subfolder of an S3 bucket, add the prefix property to the configuration snippet in step 1.6.4, immediately after the bucket property.
- If you wish to host a Data Docs site through a private DNS, you can configure a base_public_path for the Data Docs Store (a connector to store and retrieve human-readable documentation generated from Great Expectations metadata, detailing Expectations, Validation Results, etc.). The following example configures an S3 site with base_public_path set to www.mydns.com. Data Docs will still be written to the configured location on S3 (for example https://s3.amazonaws.com/data-docs.my_org/docs/index.html), but you will be able to access the pages from your DNS (http://www.mydns.com/index.html in our example):
data_docs_sites:
  s3_site: # this is a user-selected name - you may select your own
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: data-docs.my_org # UPDATE the bucket name here to match the bucket you configured above.
      base_public_path: http://www.mydns.com
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
      show_cta_footer: true
Part 2: Connect to data
2.1 Choose how to run the code for creating a new Datasource
The previous sections of this guide involved manually editing configuration files to add configurations for Amazon S3 buckets. When setting up your Datasource configurations, it is simpler to use Great Expectations' Python API. We recommend doing this from a Jupyter Notebook, as you will then receive immediate feedback on the results of your code blocks. However, you can alternatively use a Python script in the IDE of your choice.
If you would like, you may use the Great Expectations CLI (Command Line Interface) to automatically generate a pre-configured Jupyter Notebook. To do so, run the following console command from the root directory of your Great Expectations project:
great_expectations datasource new
Once you have your pre-configured Jupyter Notebook, you should follow the YAML-based workflow in the steps below.
If you choose to work from a blank Jupyter Notebook or a Python script, you may find it easier to use the following Python dictionary workflow over the YAML workflow. Great Expectations supports either configuration method.
2.2 Instantiate your project's DataContext
Import these necessary packages and modules.
from ruamel import yaml
import great_expectations as gx
from great_expectations.core.batch import Batch, BatchRequest, RuntimeBatchRequest
Use one of the guides below, based on your deployment, to instantiate your DataContext. Please proceed only after you have instantiated your DataContext.
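For a local filesystem deployment like the one created in step 1.3, this is a minimal sketch; run it from within your project directory:
# Load the Data Context configured by `great_expectations init`.
context = gx.get_context()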
2.3 Configure your Datasource
Using this example configuration, add in your S3 bucket and path to a directory that contains some of your data:
The configuration is shown first as YAML, then as a Python dictionary; use whichever matches your chosen workflow.
name: my_s3_datasource
class_name: Datasource
execution_engine:
  class_name: SparkDFExecutionEngine
data_connectors:
  default_runtime_data_connector_name:
    class_name: RuntimeDataConnector
    batch_identifiers:
      - default_identifier_name
  default_inferred_data_connector_name:
    class_name: InferredAssetS3DataConnector
    bucket: <your_s3_bucket_here>
    prefix: <bucket_path_to_data>
    default_regex:
      pattern: (.*)\.csv
      group_names:
        - data_asset_name
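For the test call below to work, the YAML configuration above must be available in Python as a string. A minimal sketch (the variable name datasource_yaml matches the test call that follows):
# Store the Datasource configuration as a Python string.
datasource_yaml = """
# ... paste the Datasource YAML configuration from above here ...
"""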
Run this code to test your configuration.
context.test_yaml_config(datasource_yaml)
datasource_config = {
    "name": "my_s3_datasource",
    "class_name": "Datasource",
    "execution_engine": {"class_name": "SparkDFExecutionEngine"},
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        },
        "default_inferred_data_connector_name": {
            "class_name": "InferredAssetS3DataConnector",
            "bucket": "<your_s3_bucket_here>",
            "prefix": "<bucket_path_to_data>",
            "default_regex": {
                "pattern": "(.*)\\.csv",
                "group_names": ["data_asset_name"],
            },
        },
    },
}
Run this code to test your configuration.
context.test_yaml_config(yaml.dump(datasource_config))
If you specified an S3 path containing CSV files, you will see them listed as Available data_asset_names in the output of test_yaml_config().
Feel free to adjust your configuration and re-run test_yaml_config() as needed.
2.4 Save the Datasource configuration to your DataContext
Save the configuration into your DataContext by using the add_datasource() function.
Depending on whether you used the YAML or the Python dictionary configuration, run one of:
context.add_datasource(**yaml.load(datasource_yaml))
or
context.add_datasource(**datasource_config)
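As an optional check (a sketch using the standard Data Context API), you can confirm the Datasource was registered:
# List the Datasources known to the Data Context; my_s3_datasource should appear.
print(context.list_datasources())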
2.5 Test your new Datasource
Verify your new Datasource (a Datasource provides a standard API for accessing and interacting with data from a wide variety of source systems) by loading data from it into a Validator (used to run an Expectation Suite against data) using a Batch Request (provided to a Datasource in order to create a Batch).
There are two ways to do this: specify an S3 path to a single CSV (using a RuntimeBatchRequest), or specify a data_asset_name (using a BatchRequest). Both are shown below.
First option: add the S3 path to your CSV in the path key under runtime_parameters in your RuntimeBatchRequest.
The path you will want to use is your S3 URI, not the URL.
batch_request = RuntimeBatchRequest(
    datasource_name="my_s3_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="<your_meaningful_name>",  # this can be anything that identifies this data_asset for you
    runtime_parameters={"path": "<path_to_your_data_here>"},  # Add your S3 path here.
    batch_identifiers={"default_identifier_name": "default_identifier"},
)
Then load data into the Validator.
context.add_or_update_expectation_suite(expectation_suite_name="test_suite")
validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name="test_suite"
)
print(validator.head())
Second option: add the name of the Data Asset (a collection of records within a Datasource, usually named based on the underlying data system and sliced to correspond to a desired specification) to the data_asset_name in your BatchRequest.
batch_request = BatchRequest(
    datasource_name="my_s3_datasource",
    data_connector_name="default_inferred_data_connector_name",
    data_asset_name="<your_data_asset_name>",
    batch_spec_passthrough={"reader_method": "csv", "reader_options": {"header": True}},
)
Then load data into the Validator.
context.add_or_update_expectation_suite(expectation_suite_name="test_suite")
validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name="test_suite"
)
print(validator.head())
Part 3: Create Expectations
3.1: Prepare a Batch Request, empty Expectation Suite, and Validator
When we tested our Datasource in step 2.5: Test your new Datasource we also created all of the components we need to begin creating Expectations: A Batch Request to provide sample data we can test our new Expectations against, an empty Expectation Suite to contain our new Expectations, and a Validator to create those Expectations with.
We can reuse those components now. Alternatively, you may follow the same process that we did before and define a new Batch Request, Expectation Suite, and Validator if you wish to use a different Batch of data as the reference sample when creating Expectations, or if you wish to use a different name than test_suite for your Expectation Suite.
3.2: Use a Validator to add Expectations to the Expectation Suite
There are many Expectations available for you to use. To demonstrate creating an Expectation through the use of the Validator we defined earlier, here are examples of the process for two of them:
validator.expect_column_values_to_not_be_null(column="passenger_count")
validator.expect_column_values_to_be_between(
    column="congestion_surcharge", min_value=0, max_value=1000
)
Each time you evaluate an Expectation (e.g. via validator.expect_*), two things will happen. First, the Expectation will immediately be Validated against your provided Batch of data. This instant feedback helps you zero in on unexpected data very quickly, taking a lot of the guesswork out of data exploration. Second, the Expectation configuration will be stored in the Expectation Suite you provided when the Validator was initialized.
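As a small illustration of that immediate feedback (a sketch reusing the column from the example above), each validator.expect_* call returns a result object whose success field reports whether the Batch met the Expectation:
# The returned result reports whether the Batch satisfied the Expectation.
result = validator.expect_column_values_to_not_be_null(column="passenger_count")
print(result.success)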
This is the same method of interactive Expectation Suite editing used in the CLI interactive mode notebook, accessed via great_expectations suite new --interactive. For more information, see our documentation on How to create and edit Expectations with instant feedback from a sample Batch of data.
You can also create Expectation Suites using a Data Assistant to automatically create expectations based on your data or manually using domain knowledge and without inspecting data directly.
To find out more about the available Expectations, please see our Expectations Gallery.
3.3: Save the Expectation Suite
When you have run all of the Expectations you want for this dataset, you can call validator.save_expectation_suite() to save the Expectation Suite (all of the unique Expectation Configurations from each run of validator.expect_*) for later use in a Checkpoint.
validator.save_expectation_suite(discard_failed_expectations=False)
Part 4: Validate Data
4.1: Create and run a Checkpoint
Here we will create and store a Checkpoint (the primary means for validating data in a production deployment of Great Expectations) for our Batch, which we can use to validate and run post-validation Actions (an Action is a Python class with a run method that takes a Validation Result and does something with it).
Checkpoints are a robust resource that can be preconfigured with a Batch Request and Expectation Suite or take them in as parameters at runtime. They can also execute numerous Actions based on the Validation Results that are returned when the Checkpoint is run.
This guide will demonstrate using a SimpleCheckpoint that takes in a Batch Request and Expectation Suite as parameters for the context.run_checkpoint(...) command.
For more information on pre-configuring a Checkpoint with a Batch Request and Expectation Suite, please see our guides on Checkpoints.
4.1.1 Create a Checkpoint
First we create the Checkpoint configuration:
As before, the configuration is shown first as YAML (wrapped in a Python f-string), then as a Python dictionary:
my_checkpoint_name = "version-0.15.50 insert_your_checkpoint_name_here"
my_checkpoint_config = f"""
name: {my_checkpoint_name}
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: "%Y%m%d-%H%M%S-my-run-name-template"
"""
my_checkpoint_name = "version-0.15.50 insert_your_checkpoint_name_here"
checkpoint_config = {
"name": my_checkpoint_name,
"config_version": 1.0,
"class_name": "SimpleCheckpoint",
"run_name_template": "%Y%m%d-%H%M%S-my-run-name-template",
}
Once we have defined our Checkpoint configuration, we can test our syntax using context.test_yaml_config(...):
Run one of the following, depending on your configuration format:
my_checkpoint = context.test_yaml_config(my_checkpoint_config)
or
my_checkpoint = context.test_yaml_config(yaml.dump(checkpoint_config))
Note that we get a message that the Checkpoint contains no validations. This is OK because we will pass them in at runtime, as shown below when we call context.run_checkpoint(...).
4.1.2 Save the Checkpoint
After using context.test_yaml_config(...) to verify that all is well, we can add the Checkpoint to our Data Context:
Run one of the following, depending on your configuration format:
context.add_or_update_checkpoint(**yaml.load(my_checkpoint_config))
or
context.add_or_update_checkpoint(**checkpoint_config)
4.1.3 Run the Checkpoint
Finally, having added our Checkpoint to our Data Context, we will run the Checkpoint. Since we did not pre-configure the Checkpoint with a Batch Request and Expectation Suite, we will pass those in as a list item in the validations parameter:
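Note that the snippet below references an expectation_suite_name variable; it should hold the name of the suite you saved in Part 3. A minimal sketch, assuming you kept the name test_suite from step 2.5:
# The suite created in step 2.5 and populated with Expectations in Part 3.
expectation_suite_name = "test_suite"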
checkpoint_result = context.run_checkpoint(
    checkpoint_name=my_checkpoint_name,
    validations=[
        {
            "batch_request": batch_request,
            "expectation_suite_name": expectation_suite_name,
        }
    ],
)
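You can then inspect the overall outcome; a quick sketch:
# True if every Expectation in the suite passed for this Batch.
print(checkpoint_result["success"])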
4.2: Build and view Data Docs
Since we used a SimpleCheckpoint, our Checkpoint already contained an UpdateDataDocsAction, which rendered our Data Docs (human-readable documentation generated from Great Expectations metadata detailing Expectations, Validation Results, etc.) from the Validation Results we just generated. That means our Data Docs Store will contain a new entry for the rendered Validation Result.
For more information on Actions that Checkpoints can perform and how to add them, please see our guides on Actions.
Viewing this new entry is as simple as running:
context.open_data_docs()
Congratulations!
🚀🚀 Congratulations! 🚀🚀 You have successfully navigated the entire workflow for using Great Expectations with Amazon Web Services S3 and Spark, from installing Great Expectations through Validating your Data.