Get started with Great Expectations and SQL
Use the information provided here to learn how you can use Great Expectations (GX) with a SQL Data Source. The following examples use a PostgreSQL database.
To use GX with a PostgreSQL database, you'll complete the following tasks:
- Load data
- Instantiate a Data Context (the primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components)
- Create a Data Source (provides a standard API for accessing and interacting with data from a wide variety of source systems) and a Data Asset (a collection of records within a Data Source, usually named based on the underlying data system and sliced to correspond to a desired specification)
- Create an Expectation Suite (a collection of verifiable assertions about data)
- Validate data using a Checkpoint (the primary means for validating data in a production deployment of Great Expectations)
The full code used in the following examples is available on GitHub.
Prerequisites
- Completed the Quickstart guide
- A working PostgreSQL Database
- A working Python environment
Install GX
1. Run the following command to install GX in your Python environment:

   ```bash
   pip install great-expectations
   ```

2. Run the following code to import the GX module and the `Checkpoint` class that you'll use in the following steps:

   ```python
   import great_expectations as gx
   from great_expectations.checkpoint import Checkpoint
   ```
Set up GX
To avoid configuring external resources, you'll use your local filesystem for your Metadata Stores and your Data Docs (human-readable documentation generated from Great Expectations metadata detailing Expectations, Validation Results, etc.) store.
Run the following code to create a Data Context with the default settings:

```python
context = gx.get_context()
```
Connect to your data
1. Use a `connection_string` to securely connect to your PostgreSQL instance. For example:

   ```python
   PG_CONNECTION_STRING = "postgresql+psycopg2://postgres:@localhost/taxi_db"
   ```

   Replace the connection string with the connection string for your database. For additional information about other connection methods, see How to configure credentials. In this example, existing New York City taxi cab data is being used.

2. Run the following command to create a Data Source to represent the data available in your PostgreSQL database:

   ```python
   pg_datasource = context.sources.add_sql(
       name="pg_datasource", connection_string=PG_CONNECTION_STRING
   )
   ```

3. Run the following command to create a Data Asset to represent a discrete set of data:

   ```python
   pg_datasource.add_table_asset(
       name="postgres_taxi_data", table_name="postgres_taxi_data"
   )
   ```

   In this example, the name of a specific table within your database is used.

4. Run the following command to build a Batch Request, which is provided to a Data Source in order to create a Batch, using the Data Asset you configured previously:

   ```python
   batch_request = pg_datasource.get_asset("postgres_taxi_data").build_batch_request()
   ```
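The connection string in step 1 embeds credentials directly in the URL. As a minimal sketch (not part of the GX API, and using hypothetical `PG_*` environment variable names), you can assemble the string from the environment and escape the password so characters such as `@` or `/` don't break the URL:

```python
import os
from urllib.parse import quote_plus

# Hypothetical environment variable names -- adjust to match your setup.
user = os.environ.get("PG_USER", "postgres")
password = os.environ.get("PG_PASSWORD", "")
host = os.environ.get("PG_HOST", "localhost")
database = os.environ.get("PG_DATABASE", "taxi_db")

# quote_plus() percent-encodes characters that are reserved in URLs,
# so a password like "p@ss/word" becomes "p%40ss%2Fword".
PG_CONNECTION_STRING = (
    f"postgresql+psycopg2://{user}:{quote_plus(password)}@{host}/{database}"
)
print(PG_CONNECTION_STRING)
```

Keeping credentials out of source code also makes it easier to use different databases in development and production without editing the script.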
Create Expectations
You'll use a Validator to interact with your batch of data and generate an Expectation Suite.
Every time you evaluate an Expectation with `validator.expect_*`, it is immediately validated against your data. This instant feedback helps you identify unexpected data and removes the guesswork from data exploration. The Expectation configuration is stored in the Validator. When you are finished running the Expectations on the dataset, you can use `validator.save_expectation_suite()` to save all of your Expectation configurations into an Expectation Suite for later use in a Checkpoint.
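Conceptually, each Expectation is a verifiable assertion evaluated against every row in the batch, returning a result that reports success and any unexpected values. The following is a plain-Python illustration of that idea (not the GX API), mirroring the two Expectations used in this example:

```python
# Sample rows standing in for a batch of taxi data.
rows = [
    {"passenger_count": 1, "congestion_surcharge": 2.5},
    {"passenger_count": 2, "congestion_surcharge": 0.0},
    {"passenger_count": None, "congestion_surcharge": 2.5},
]

def expect_not_null(rows, column):
    # Mirrors expect_column_values_to_not_be_null: every value must be present.
    unexpected = [r for r in rows if r[column] is None]
    return {"success": not unexpected, "unexpected_count": len(unexpected)}

def expect_between(rows, column, min_value, max_value):
    # Mirrors expect_column_values_to_be_between: non-null values must fall
    # within [min_value, max_value].
    values = [r[column] for r in rows if r[column] is not None]
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    return {"success": not unexpected, "unexpected_count": len(unexpected)}

print(expect_not_null(rows, "passenger_count"))
# → {'success': False, 'unexpected_count': 1}
print(expect_between(rows, "congestion_surcharge", 0, 1000))
# → {'success': True, 'unexpected_count': 0}
```

The real Validator returns a much richer result object, but the success/unexpected-count structure is the same feedback loop you see after each `validator.expect_*` call.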
1. Run the following command to create the suite and get a `Validator`:

   ```python
   expectation_suite_name = "insert_your_expectation_suite_name_here"
   context.add_or_update_expectation_suite(expectation_suite_name=expectation_suite_name)
   validator = context.get_validator(
       batch_request=batch_request,
       expectation_suite_name=expectation_suite_name,
   )
   print(validator.head())
   ```

2. Run the following command to use the `Validator` to add a few Expectations:

   ```python
   validator.expect_column_values_to_not_be_null(column="passenger_count")
   validator.expect_column_values_to_be_between(
       column="congestion_surcharge", min_value=0, max_value=1000
   )
   ```

3. Run the following command to save your Expectation Suite (all the unique Expectation Configurations from each run of `validator.expect_*`) to your Expectation Store:

   ```python
   validator.save_expectation_suite(discard_failed_expectations=False)
   ```
Validate your data
You'll create and store a Checkpoint for your batch, which you can use to validate data and run post-validation actions.
1. Run the following command to create the Checkpoint configuration that uses your Data Context:

   ```python
   my_checkpoint_name = "my_sql_checkpoint"

   checkpoint = Checkpoint(
       name=my_checkpoint_name,
       run_name_template="%Y%m%d-%H%M%S-my-run-name-template",
       data_context=context,
       batch_request=batch_request,
       expectation_suite_name=expectation_suite_name,
       action_list=[
           {
               "name": "store_validation_result",
               "action": {"class_name": "StoreValidationResultAction"},
           },
           {"name": "update_data_docs", "action": {"class_name": "UpdateDataDocsAction"}},
       ],
   )
   ```

2. Run the following command to save the Checkpoint:

   ```python
   context.add_or_update_checkpoint(checkpoint=checkpoint)
   ```

3. Run the following command to run the Checkpoint, which validates your Batch Request (your data) against your Expectation Suite (your tests):

   ```python
   checkpoint_result = checkpoint.run()
   ```

   Your Checkpoint configuration includes the `store_validation_result` and `update_data_docs` actions. The `store_validation_result` action saves your Validation Results from the Checkpoint run and allows the results to be persisted for future use. The `update_data_docs` action builds Data Docs files for the validations run in the Checkpoint.

   To learn more about data validation and customizing Checkpoints, see Validate Data: Overview.

   To view the full Checkpoint configuration, run `print(checkpoint.get_config().to_yaml_str())`.
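The `run_name_template` in the Checkpoint configuration uses strftime-style tokens, which GX fills in with the run time so that each Checkpoint run gets a unique, chronologically sortable run name. A quick standard-library sketch of how such a template expands:

```python
from datetime import datetime, timezone

# The same template used in the Checkpoint configuration above.
template = "%Y%m%d-%H%M%S-my-run-name-template"

# strftime substitutes the date/time tokens, leaving the literal suffix intact.
run_name = datetime.now(timezone.utc).strftime(template)
print(run_name)  # e.g. 20240115-093042-my-run-name-template
```

Because the timestamp leads the name, listing run names alphabetically also orders them by time.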
Build and view Data Docs
Your Checkpoint contained an `UpdateDataDocsAction`, so your Data Docs have already been built from the validation you ran, and your Data Docs store contains a new rendered Validation Result.
Run the following command to open your Data Docs and review the results of your Checkpoint run:

```python
context.open_data_docs()
```
Next steps
Now that you've created and saved a Data Context, Data Source, Data Asset, Expectation Suite, and Checkpoint, see Validate data with Expectations and Checkpoints to create a script to run the Checkpoint without the need to recreate your Data Assets and Expectations.