How to configure DataContext components using test_yaml_config
test_yaml_config is a convenience method for configuring the moving parts of a Great Expectations deployment. It allows you to quickly test out configs for Datasources (which provide a standard API for accessing and interacting with data from a wide variety of source systems) and Stores (connectors that store and retrieve metadata in Great Expectations). For many deployments of Great Expectations, these components, plus Expectations (verifiable assertions about data), are the only ones you'll need.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- Installed a working version of Great Expectations
- Set up a working deployment of Great Expectations
test_yaml_config is primarily intended for use within a notebook, where you can iterate through an edit-run-check loop in seconds.
Steps
- Instantiate a DataContext
Create a new Jupyter Notebook and instantiate a DataContext by running the following lines:
import great_expectations as ge
context = ge.get_context()
- Create or copy a YAML config
You can create your own, or copy an example. This example uses a Datasource that connects to PostgreSQL.
config = """
class_name: SimpleSqlalchemyDatasource
credentials:
  drivername: postgresql
  username: postgres
  password: ""
  host: localhost
  port: 5432
  database: test_ci
introspection:
  whole_table: {}
"""
- Run context.test_yaml_config.
context.test_yaml_config(
    name="my_datasource",
    yaml_config=config
)
When executed, test_yaml_config will instantiate the component and run through a self_check procedure to verify that the component works as expected.
In the case of a Datasource, this means:
- confirming that the connection works,
- gathering a list of available DataAssets (e.g. tables in SQL; files or folders in a filesystem), and
- verifying that it can successfully fetch at least one Batch (a selection of records from a Data Asset) from the source.
The output will look something like this:
Attempting to instantiate class from config...
Instantiating as a Datasource, since class_name is SimpleSqlalchemyDatasource
Successfully instantiated SimpleSqlalchemyDatasource
Execution engine: SqlAlchemyExecutionEngine
Data connectors:
whole_table : InferredAssetSqlDataConnector
Available data_asset_names (3 of 14440):
expect_table_row_count_to_equal_other_table_data_1 (1 of 1): [{}]
expect_table_row_count_to_equal_other_table_data_2 (1 of 1): [{}]
expect_table_row_count_to_equal_other_table_data_3 (1 of 1): [{}]
Unmatched data_references (0 of 0): []
Choosing an example data reference...
Reference chosen: {}
Fetching batch data...
Showing 5 rows
   c1 c2    c3   c4
0   4  a  None  4.0
1   5  b  None  3.0
2   6  c  None  3.5
3   7  d  None  1.2
<great_expectations.datasource.simple_sqlalchemy_datasource.SimpleSqlalchemyDatasource at 0x12c1e4d50>
If something about your configuration wasn't set up correctly, test_yaml_config will raise an error. Whenever possible, test_yaml_config provides helpful warnings and error messages. It can't diagnose every problem, but it can surface many of them. For example, if no PostgreSQL server is accepting connections at the configured host and port, the output looks something like this:
Attempting to instantiate class from config...
Instantiating as a Datasource, since class_name is SimpleSqlalchemyDatasource
---------------------------------------------------------------------------
OperationalError Traceback (most recent call last)
~/anaconda2/envs/py3/lib/python3.7/site-packages/sqlalchemy/engine/base.py in _wrap_pool_connect(self, fn, connection)
2338 try:
-> 2339 return fn()
2340 except dialect.dbapi.Error as e:
...
OperationalError: (psycopg2.OperationalError) could not connect to server: Connection refused
Is the server running on host "localhost" (::1) and accepting
TCP/IP connections on port 5433?
could not connect to server: Connection refused
Is the server running on host "localhost" (127.0.0.1) and accepting
TCP/IP connections on port 5433?
(Background on this error at: http://sqlalche.me/e/13/e3q8)
- Iterate as necessary.
From here, iterate by editing your config and re-running test_yaml_config, adding config blocks for additional introspection, Data Assets (collections of records within a Datasource, usually named after the underlying data system and sliced to correspond to a desired specification), sampling, etc.
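If most of your edits are to connection details, one option is to generate the config string from parameters instead of editing it by hand. The sketch below is only an illustration: build_postgres_config is a hypothetical helper (not part of Great Expectations), its defaults mirror the example config in this guide, and it inserts values verbatim without any YAML escaping.

```python
import textwrap


def build_postgres_config(host="localhost", port=5432, database="test_ci",
                          username="postgres", password=""):
    """Hypothetical helper: return a YAML config string for the example Datasource."""
    # Values are inserted as-is; quote or escape them yourself if they contain
    # characters that are special in YAML.
    return textwrap.dedent(f"""\
        class_name: SimpleSqlalchemyDatasource
        credentials:
          drivername: postgresql
          username: {username}
          password: "{password}"
          host: {host}
          port: {port}
          database: {database}
        introspection:
          whole_table: {{}}
        """)


# Same values as the example config; change one argument per iteration.
config = build_postgres_config()
```

Each call returns a fresh string that you can pass as yaml_config to context.test_yaml_config, so trying a different host or port becomes a one-argument change.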
- (Optional) Test additional methods.
Note that when test_yaml_config runs successfully, it adds the specified Datasource to your DataContext. This means that you can also test other methods, such as context.get_validator, right within your notebook:
validator = context.get_validator(
    datasource_name="my_datasource",
    data_connector_name="whole_table",
    data_asset_name="my_table",
    create_expectation_suite_with_name="my_expectation_suite",
)
validator.expect_column_values_to_be_in_set("c1", [4, 5, 6])
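If get_validator complains that it cannot find the Datasource, it can help to confirm which names the context currently knows about. The following is a minimal sketch; datasource_is_registered is a hypothetical helper, and it assumes that context.list_datasources() returns a list of dictionaries that each include a "name" key.

```python
def datasource_is_registered(context, datasource_name):
    """Hypothetical helper: True if the context lists a Datasource with this name."""
    # Assumption: list_datasources() yields dicts such as
    # {"name": ..., "class_name": ..., ...}.
    return any(
        ds.get("name") == datasource_name for ds in context.list_datasources()
    )
```

Call it with the same name you passed to test_yaml_config, for example datasource_is_registered(context, "the_name_you_used"), before calling context.get_validator.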
- Save the config.
Once you are satisfied with your config, you can make it a permanent part of your Great Expectations setup by copying it into the appropriate section of your great_expectations/great_expectations.yml file.
datasources:
  my_datasource:
    class_name: SimpleSqlalchemyDatasource
    credentials:
      drivername: postgresql
      username: postgres
      password: ""
      host: localhost
      port: 5432
      database: test_ci
    introspection:
      whole_table: {}
- Check your modified config.
In a fresh notebook, test your edited config file by re-instantiating your DataContext:
import great_expectations as ge

context = ge.get_context()
validator = context.get_validator(
    datasource_name="my_datasource",
    data_connector_name="whole_table",
    data_asset_name="my_table",
    create_expectation_suite_with_name="my_expectation_suite",
)
validator.expect_column_values_to_be_in_set("c1", [4, 5, 6])
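The expectation call returns a result object, so you can check programmatically whether the saved config produced usable data. The sketch below is a hedged illustration: summarize_result is a hypothetical helper, and it assumes the returned object exposes a boolean success attribute and a dictionary-like result attribute that may contain an "unexpected_count" entry.

```python
def summarize_result(result):
    """Hypothetical helper: one-line summary of an expectation validation result."""
    # Assumption: result.success is a bool and result.result is a dict (or None).
    details = result.result or {}
    status = "passed" if result.success else "failed"
    unexpected = details.get("unexpected_count")
    if unexpected is None:
        return status
    return f"{status} ({unexpected} unexpected values)"
```

In the notebook, assign the return value of validator.expect_column_values_to_be_in_set("c1", [4, 5, 6]) to a variable and pass it to summarize_result to print a compact status line.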