How to configure a DataConnector to introspect and partition tables in SQL
This guide will help you introspect and partition
tables in an SQL database using
SimpleSqlalchemyDatasource
, which
operates as a proxy to
InferredAssetSqlDataConnector
and
ConfiguredAssetSqlDataConnector
. For
background, please see the
Datasource specific guides
in the
Connecting to your data
section of our documentation.
The SQL database introspection and partitioning are useful for:
- Exploring the schema and column metadata of the tables in your SQL database, and
- Organizing the tables into Data AssetsA collection of records within a Datasource which is usually named based on the underlying data system and sliced to correspond to a desired specification. according to the partitioning considerations informed by this exploration.
Partitioning enables you to select the desired subsets of your dataset for Validation.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- Have a working installation of Great Expectations
- Configured and loaded a Data Context
- Configured a Datasource and Data Connector
We will use the "Yellow Taxi" dataset to
walk you through the configuration of
SimpleSqlalchemyDatasource
, where the
introspection
section characterizes
InferredAssetSqlDataConnector
objects and
the tables
section characterizes
ConfiguredAssetSqlDataConnector
objects.
Starting with the bare-bones version of either the
introspection
section or the
tables
section of the
SimpleSqlalchemyDatasource
configuration,
we gradually build out the configuration to achieve
the introspection of your SQL database with the
semantics consistent with your goals.
Only introspection
and
tables
are the legal top-level keys
in the
SimpleSqlalchemyDatasource
configuration.
To learn more about DatasourcesProvides a standard API for accessing and interacting with data from a wide variety of source systems., Data ConnectorsProvides the configuration details based on the source data system which are needed by a Datasource to define Data Assets., and Batch(es)A selection of records from a Data Asset., please see our Datasources Core Concepts Guide in the Core Concepts reference guide.
Preliminary Steps
1. Instantiate your project's DataContext
Import Great Expectations.
import great_expectations as ge
2. Obtain DataContext
Load your DataContext into memory using the
get_context()
method.
context = ge.get_context()
Configuring Introspection and Tables
- Introspection (InferredAssetSqlDataConnector)
- Tables (ConfiguredAssetSqlDataConnector)
1. Configure your SimpleSqlalchemyDatasource for introspection
Start with an elementary
SimpleSqlalchemyDatasource
configuration, containing just a basic
introspection
component:
datasource_yaml = f"""
name: taxi_datasource
class_name: SimpleSqlalchemyDatasource
connection_string: <CONNECTION_STRING>
introspection: # Each key in the "introspection" section is the name of an InferredAssetSqlDataConnector (key name "introspection" in "SimpleSqlalchemyDatasource" configuration is reserved).
whole_table: {{}} # Any alphanumeric key name is acceptable.
"""
Using the above example configuration, specify the connection string for your database. Then run this code to test your configuration:
context.test_yaml_config(datasource_yaml)
Notice that the output reports the presence of
exactly one
InferredAssetSqlDataConnector
(called whole_table
, as per the
configuration).
An integral part of the recommended approach, illustrated as part of this exercise, will be the use of the internal Great Expectations utility
context.test_yaml_config(
yaml_string, pretty_print: bool = True,
return_mode: str = "instantiated_class",
shorten_tracebacks: bool = False,
)
to ensure the correctness of the proposed
YAML
configuration prior to
incorporating it and trying to use it.
For instance, try the following erroneous
SimpleSqlalchemyDatasource
configuration (it contains an illegal top-level
key):
datasource_yaml = f""" # buggy datasource_yaml configuration
name: mis_configured_datasource
class_name: SimpleSqlalchemyDatasource
connection_string: <CONNECTION_STRING>
introspecting: # illegal top-level key name
whole_table: {{}}
"""
Then specify the connection string for your database, and again run this code to test your configuration:
context.test_yaml_config(datasource_yaml)
Notice that the output reports an empty Data Connectors list, signaling a misconfiguration.
Feel free to experiment with the arguments to
context.test_yaml_config(
yaml_string, pretty_print: bool = True,
return_mode: str = "instantiated_class",
shorten_tracebacks: bool = False,
)
For instance, running
context.test_yaml_config(yaml_string, return_mode="report_object")
will return the information appearing in
standard output converted to the
Python
dictionary format.
Any structural errors (e.g., indentation, typos in class and configuration key names, etc.) will result in an exception raised and sent to standard error. This can be converted to an exception trace by running
context.test_yaml_config(yaml_string, shorten_tracebacks=True)
showing the line numbers, where the exception
occurred, most likely caused by the failure of
the required class (in this case
InferredAssetSqlDataConnector
) from
being successfully instantiated.
2. Customize the introspection configuration to fit your needs
SimpleSqlalchemyDatasource
supports
a number of configuration options to assist you
with the introspection
of your SQL
database:
-
the database views will included in the list
of identified
Data References
(by setting theinclude_views
flag totrue
) -
if any exceptions occur during the
introspection
operation, then the process will continue (by setting theskip_inapplicable_tables
flag totrue
) -
specifying
excluded_tables
will have the effect of excluding only the tables on this list, while including the rest -
specifying
included_tables
will have the effect of including only the tables on this list, while excluding the rest
The following YAML
configurqation
example utilizes several of these configuration
directives:
datasource_yaml = f"""
name: taxi_datasource
class_name: SimpleSqlalchemyDatasource
connection_string: <CONNECTION_STRING>
introspection: # Each key in the "introspection" section is the name of an InferredAssetSqlDataConnector (key name "introspection" in "SimpleSqlalchemyDatasource" configuration is reserved).
whole_table:
include_schema_name: true
introspection_directives:
include_views: true
skip_inapplicable_tables: true # skip and continue upon encountering introspection errors
excluded_tables: # a list of tables to ignore when inferring data asset_names
- main.yellow_tripdata_sample_2019_03 # format: schema_name.table_name
"""
3. Save the Datasource configuration to your DataContext
Once the
SimpleSqlalchemyDatasource
configuration is error-free and satisfies your
requirements, save it into your
DataContext
by using the
add_datasource()
function.
context.add_datasource(**yaml.load(datasource_yaml))
4. Get names of available Data Assets
Getting names of available data assets using an
InferredAssetSqlDataConnector
affords you the visibility into types and naming
structures of tables in your SQL database:
available_data_asset_names = context.datasources[
"taxi_datasource"
].get_available_data_asset_names(data_connector_names="whole_table")["whole_table"]
1. Configure your SimpleSqlalchemyDatasource to characterize tables
Start with an elementary
SimpleSqlalchemyDatasource
configuration, containing just a basic
tables
component:
datasource_yaml = f"""
name: taxi_datasource
class_name: SimpleSqlalchemyDatasource
connection_string: <CONNECTION_STRING>
tables: # Each key in the "tables" section is a table_name (key name "tables" in "SimpleSqlalchemyDatasource" configuration is reserved).
yellow_tripdata_sample_2019_01: # Must match table name exactly.
partitioners: # Each key in the "partitioners" sub-section the name of a ConfiguredAssetSqlDataConnector (key name "partitioners" in "SimpleSqlalchemyDatasource" configuration is reserved).
whole_table: {{}}
"""
Using the above example configuration, specify the connection string for your database. Then run this code to test your configuration:
context.test_yaml_config(datasource_yaml)
Notice that the output reports the presence of
exactly one
ConfiguredAssetSqlDataConnector
(called whole_table
, as per the
configuration) and that
Available data_asset_names (1 of 1)
, the name of the single
Data Asset
being
yellow_tripdata_sample_2019_01
.
2. Enhance your SimpleSqlalchemyDatasource with ability to attribute metadata to tables
Add Data Asset Name
identification
attributes (data_asset_name_prefix
and data_asset_name_suffix
) and set
the include_schema_name
flag in
your
ConfiguredAssetSqlDataConnector
(named whole_table
) configuration
section. These directives will result in the
reported properties of your table to contain
annotations, customized for your purposes:
datasource_yaml = f"""
name: taxi_datasource
class_name: SimpleSqlalchemyDatasource
connection_string: <CONNECTION_STRING>
tables: # Each key in the "tables" section is a table_name (key name "tables" in "SimpleSqlalchemyDatasource" configuration is reserved).
yellow_tripdata_sample_2019_01: # Must match table name exactly.
partitioners: # Each key in the "partitioners" sub-section the name of a ConfiguredAssetSqlDataConnector (key name "partitioners" in "SimpleSqlalchemyDatasource" configuration is reserved).
whole_table:
include_schema_name: true
schema_name: main
data_asset_name_prefix: taxi__
data_asset_name_suffix: __asset
"""
Using the above example configuration, specify the connection string for your database. Then run this code to test your configuration:
context.test_yaml_config(datasource_yaml)
Notice that the output reports the presence of
exactly one
ConfiguredAssetSqlDataConnector
(called whole_table
, as per the
configuration) and that
Available data_asset_names (1 of 1)
, the name of the single
Data Asset
this time being
taxi__yellow_tripdata_sample_2019_01__asset
, correctly reflecting the enhanced
configuration directives.
Finally, once your
Data Connector
configuration
satisfies your requirements, save the enclosing
Datasource
into your
DataContext
using
context.add_datasource(**yaml.load(datasource_yaml))
To view the full script used in this page, see it on GitHub: