How to configure a SQL Datasource
This guide will walk you through the process of configuring a SQL Datasource from scratch, verifying that your configuration is valid, and adding it to your Data Context. By the end of this guide you will have a SQL Datasource which you can use in future workflows for creating Expectations and Validating data.
Steps
1. Import necessary modules and initialize your Data Context
from ruamel import yaml
import great_expectations as gx
data_context: gx.DataContext = gx.get_context()
The great_expectations module will give
you access to your Data Context, which is the entry
point for working with a Great Expectations project.
The yaml module from
ruamel will be used to validate your
Datasource's configuration. Great Expectations
will use a Python dictionary representation of your
Datasource configuration when you add your Datasource
to your Data Context. However, Great Expectations
saves configurations as YAML files, so
when you validate your configuration you will need to
first convert it from a Python dictionary to a
YAML string.
The Data Context that is initialized by
get_context() will be the Data
Context defined in your current working directory. It
will provide you with convenience methods that we will
use to validate your Datasource configuration and add
your Datasource to your Great Expectations project
once you have configured it.
2. Create a new Datasource configuration.
A new Datasource can be configured in Python as a dictionary with a specific set of keys. We will build our Datasource configuration from scratch in this guide, although you can just as easily modify an existing one.
To start, create an empty dictionary. You will be populating it with keys as you go forward.
At this point, the configuration for your Datasource is merely:
datasource_config: dict = {}
However, from this humble beginning you will be able to build a full Datasource configuration.
The keys needed for your Datasource configuration
At the top level, your Datasource's configuration will need the following keys:
- name: The name of the Datasource, which will be used to reference the Datasource in Batch Requests.
- class_name: The name of the Python class instantiated by the Datasource. Typically, this will be the Datasource class.
- module_name: The name of the module that contains the Class definition indicated by class_name.
- execution_engine: A dictionary containing the class_name and module_name of the Execution Engine instantiated by the Datasource.
- data_connectors: The configurations for any Data Connectors and their associated Data Assets that you want to have available when utilizing the Datasource.
In the following steps we will add those keys and their corresponding values to your currently empty Datasource configuration dictionary.
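Taken together, these keys give the finished configuration the overall shape sketched below. The values here are placeholders only; each will be filled in over the course of the following steps.
datasource_config: dict = {
    "name": "...",            # step 3: a name of your choosing
    "class_name": "...",      # step 4: the Datasource class
    "module_name": "...",     # step 4: the module containing that class
    "execution_engine": {},   # step 5: the SqlAlchemyExecutionEngine configuration
    "data_connectors": {},    # steps 6 through 8: one entry per Data Connector
}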
3. Name your Datasource
The first key that you will need to define for your
new Datasource is its name. You will use
this to reference the Datasource in future workflows.
It can be anything you want it to be, but ideally you
will name it something relevant to the data that it
interacts with.
For the purposes of this example, we will name this Datasource:
"name": "my_datasource_name", # Preferably name it something relevant
You should, however, name your Datasource something more relevant to your data.
At this point, your configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
}
4. Specify the Datasource class and module
The class_name and
module_name for your Datasource will
almost always indicate the
Datasource class found at
great_expectations.datasource. You may
replace this with a specialized subclass, or a custom
class, but for almost all regular purposes these two
default values will suffice. For the purposes of this
guide, add those two values to their corresponding
keys.
"class_name": "Datasource",
"module_name": "great_expectations.datasource"
Your full configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
}
5. Add the SqlAlchemy Execution Engine to your Datasource configuration
Your Execution Engine is where you will specify that
you want this Datasource to use
SQL
in the backend. As with the Datasource top level
configuration, you will need to provide the
class_name and
module_name that indicate the class
definition and containing module for the Execution
Engine that you will use.
For the purposes of this guide, these will consist of
the SqlAlchemyExecutionEngine found at
great_expectations.execution_engine. The
execution_engine key and its
corresponding value will therefore look like this:
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
Additionally, your
execution_engine dictionary will require
a value for either connection_string or
credentials. You will only need to use
one of these keys, as they each serve the same purpose:
to provide the parameters necessary for the
SqlAlchemyExecutionEngine to connect to your desired
database. For the purposes of this guide we will use
the connection_string key, the value of
which will be the string representation of the
information necessary to connect to your SQL database.
The key/value entry for your connection string will look like:
"connection_string": CONNECTION_STRING,
Your connection string will vary depending on the type of SQL database you are connecting to. For more information on how to configure your connection string, please see the appropriate guide for connecting to a specific Database.
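For example, if you happened to be connecting to a PostgreSQL database (an illustrative assumption; your dialect, driver, host, and credentials will differ), the connection string would follow the standard SqlAlchemy URL format:
# Hypothetical PostgreSQL connection string in SqlAlchemy URL format:
# dialect+driver://username:password@host:port/database
CONNECTION_STRING = "postgresql+psycopg2://my_username:my_password@my_host:5432/my_database"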
After adding the above snippets to your Datasource configuration, your full configuration dictionary should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
}
Using credentials instead of connection_string
The credentials key uses a dictionary
to provide the elements of your connection string
as separate, individual values. For information on
how to populate the
credentials dictionary and how to
configure your
great_expectations.yml project config
file to populate credentials from either a YAML
file or a secret manager, please see our guide on
How to configure credentials.
In addition to substituting
credentials for
connection_string, there are a few
other optional keys that you can add to your
execution_engine dictionary.
The first of these is the optional key
create_temp_table. Creating a temp
table to improve query efficiency is the default
behaviour of the
SqlAlchemyExecutionEngine; however,
you can set create_temp_table to
False if you don't have the
permissions to create a temp table on the database
you are working with.
The other valid keys are not actually defined by
the Great Expectations code. Instead, any other
key/value pairs that you add to the Execution
Engine configuration will be passed directly to
SqlAlchemy's
create_engine(...) method. This
allows you to fully leverage all the capabilities
supported by SqlAlchemy through that method.
For more information on what keys
create_engine(...) will accept as
parameters and how you can utilize this feature,
please refer to
SqlAlchemy's documentation.
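As a sketch of how these optional keys fit together (the specific pass-through arguments shown, pool_pre_ping and echo, are ordinary create_engine(...) parameters chosen purely for illustration), an execution_engine dictionary using them might look like:
"execution_engine": {
    "class_name": "SqlAlchemyExecutionEngine",
    "module_name": "great_expectations.execution_engine",
    "connection_string": CONNECTION_STRING,
    "create_temp_table": False,  # skip temp table creation if you lack the permissions for it
    "pool_pre_ping": True,       # passed straight through to SqlAlchemy's create_engine(...)
    "echo": False,               # passed straight through to SqlAlchemy's create_engine(...)
},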
6. Add a dictionary as the value of the data_connectors key
The data_connectors key should have a
dictionary as its value. Each key/value pair in this
dictionary will correspond to a Data Connector's
name and configuration, respectively.
The keys in the
data_connectors dictionary will be the
names of the Data Connectors, which you will use to
indicate which Data Connector to use in future
workflows. As with the value of your Datasource's
name key, you can use any value you want
for a Data Connector's name. Ideally, you will
use something relevant to the data that each
particular Data Connector will provide; the only
significant difference is that for Data Connectors the
name of the Data Connector is its key in the
data_connectors dictionary.
The values for each of your
data_connectors keys will be the Data
Connector configurations that correspond to each Data
Connector's name. You may define multiple Data
Connectors in the
data_connectors dictionary by including
multiple key/value pairs.
For now, start by adding an empty dictionary as the
value of the data_connectors key. We will
begin populating it with Data Connector configurations
in the next step.
Your current configuration should look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {},
}
7. Configure your individual Data Connectors (Splitting, sampling, etc.)
For each Data Connector configuration, you will need
to specify which type of Data Connector you will be
using. For
SQL, the most likely ones will be the
InferredAssetSqlDataConnector, the
ConfiguredAssetSqlDataConnector, and the
RuntimeDataConnector.
If you are uncertain which Data Connector best suits your needs, please refer to our guide on how to choose which Data Connector to use.
Data Connector example configurations:
- InferredAssetSqlDataConnector
- ConfiguredAssetSqlDataConnector
- RuntimeDataConnector
The InferredAssetDataConnector is ideal for:
- quickly setting up a Datasource and getting access to data
- diving straight in to working with Great Expectations
- initial data discovery and introspection
However, the InferredAssetDataConnector allows less control over the definitions of your Data Assets than the ConfiguredAssetDataConnector provides.
If you are at the point of building a
repeatable workflow, we encourage using the
ConfiguredAssetDataConnector
instead.
Remember, the key that you provide for each Data Connector configuration dictionary will be used as the name of the Data Connector. For this example, we will use the name name_of_my_inferred_data_connector, but you may have it be anything you like.
At this point, your configuration should look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_inferred_data_connector": {},
},
}
Required Data Connector configuration keys
When defining an
InferredAssetSqlDataConnector you
will need to provide values for
one required key in the Data
Connector's configuration dictionary (the
currently empty dictionary that corresponds to
"version-0.15.50
name_of_my_inferred_data_connector"
in the example above).
This key/value pair consists of:
- class_name: The name of the Class that will be instantiated for this Data Connector.
For this example, you will be using the
InferredAssetSqlDataConnector as
your class_name. This is a subclass
of the
InferredAssetDataConnector that is
specialized to support SQL Execution Engines,
such as the
SqlAlchemyExecutionEngine. This
key/value entry will therefore look like:
"class_name": "InferredAssetSqlDataConnector",
Because we are using one of Great Expectations' built-in Data Connectors, an entry for module_name along with a default value will be provided when this Data Connector is initialized.
However, if you want to use a custom Data
Connector, you will need to explicitly add a
module_name key alongside the
class_name key.
The value for module_name would
then be set as the import path for the
module containing your custom Data
Connector, in the same fashion as you would
provide class_name and
module_name for a custom
Datasource or Execution Engine.
With this value added your full configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_inferred_data_connector": {
"class_name": "InferredAssetSqlDataConnector",
},
},
}
Optional Data Connector configuration key for defining introspection behaviour
There is an optional key that can be defined for your Data Connector to alter the default behaviour of introspection methods such as those used by auto-initializing Expectations and Data Assistants. This key is:
- introspection_directives: A dictionary of arguments passed to the introspection method of auto-initializing Expectations and Data Assistants to guide the introspection process.
You will find a list of the valid keys for
the
introspection_directives
dictionary and their corresponding values in
the
Introspection directives subsection of
the Additional notes
at the end of this guide.
There are also optional keys that you can define to alter how Data Assets are inferred, how they are split into Batches, and how data is sampled from Batches. This guide will go over those in the next section.
A ConfiguredAssetDataConnector enables the most fine-tuning, allowing you to easily work with multiple Batches. It also requires an explicit listing of each Data Asset you connect to and how Batches are defined within that Data Asset, which makes it very clear what Data Assets are being provided when you reference it in Profilers, Batch Requests, or Checkpoints.
Remember, the key that you provide for each Data Connector configuration dictionary will be used as the name of the Data Connector. For this example, we will use the name name_of_my_configured_data_connector, but you may have it be anything you like.
At this point, your configuration should look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_configured_data_connector": {},
},
}
Required Data Connector configuration keys
When defining a ConfiguredAssetSqlDataConnector you will need to provide values for two required keys in the Data Connector's configuration dictionary (the currently empty dictionary that corresponds to "name_of_my_configured_data_connector" in the example above).
These key/value pairs consist of:
- class_name: The name of the Class that will be instantiated for this Data Connector.
- assets: A dictionary where the keys are the names of specific Data Assets and the values are the configurations for the corresponding Data Assets.
For this example, you will be using the
ConfiguredAssetSqlDataConnector as
your class_name. This is a subclass
of the
ConfiguredAssetDataConnector that
is specialized to support SQL Execution Engines,
such as the
SqlAlchemyExecutionEngine. This
key/value entry will therefore look like:
"class_name": "ConfiguredAssetSqlDataConnector",
Because we are using one of Great Expectations' built-in Data Connectors, an entry for module_name along with a default value will be provided when this Data Connector is initialized.
However, if you want to use a custom Data
Connector, you will need to explicitly add a
module_name key alongside the
class_name key.
The value for module_name would
then be set as the import path for the
module containing your custom Data
Connector, in the same fashion as you would
provide class_name and
module_name for a custom
Datasource or Execution Engine.
With this value added, along with a blank
dictionary for assets, your full
configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_configured_data_connector": {
"class_name": "ConfiguredAssetSqlDataConnector",
"assets": {},
},
},
}
Optional Data Connector configuration keys for splitting Data Assets into Batches
In addition to the above key/value pairs, there are some optional key/value pairs that you can use to define how your Data Assets are split into Batches. If you define these key/value pairs in your Data Connector dictionary, they will be applied to all the Data Connector's Data Assets. However, if you choose not to define these key/value pairs, the Data Assets that are defined for this Data Connector will default to returning a single Batch consisting of a full table when requested from a Batch Request.
The key/value pairs that are used for splitting a Data Asset into Batches are:
- splitter_method: A string that names the method that will be used to split the target table into multiple Batches.
- splitter_kwargs: A dictionary containing keyword arguments to pass to splitter_method if splitter_method has been defined.
For more information on the available splitting methods, please see the Splitting methods subsection under Additional notes at the end of this guide.
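As a brief sketch of what connector-level splitting might look like (reusing the pickup_datetime column from the NYC taxi example used later in this guide), the following would split every Data Asset defined under this Data Connector by year and month:
"name_of_my_configured_data_connector": {
    "class_name": "ConfiguredAssetSqlDataConnector",
    "splitter_method": "split_on_year_and_month",        # applied to every Data Asset below
    "splitter_kwargs": {"column_name": "pickup_datetime"},
    "assets": {},  # Data Assets will be added in step 8
},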
Optional Data Connector configuration keys for sampling data from returned Batches
You may wish to only sample the data that is
returned in your Data Asset's Batches. To
do this, you will need to define the optional
keys sampling_method and
sampling_kwargs. As with
splitter_method and
splitter_kwargs, defining these
key/value pairs in your Data Connector's
dictionary will result in those values being
applied to all Data Assets that are made
available by the Data Connector.
The key/value pairs that are used for sampling data from a Data Asset are:
- sampling_method: A string that names the method that will be used to sample data from returned Batches.
- sampling_kwargs: A dictionary containing keyword arguments to pass to sampling_method if sampling_method has been defined.
Although this guide will not use sampling in its examples, there is a list of the available sampling methods in the Sampling methods subsection of the Additional notes section at the end of this guide.
Optional Data Connector configuration key for defining introspection behaviour
Finally, there is an optional key that can be defined to alter the default behaviour of introspection methods such as those used by auto-initializing Expectations and Data Assistants. This key is:
- introspection_directives: A dictionary of arguments passed to the introspection method of auto-initializing Expectations and Data Assistants to guide the introspection process.
You will find a list of the valid keys for
the
introspection_directives
dictionary and their corresponding values in
the
Introspection directives subsection of
the Additional notes
at the end of this guide.
A note on optional Data Connector keys
These key/value pairs can also be defined in the configurations for individual Data Assets, which will be shown later in this guide. If these values are defined both in the Data Connector dictionary and in a Data Asset dictionary, the definition in the Data Asset will take precedence.
A RuntimeDataConnector is used
to connect to an in-memory dataframe or
path. The dataframe or path used for a
RuntimeDataConnector is
therefore passed to the
RuntimeDataConnector as part of
a Batch Request, rather than being a static
part of the
RuntimeDataConnector's
configuration.
A Runtime Data Connector will always only
return one Batch of data: the
current data that was passed in or
specified as part of a Batch Request. This
means that a
RuntimeDataConnector does not
define Data Assets like an
InferredAssetDataConnector or a ConfiguredAssetDataConnector would.
Instead, a Runtime Data Connector's configuration provides a way for you to attach identifying values to a returned Batch of data so that the data as it was at the time it was returned can be referred to again in the future.
For more information on configuring a Batch Request for a Pandas Runtime Data Connector, please see our guide on how to create a Batch of data from an in-memory Spark or Pandas dataframe or path.
Remember, the key that you provide for each Data Connector configuration dictionary will be used as the name of the Data Connector. For this example, we will use the name name_of_my_runtime_data_connector, but you may have it be anything you like.
At this point, your configuration should look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_runtime_data_connector": {},
},
}
Required Data Connector configuration keys
When defining a RuntimeDataConnector you will need to provide values for two required keys in the Data Connector's configuration dictionary (the currently empty dictionary that corresponds to "name_of_my_runtime_data_connector" in the example above).
These key/value pairs consist of:
- class_name: The name of the Class that will be instantiated for this Data Connector.
- batch_identifiers: A list of strings that will be used as keys for identifying metadata that the user provides for the returned Batch.
For this example, you will be using the
RuntimeDataConnector as your
class_name. This key/value entry
will therefore look like:
"class_name": "RuntimeDataConnector",
After including an empty list for your
batch_identifiers your full
configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_runtime_data_connector": {
"class_name": "RuntimeDataConnector",
"batch_identifiers": {},
},
},
}
Because we are using one of Great Expectations' built-in Data Connectors, an entry for module_name along with a default value will be provided when this Data Connector is initialized.
However, if you want to use a custom Data
Connector, you will need to explicitly add a
module_name key alongside the
class_name key.
The value for module_name would
then be set as the import path for the
module containing your custom Data
Connector, in the same fashion as you would
provide class_name and
module_name for a custom
Datasource or Execution Engine.
8. Configure your Data Connector's Data Assets (Splitting, sampling, etc.)
- InferredAssetSqlDataConnector
- ConfiguredAssetSqlDataConnector
- RuntimeDataConnector
In an Inferred Asset Data Connector for SQL data, all of the behaviour for how Data Assets are inferred, how they are split into Batches, and how those Batches are sampled is configured in the Data Connector dictionary. To alter any of these behaviours from the default, simply set your desired behaviour by defining one or more of the following optional key/value pairs.
Optional Data Connector configuration keys for inferring Data Assets
A Data Asset in an Inferred Asset Data Connector for SQL data will consist of a single table in the database you have connected to. You can modify the way that the Data Connector infers which tables to utilize as Data Assets by defining the following key/value pairs in your Data Connector's dictionary in the Datasource configuration:
- data_asset_name_prefix: A string describing an optional prefix to prepend to the names of inferred Data Assets.
- data_asset_name_suffix: A string describing an optional suffix to append to the names of inferred Data Assets.
- include_schema_name: A boolean value which answers the question: "Should the data_asset_name include the schema as a prefix?"
- excluded_tables: A list of tables to ignore when inferring Data Assets.
- included_tables: A list that, if defined, will limit the inferred Data Assets to those tables that are included in the list.
- skip_inapplicable_tables: A boolean value. If True, tables that can't be successfully queried using sampling and splitter methods are excluded from inferred Data Assets. If False, the class will throw an error during initialization if any such tables are encountered.
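For instance, a sketch of a Data Connector configuration that adjusts the names of inferred Data Assets (the prefix and suffix shown are arbitrary examples) might look like:
"name_of_my_inferred_data_connector": {
    "class_name": "InferredAssetSqlDataConnector",
    "data_asset_name_prefix": "taxi__",   # inferred Data Asset names become taxi__<table_name>__raw
    "data_asset_name_suffix": "__raw",
    "include_schema_name": False,         # do not prepend the schema to the Data Asset name
},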
Optional Data Connector configuration keys for splitting Data Assets into Batches
Next is the matter of how (or even if) your Data Connector splits Data Assets into Batches. By default, each Data Asset will provide a single Batch consisting of the entire table that it corresponds to. You can change this behaviour by specifying the following key/value pairs:
- splitter_method: A string that names the method that will be used to split the target table into multiple Batches.
- splitter_kwargs: A dictionary containing keyword arguments to pass to splitter_method if splitter_method has been defined.
For example, imagine that you have one or more tables containing the NYC taxi data from the getting started tutorial in your database. You could instruct your Data Connector to infer Data Assets that return each table as a single Batch by simply not including a splitter_method. Such a configuration would be identical to the data connector name_of_my_inferred_data_connector that was defined in the example at the end of step 7, so let's rename that data_connector entry to inferred_data_connector_single_batch_asset, since that is more meaningful. Your configuration for a single Batch Data Asset would now look like:
"inferred_data_connector_single_batch_asset": {
"class_name": "InferredAssetSqlDataConnector",
},
Alternatively, you could define a Data Asset
that is split into Batches based on the year and
month by defining the
splitter_method to be
split_on_year_and_month and
providing a Datetime column. (In the case of the
NYC taxi data, this would be the
pickup_datetime column.) Creating a
Data Asset like this would result in your
configuration being:
"inferred_data_connector_multi_batch_asset_split_on_date_time": {
"class_name": "InferredAssetSqlDataConnector",
"splitter_method": "split_on_year_and_month",
"splitter_kwargs": {
"column_name": "pickup_datetime",
},
},
If you included both of these Data Assets, your complete configuration would look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"inferred_data_connector_single_batch_asset": {
"class_name": "InferredAssetSqlDataConnector",
},
"inferred_data_connector_multi_batch_asset_split_on_date_time": {
"class_name": "InferredAssetSqlDataConnector",
"splitter_method": "split_on_year_and_month",
"splitter_kwargs": {
"column_name": "pickup_datetime",
},
},
},
}
If you are uncertain whether you should be splitting your Data Assets into Batches, please refer to our guide on how to choose between working with a single or multiple Batches of data.
For more information on the available splitting methods, please see the Splitting methods subsection under Additional notes at the end of this guide.
Optional Data Connector configuration keys for sampling data from returned Batches
Finally, you may wish to only sample a portion
of the data that would be returned in your Data
Asset's Batches. To do this, you will need
to define the optional keys
sampling_method and
sampling_kwargs. As with
splitter_method and
splitter_kwargs, defining these
key/value pairs in your Data Connector's
dictionary will result in those values being
applied to all Data Assets that are made
available by the Data Connector.
The key/value pairs that are used for sampling data from a Data Asset are:
- sampling_method: A string that names the method that will be used to sample data from returned Batches.
- sampling_kwargs: A dictionary containing keyword arguments to pass to sampling_method if sampling_method has been defined.
Although this guide will not use sampling in its examples, there is a list of the available sampling methods in the Sampling methods subsection of the Additional notes section at the end of this guide.
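If you did want to apply sampling at the Data Connector level, a minimal sketch (using the sample_using_limit method listed in the Additional notes, with a hypothetical row limit) might look like:
"inferred_data_connector_sampled": {
    "class_name": "InferredAssetSqlDataConnector",
    "sampling_method": "sample_using_limit",  # sample rows instead of returning full Batches
    "sampling_kwargs": {
        "n": 1000,  # hypothetical limit: keep up to the first 1,000 rows of each Batch
    },
},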
In a Configured Asset Data Connector for SQL
data, each entry in the
assets dictionary will correspond
to an explicitly defined Data Asset. The key
provided will be used as the name of the Data
Asset, while the value will be a dictionary that
specifies how that Data Asset is defined.
These key/value pairs are technically all optional. If none of them are defined, the Data Asset will correspond to the table whose name matches the key of that (otherwise empty) Data Asset dictionary.
Optional Data Asset configuration keys for Data Asset names and schemas:
You may provide the following key/value pairs in your Data Asset configuration to alter how your Data Asset behaves regarding its associated table:
- table_name: A string that, if defined, is used as the name for the Data Asset. If this is not defined, the default name will be that of the key corresponding to the Data Asset's dictionary in your configuration.
- schema_name: An optional string that defines the schema for the Data Asset.
- include_schema_name: A boolean value that determines whether the schema_name should be included as a prefix to the Data Asset's name.
For example, imagine that you have a copy of the NYC taxi data from the getting started tutorial in a table called yellow_tripdata_sample_2020, in the main schema of your database. You could access this data by defining an entry in the assets dictionary like:
"yellow_tripdata_sample_2020_full": {
"table_name": "yellow_tripdata_sample_2020",
"schema_name": "main",
},
Optional Data Asset configuration keys for splitting Data Assets into Batches
Next is the matter of how (or even if) your Data Connector splits Data Assets into Batches. By default, each Data Asset will provide a single Batch consisting of the entire table that it corresponds to. You can change this behaviour by specifying the following key/value pairs:
- splitter_method: A string that names the method that will be used to split the target table into multiple Batches.
- splitter_kwargs: A dictionary containing keyword arguments to pass to splitter_method if splitter_method has been defined.
The configuration we provided above does not
include a splitter_method, and
therefore will return the entire table as a
single Batch.
Alternatively, you could define a Data Asset
that is split into Batches based on the year and
month by defining the
splitter_method to be
split_on_year_and_month and
providing a Datetime column. (In the case of the
NYC taxi data, this would be the
pickup_datetime column.) Creating a
Data Asset like this would result in the
following key/value pair in your
assets dictionary:
"yellow_tripdata_sample_2020_by_year_and_month": {
"table_name": "yellow_tripdata_sample_2020",
"schema_name": "main",
"splitter_method": "split_on_year_and_month",
"splitter_kwargs": {
"column_name": "pickup_datetime",
},
},
If you included both of these Data Assets, your complete configuration would look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_configured_data_connector": {
"class_name": "ConfiguredAssetSqlDataConnector",
"assets": {
"yellow_tripdata_sample_2020_full": {
"table_name": "yellow_tripdata_sample_2020",
"schema_name": "main",
},
"yellow_tripdata_sample_2020_by_year_and_month": {
"table_name": "yellow_tripdata_sample_2020",
"schema_name": "main",
"splitter_method": "split_on_year_and_month",
"splitter_kwargs": {
"column_name": "pickup_datetime",
},
},
},
},
},
}
If you are uncertain whether you should be splitting your Data Assets into Batches, please refer to our guide on how to choose between working with a single or multiple Batches of data.
The splitter_method and
splitter_kwargs key/value pairs can
be defined at either the Data Connector or the
Data Asset level. If they are defined at the
Data Connector level, they will apply to all
Data Assets that do not have an alternative
definition in their configuration. If they are
defined at the Data Asset level, the Data Asset
definition will take precedence over the Data
Connector definition.
You can think of the Data Connector level
definition of splitter_method and
splitter_kwargs as a way to define
default values of these keys for your Data
Assets. The Data Asset level definitions, then,
would be Data Asset specific definitions that
overwrite those defaults.
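As a sketch of that precedence in practice (reusing the table and column from this example), a connector-level splitter_method supplies the default while an individual Data Asset can override it:
"name_of_my_configured_data_connector": {
    "class_name": "ConfiguredAssetSqlDataConnector",
    # Connector-level default: split Data Assets by year and month.
    "splitter_method": "split_on_year_and_month",
    "splitter_kwargs": {"column_name": "pickup_datetime"},
    "assets": {
        # Inherits the connector-level year-and-month splitting.
        "yellow_tripdata_sample_2020_by_year_and_month": {
            "table_name": "yellow_tripdata_sample_2020",
            "schema_name": "main",
        },
        # Overrides the default and returns the whole table as a single Batch.
        "yellow_tripdata_sample_2020_full": {
            "table_name": "yellow_tripdata_sample_2020",
            "schema_name": "main",
            "splitter_method": "split_on_whole_table",
        },
    },
},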
For more information on the available splitting methods, please see the Splitting methods subsection under Additional notes at the end of this guide.
Optional Data Asset configuration keys for sampling data from returned Batches
Finally, you may wish to only sample a portion
of the data that would be returned in your Data
Asset's Batches. To do this, you will need
to define the optional keys
sampling_method and
sampling_kwargs. As with
splitter_method and
splitter_kwargs, defining these
key/value pairs in your Data Connector's
dictionary will result in those values being
applied to all Data Assets that are made
available by the Data Connector.
The key/value pairs that are used for sampling data from a Data Asset are:
- sampling_method: A string that names the method that will be used to sample data from returned Batches.
- sampling_kwargs: A dictionary containing keyword arguments to pass to sampling_method if sampling_method has been defined.
As with splitter_method and
splitter_kwargs,
sampling_method and
sampling_kwargs can be defined at
either the Data Connector or the Data Asset
level. When defined at the Data Connector level,
the definition acts as a default that applies to
all Data Assets that do not have their own
specific definition for the two keys. If defined
at the Data Asset level, the values will take
precedence over any that are defined in the Data
Connector dictionary.
Although this guide will not use sampling in its examples, there is a list of the available sampling methods in the Sampling methods subsection of the Additional notes section at the end of this guide.
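For illustration only, an asset-level sampling definition (using the sample_using_random method listed in the Additional notes, with a hypothetical fraction) might look like:
"yellow_tripdata_sample_2020_sampled": {
    "table_name": "yellow_tripdata_sample_2020",
    "schema_name": "main",
    "sampling_method": "sample_using_random",  # sample rows instead of returning the full table
    "sampling_kwargs": {
        "p": 0.1,  # hypothetical fraction: keep roughly 10% of rows, selected at random
    },
},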
Optional Data Asset configuration key for defining introspection behaviour
Finally, there is an optional key that can be defined to alter the default behaviour of introspection methods such as those used by auto-initializing Expectations and Data Assistants. This key is:
- introspection_directives: A dictionary of arguments passed to the introspection method of auto-initializing Expectations and Data Assistants to guide the introspection process.
Similar to the splitting and sampling key/value pairs, this key can be defined at either the Data Connector or the Data Asset level. A definition at the Data Connector level will be applied to all Data Assets, while a definition at the Data Asset level will take precedence over the Data Connector's values.
You will find a list of the valid keys for
the
introspection_directives
dictionary and their corresponding values in
the
Introspection directives subsection of
the Additional notes
at the end of this guide.
Runtime Data Connectors put a wrapper around a single Batch of data, and therefore do not support Data Asset configurations that permit the return of more than one Batch of data. In fact, since you will use a Batch Request to pass in or specify the data that a Runtime Data Connector uses, there is no need to specify a Data Asset configuration at all.
Instead, you will provide a
batch_identifiers list which will
be used to attach identifying information to a
returned Batch so that you can reference the
same data again in the future.
For this example, let's assume we have the following tables in our database:
- yellow_tripdata_sample_2020-01
- yellow_tripdata_sample_2020-02
- yellow_tripdata_sample_2020-03
With a Runtime Data Connector you won't actually refer to them in your configuration! As mentioned above, you will provide the name of one of those tables to the Data Connector as part of a Batch Request.
Therefore, the table names are inconsequential
to your Runtime Data Connector's
configuration. In fact, the
batch_identifiers that you define
in your Runtime Data Connector's
configuration can be completely arbitrary.
However, it is advised you name them after
something meaningful regarding your data or the
circumstances under which you will be accessing
your data.
For instance, let's assume you are getting a daily update to your data, and so you are running daily validations. You could then choose to identify your Runtime Data Connector's Batches by the timestamp at which they are requested.
To do this, you would simply add a
batch_timestamp entry in your
batch_identifiers list. This would
look like:
"batch_identifiers": ["batch_timestamp"],
Then, when you create your Batch Request you
would populate the
batch_timestamp value in its
batch_identifiers dictionary with
the value of the current date and time. This
will attach the current date and time to the
returned Batch, allowing you to reference the
Batch again in the future even if the current
data (the data that would be provided by the
Runtime Data Connector if you requested a new
Batch) had changed.
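As a sketch of what that future Batch Request might look like (this assumes the RuntimeBatchRequest class from great_expectations.core.batch, a hypothetical Data Asset name, and a hypothetical query against one of the tables listed above):
import datetime

from great_expectations.core.batch import RuntimeBatchRequest

batch_request = RuntimeBatchRequest(
    datasource_name="my_datasource_name",
    data_connector_name="name_of_my_runtime_data_connector",
    data_asset_name="my_runtime_asset_name",  # hypothetical name used to label this data
    runtime_parameters={
        # hypothetical query specifying the data that this Batch should contain
        "query": 'SELECT * FROM "yellow_tripdata_sample_2020-01"',
    },
    batch_identifiers={
        # attach the current timestamp so this Batch can be referenced again later
        "batch_timestamp": datetime.datetime.now().isoformat(),
    },
)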
The full configuration for your Datasource should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_runtime_data_connector": {
"class_name": "RuntimeDataConnector",
"batch_identifiers": ["batch_timestamp"],
},
},
}
We stated above that the names that you use
for your batch_identifiers in a
Runtime Data Connector's configuration
can be completely arbitrary, and will be
used as keys for the
batch_identifiers dictionary in
future Batch Requests.
However, the same holds true for the
values you pass in for each key in
your Batch Request's
batch_identifiers!
Always make sure that your Batch Requests
utilizing Runtime Data Connectors are
providing meaningful identifying
information, consistent with the keys that
are derived from the
batch_identifiers you have
defined in your Runtime Data
Connector's configuration.
9. Test your configuration with .test_yaml_config(...)
Now that you have a full Datasource configuration, you
can confirm that it is valid by testing it with the
.test_yaml_config(...) method. To do
this, execute the Python code:
data_context.test_yaml_config(yaml.dump(datasource_config))
When executed, test_yaml_config will
instantiate the component described by the yaml
configuration that is passed in and then run a self
check procedure to verify that the component works as
expected.
For a Datasource, this includes:
- confirming that the connection works
- gathering a list of available Data Assets
- verifying that at least one Batch can be fetched from the Datasource
For more information on the
.test_yaml_config(...) method, please see
our guide on
how to configure
DataContext components using
test_yaml_config.
10. (Optional) Add more Data Connectors to your configuration
The data_connectors dictionary in your
datasource_config can contain multiple
entries. If you want to add additional Data
Connectors, just go through the process starting at
step 7 again.
11. Add your new Datasource to your Data Context
Now that you have verified that you have a valid configuration you can add your new Datasource to your Data Context with the command:
data_context.add_datasource(**datasource_config)
If the value of
datasource_config["name"]
corresponds to a Datasource that is already
defined in your Data Context, then using the above
command will overwrite the existing Datasource.
If you want to ensure that you only add a Datasource when it won't overwrite an existing one, you can use the following code instead:
# add_datasource only if it doesn't already exist in your Data Context
try:
data_context.get_datasource(datasource_config["name"])
except ValueError:
data_context.add_datasource(**datasource_config)
else:
print(
f"The datasource {datasource_config['name']} already exists in your Data Context!"
)
Next steps
Congratulations! You have fully configured a Datasource and verified that it can be used in future workflows to provide a Batch or Batches of data.
For more information on using Batch Requests to retrieve data, please see our guide on how to get one or more Batches of data from a configured Datasource.
You can now move forward and create Expectations for your Datasource.
Additional notes
Splitting methods
- The names of splitter_method values can be specified with or without a preceding underscore.
Available methods for splitter_method values and their configuration parameters:

| splitter_method | splitter_kwargs | Returned Batch Data |
|---|---|---|
| split_on_whole_table | N/A | Identical to original |
| split_on_column_value | column_name='col' | Rows where the value of column_name is the same |
| split_on_year | column_name='col' | Rows where the year of a datetime column is the same |
| split_on_year_and_month | column_name='col' | Rows where the year and month of a datetime column are the same |
| split_on_year_and_month_and_day | column_name='col' | Rows where the year, month and day of a datetime column are the same |
| split_on_date_parts | column_name='col', date_parts='<list[DatePart]>' | Rows where the date parts of a datetime column are the same. Date parts can be specified as DatePart objects or as their string equivalents, e.g. "year", "month", "week", "day", "hour", "minute", or "second" |
| split_on_divided_integer | column_name='col', divisor=<int> | Rows where the value of column_name divided (using integer division) by the given divisor is the same |
| split_on_mod_integer | column_name='col', mod=<int> | Rows where the value of column_name modulo the given mod is the same |
| split_on_multi_column_values | column_names='<list[col]>' | Rows where the values of column_names are the same |
| split_on_converted_datetime | column_name='col', date_format_string=<'%Y-%m-%d'> | Rows where the value of column_name converted to a datetime using the given date_format_string is the same |
| split_on_hashed_column | column_name='col', hash_digits=<int> | Rows where the value of column_name hashed (using the "md5" hash function) and truncated to the stated number of hash_digits is the same (experimental) |
Sampling methods
- The names of sampling_method values can be specified with or without a preceding underscore.
Available methods for sampling_method values and their configuration parameters:

| sampling_method | sampling_kwargs | Returned Batch Data |
|---|---|---|
| sample_using_limit | n=num_rows | The first (up to) n rows of the Batch, where n is the specified limit |
| sample_using_random | p=fraction | Rows selected at random, the number of which amounts to the selected fraction of the total number of rows in the Batch |
| sample_using_mod | column_name='col', mod=<int> | Takes the mod of the named column, and only keeps rows that match the given value |
| sample_using_a_list | column_name='col', value_list=<list[val]> | Matches the values in the named column against value_list, and only keeps the matches |
| sample_using_hash | column_name='col', hash_digits=<int>, hash_value=<str> | Hashes the values in the named column (using the "md5" hash function), and only keeps rows that match the given hash_value |
Introspection directives
Valid keys for the introspection_directives dictionary include:
- schema_name: A string describing the schema to introspect (default is None). If this is provided, the introspection will be limited to the specified schema.
- ignore_information_schemas_and_system_tables: A boolean value (default=True) which determines whether to ignore information schemas and system tables when introspecting the database.
- system_tables: An optional list of strings that define system_tables for your database.
- information_schemas: An optional list of strings that define information_schemas for your database.
- include_views: A boolean value (default=True) which determines whether to include views when introspecting the database.
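As a sketch of how these directives fit into a Data Connector configuration (the specific values shown are illustrative assumptions, not recommendations):
"name_of_my_inferred_data_connector": {
    "class_name": "InferredAssetSqlDataConnector",
    "introspection_directives": {
        "schema_name": "main",   # limit introspection to the main schema
        "include_views": False,  # skip views when introspecting the database
    },
},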