Version: 0.15.50

How to configure a SQL Datasource

This guide will walk you through the process of configuring a SQL Datasource from scratch, verifying that your configuration is valid, and adding it to your Data Context. By the end of this guide you will have a SQL Datasource which you can use in future workflows for creating Expectations and Validating data.

Steps

1. Import necessary modules and initialize your Data Context

from ruamel import yaml

import great_expectations as gx

data_context: gx.DataContext = gx.get_context()

The great_expectations module will give you access to your Data Context, which is the entry point for working with a Great Expectations project.

The yaml module from ruamel will be used to validate your Datasource's configuration. Great Expectations uses a Python dictionary representation of your Datasource configuration when you add your Datasource to your Data Context. However, Great Expectations saves configurations as YAML files, so when you validate your configuration you will first need to convert it from a Python dictionary to a YAML string.

The Data Context initialized by get_context() will be the Data Context defined in your current working directory. It provides convenience methods that we will use to validate your Datasource configuration and to add your Datasource to your Great Expectations project once you have configured it.

2. Create a new Datasource configuration.

A new Datasource can be configured in Python as a dictionary with a specific set of keys. We will build our Datasource configuration from scratch in this guide, although you can just as easily modify an existing one.

To start, create an empty dictionary. You will be populating it with keys as you go forward.

At this point, the configuration for your Datasource is merely:

datasource_config: dict = {}

However, from this humble beginning you will be able to build a full Datasource configuration.

The keys needed for your Datasource configuration

At the top level, your Datasource's configuration will need the following keys:

  • name: The name of the Datasource, which will be used to reference the Datasource in Batch Requests.
  • class_name: The name of the Python class instantiated by the Datasource. Typically, this will be the Datasource class.
  • module_name: The name of the module that contains the class definition indicated by class_name.
  • execution_engine: A dictionary containing the class_name and module_name of the Execution Engine instantiated by the Datasource.
  • data_connectors: The configurations for any Data Connectors and their associated Data Assets that you want to have available when utilizing the Datasource.

In the following steps we will add those keys and their corresponding values to your currently empty Datasource configuration dictionary.

3. Name your Datasource

The first key that you will need to define for your new Datasource is its name. You will use this to reference the Datasource in future workflows. It can be anything you want it to be, but ideally you will name it something relevant to the data that it interacts with.

For the purposes of this example, we will name this Datasource:

"name": "my_datasource_name",  # Preferably name it something relevant

You should, however, name your Datasource something more relevant to your data.

At this point, your configuration should now look like:

datasource_config: dict = {
    "name": "my_datasource_name",  # Preferably name it something relevant
}

4. Specify the Datasource class and module

The class_name and module_name for your Datasource will almost always indicate the Datasource class found at great_expectations.datasource. You may replace this with a specialized subclass, or a custom class, but for almost all regular purposes these two default values will suffice. For the purposes of this guide, add those two values to their corresponding keys.

"class_name": "Datasource",
"module_name": "great_expectations.datasource"

Your full configuration should now look like:

datasource_config: dict = {
    "name": "my_datasource_name",  # Preferably name it something relevant
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
}

5. Add the SqlAlchemy Execution Engine to your Datasource configuration

Your Execution Engine is where you will specify that you want this Datasource to use SQL in the backend. As with the Datasource top level configuration, you will need to provide the class_name and module_name that indicate the class definition and containing module for the Execution Engine that you will use.

For the purposes of this guide, these will consist of the SqlAlchemyExecutionEngine found at great_expectations.execution_engine. The execution_engine key and its corresponding value will therefore look like this:

"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
},

Additionally, your execution_engine dictionary will require a value for either connection_string or credentials. You will only need to use one of these keys, as they each serve the same purpose: to provide the parameters necessary for the SqlAlchemyExecutionEngine to connect to your desired database. For the purposes of this guide we will use the connection_string key, the value of which will be the string representation of the information necessary to connect to your SQL database. This entry will look like:

"connection_string": CONNECTION_STRING,
tip

Your connection string will vary depending on the type of SQL database you are connecting to. For more information on how to configure your connection string, please see the appropriate guide for connecting to a specific Database.
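
For illustration only, SQLAlchemy connection strings follow a URL pattern like the sketches below; the dialects, drivers, and placeholder values shown here are assumptions for the example, not values required by this guide.

# A hypothetical PostgreSQL connection string (placeholder values):
CONNECTION_STRING = "postgresql+psycopg2://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>"

# A hypothetical SQLite connection string pointing at a local database file:
# CONNECTION_STRING = "sqlite:///<PATH_TO_DB_FILE>"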

After adding the above snippets to your Datasource configuration, your full configuration dictionary should now look like:

datasource_config: dict = {
    "name": "my_datasource_name",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "class_name": "SqlAlchemyExecutionEngine",
        "module_name": "great_expectations.execution_engine",
        "connection_string": CONNECTION_STRING,
    },
}
Tip: Using credentials instead of connection_string

The credentials key uses a dictionary to provide the elements of your connection string as separate, individual values. For information on how to populate the credentials dictionary and how to configure your great_expectations.yml project config file to populate credentials from either a YAML file or a secret manager, please see our guide on How to configure credentials.
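
As a rough sketch, the credentials dictionary below mirrors the components of a SQLAlchemy connection URL for a hypothetical PostgreSQL backend; the key names and all of the placeholder values are illustrative assumptions, so treat the How to configure credentials guide as the authoritative reference.

"execution_engine": {
    "class_name": "SqlAlchemyExecutionEngine",
    "module_name": "great_expectations.execution_engine",
    "credentials": {
        "drivername": "postgresql+psycopg2",  # placeholder dialect+driver
        "host": "<HOST>",
        "port": "<PORT>",
        "username": "<USERNAME>",
        "password": "<PASSWORD>",
        "database": "<DATABASE>",
    },
},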

Info: Additional (optional) Execution Engine keys

In addition to substituting credentials for connection_string, there are a few other optional keys that you can add to your execution_engine dictionary.

The first of these is the optional key create_temp_table. Creating a temp table to improve query efficiency is the default behaviour of the SqlAlchemyExecutionEngine; however, you can set create_temp_table to False if you don't have permission to create a temp table on the database you are working with.

The other valid keys are not actually defined by the Great Expectations code. Instead, any other key/value pairs that you add to the Execution Engine configuration will be passed directly to SqlAlchemy's create_engine(...) method. This allows you to fully leverage all the capabilities supported by SqlAlchemy through that method.

For more information on what keys create_engine(...) will accept as parameters and how you can utilize this feature, please refer to SqlAlchemy's documentation.
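
As a sketch of both of these options, the execution_engine dictionary below sets create_temp_table to False and passes one extra keyword argument through to create_engine(...). The echo parameter is a standard SQLAlchemy create_engine argument chosen purely for illustration; it is not something this guide requires.

"execution_engine": {
    "class_name": "SqlAlchemyExecutionEngine",
    "module_name": "great_expectations.execution_engine",
    "connection_string": CONNECTION_STRING,
    "create_temp_table": False,  # skip temp table creation if you lack permissions
    "echo": True,  # illustrative extra kwarg; forwarded to sqlalchemy.create_engine(...)
},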

6. Add a dictionary as the value of the data_connectors key

The data_connectors key should have a dictionary as its value. Each key/value pair in this dictionary will correspond to a Data Connector's name and configuration, respectively.

The keys in the data_connectors dictionary will be the names of the Data Connectors, which you will use to indicate which Data Connector to use in future workflows. As with the value of your Datasource's name key, you can use any value you want for a Data Connector's name. Ideally, you will use something relevant to the data that each particular Data Connector will provide; the only significant difference is that a Data Connector's name is its key in the data_connectors dictionary.

The values for each of your data_connectors keys will be the Data Connector configurations that correspond to each Data Connector's name. You may define multiple Data Connectors in the data_connectors dictionary by including multiple key/value pairs.

For now, start by adding an empty dictionary as the value of the data_connectors key. We will begin populating it with Data Connector configurations in the next step.

Your current configuration should look like:

datasource_config: dict = {
    "name": "my_datasource_name",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "class_name": "SqlAlchemyExecutionEngine",
        "module_name": "great_expectations.execution_engine",
        "connection_string": CONNECTION_STRING,
    },
    "data_connectors": {},
}

7. Configure your individual Data Connectors (Splitting, sampling, etc.)

For each Data Connector configuration, you will need to specify which type of Data Connector you will be using. For SQL, the most likely ones will be the InferredAssetSqlDataConnector, the ConfiguredAssetSqlDataConnector, and the RuntimeDataConnector.

Reminder

If you are uncertain which Data Connector best suits your needs, please refer to our guide on how to choose which Data Connector to use.

Data Connector example configurations:

tip

The InferredAssetDataConnector is ideal for:

  • quickly setting up a Datasource and getting access to data
  • diving straight in to working with Great Expectations
  • initial data discovery and introspection

However, the InferredAssetDataConnector allows less control over the definitions of your Data Assets than the ConfiguredAssetDataConnector provides.

If you are at the point of building a repeatable workflow, we encourage using the ConfiguredAssetDataConnector instead.

Remember, the key that you provide for each Data Connector configuration dictionary will be used as the name of the Data Connector. For this example, we will use the name name_of_my_inferred_data_connector, but you may have it be anything you like.

At this point, your configuration should look like:

datasource_config: dict = {
    "name": "my_datasource_name",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "class_name": "SqlAlchemyExecutionEngine",
        "module_name": "great_expectations.execution_engine",
        "connection_string": CONNECTION_STRING,
    },
    "data_connectors": {
        "name_of_my_inferred_data_connector": {},
    },
}

Required Data Connector configuration keys

When defining an InferredAssetSqlDataConnector you will need to provide a value for one required key in the Data Connector's configuration dictionary (the currently empty dictionary that corresponds to "name_of_my_inferred_data_connector" in the example above). This key/value pair consists of:

  • class_name: The name of the Class that will be instantiated for this DataConnector.

For this example, you will be using the InferredAssetSqlDataConnector as your class_name. This is a subclass of the InferredAssetDataConnector that is specialized to support SQL Execution Engines, such as the SqlAlchemyExecutionEngine. This key/value entry will therefore look like:

"class_name": "InferredAssetSqlDataConnector",
tip

Because we are using one of Great Expectations' built-in Data Connectors, an entry for module_name along with a default value will be provided when this Data Connector is initialized.

However, if you want to use a custom Data Connector, you will need to explicitly add a module_name key alongside the class_name key.

The value for module_name would then be set as the import path for the module containing your custom Data Connector, in the same fashion as you would provide class_name and module_name for a custom Datasource or Execution Engine.
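
As a purely hypothetical sketch of that case, an entry for a custom Data Connector would pair both keys; the class and module names below are invented for illustration.

"name_of_my_custom_data_connector": {
    "class_name": "MyCustomDataConnector",      # hypothetical custom class
    "module_name": "path.to.my_custom_module",  # hypothetical import path for that class
},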

With the class_name value added, your full configuration should now look like:

datasource_config: dict = {
    "name": "my_datasource_name",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "class_name": "SqlAlchemyExecutionEngine",
        "module_name": "great_expectations.execution_engine",
        "connection_string": CONNECTION_STRING,
    },
    "data_connectors": {
        "name_of_my_inferred_data_connector": {
            "class_name": "InferredAssetSqlDataConnector",
        },
    },
}

Optional Data Connector configuration key for defining introspection behaviour

There is an optional key that can be defined for your Data Connector to alter the default behaviour of introspection methods such as those used by auto-initializing Expectations and Data Assistants. This key is:

  • introspection_directives: A dictionary of arguments passed to the introspection method of auto-initializing Expectations and Data Assistants to guide the introspection process.
tip

You will find a list of the valid keys for the introspection_directives dictionary and their corresponding values in the Introspection directives subsection of the Additional notes at the end of this guide.
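
For illustration, a Data Connector entry that limits introspection to a single schema might look like the sketch below; the "public" schema name is a hypothetical value rather than one taken from this guide.

"name_of_my_inferred_data_connector": {
    "class_name": "InferredAssetSqlDataConnector",
    "introspection_directives": {
        "schema_name": "public",  # hypothetical schema; restricts introspection to this schema
    },
},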

There are also optional keys that you can define to alter how Data Assets are inferred, how they are split into Batches, and how data is sampled from Batches. This guide will go over those in the next section.

8. Configure your Data Connector's Data Assets (Splitting, sampling, etc.)

In an Inferred Asset Data Connector for SQL data, all the behaviour for how Data Assets are inferred, how they are split into Batches, and how those Batches are sampled for data are configured in the Data Connector dictionary. To alter any of these behaviours from the default, you will simply set your desired behaviour by defining one of the following optional key/value pairs.

Optional Data Connector configuration keys for inferring Data Assets

A Data Asset in an Inferred Asset Data Connector for SQL data will consist of a single table in the database you have connected to. You can modify the way that the Data Connector infers which tables to utilize as Data Assets by defining the following key/value pairs in your Data Connector's dictionary in the Datasource configuration:

  • data_asset_name_prefix: A string describing an optional prefix to prepend to the names of inferred Data Assets.
  • data_asset_name_suffix: A string describing an optional suffix to append to the names of inferred Data Assets.
  • include_schema_name: A boolean value which answers the question: "Should the data_asset_name include the schema as a prefix?"
  • excluded_tables: A list of tables to ignore when inferring Data Assets.
  • included_tables: A list that, if defined, will limit the inferred Data Assets to those tables that are included in the list.
  • skip_inapplicable_tables: A boolean value. If True, tables that can't be successfully queried using sampling and splitter methods are excluded from inferred Data Assets. If False, the class will throw an error during initialization if any such tables are encountered.
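
As a sketch combining a few of the keys above, the Data Connector below prepends a prefix to inferred Data Asset names, includes the schema name, and skips one table; the prefix and table name are illustrative assumptions, and whether excluded table names need to be schema-qualified may depend on your database.

"name_of_my_inferred_data_connector": {
    "class_name": "InferredAssetSqlDataConnector",
    "data_asset_name_prefix": "taxi__",             # hypothetical prefix
    "include_schema_name": True,
    "excluded_tables": ["public.alembic_version"],  # hypothetical housekeeping table to skip
},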

Optional Data Connector configuration keys for splitting Data Assets into Batches

Next is the matter of how (or even if) your Data Connector splits Data Assets into Batches. By default, each Data Asset will provide a single Batch consisting of the entire table that it corresponds to. You can change this behaviour by specifying the following key/value pairs:

  • splitter_method: A string that names the method that will be used to split the target table into multiple Batches.
  • splitter_kwargs: A dictionary containing keyword arguments to pass to splitter_method if splitter_method has been defined.

For example, imagine that you have one or more tables containing the NYC taxi data from the getting started tutorial in your database. You could instruct your Data Connector to infer Data Assets that return each table as a single Batch simply by not including a splitter_method. Such a configuration would be identical to the Data Connector name_of_my_inferred_data_connector that was defined in the example at the end of step 7, so let's rename that data_connectors entry to inferred_data_connector_single_batch_asset, since that is more meaningful. Your configuration for a single-Batch Data Asset would now look like:

"inferred_data_connector_single_batch_asset": {
"class_name": "InferredAssetSqlDataConnector",
},

Alternatively, you could define a Data Asset that is split into Batches based on the year and month by defining the splitter_method to be split_on_year_and_month and providing a Datetime column. (In the case of the NYC taxi data, this would be the pickup_datetime column.) Creating a Data Asset like this would result in your configuration being:

"inferred_data_connector_multi_batch_asset_split_on_date_time": {
"class_name": "InferredAssetSqlDataConnector",
"splitter_method": "split_on_year_and_month",
"splitter_kwargs": {
"column_name": "pickup_datetime",
},
},

If you included both of these Data Assets, your complete configuration would look like:

datasource_config: dict = {
    "name": "my_datasource_name",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "class_name": "SqlAlchemyExecutionEngine",
        "module_name": "great_expectations.execution_engine",
        "connection_string": CONNECTION_STRING,
    },
    "data_connectors": {
        "inferred_data_connector_single_batch_asset": {
            "class_name": "InferredAssetSqlDataConnector",
        },
        "inferred_data_connector_multi_batch_asset_split_on_date_time": {
            "class_name": "InferredAssetSqlDataConnector",
            "splitter_method": "split_on_year_and_month",
            "splitter_kwargs": {
                "column_name": "pickup_datetime",
            },
        },
    },
}
Reminder

If you are uncertain whether you should be splitting your Data Assets into Batches, please refer to our guide on how to choose between working with a single or multiple Batches of data.

tip

For more information on the available splitting methods, please see the Splitting methods subsection under Additional notes at the end of this guide.

Optional Data Connector configuration keys for sampling data from returned Batches

Finally, you may wish to only sample a portion of the data that would be returned in your Data Asset's Batches. To do this, you will need to define the optional keys sampling_method and sampling_kwargs. As with splitter_method and splitter_kwargs, defining these key/value pairs in your Data Connector's dictionary will result in those values being applied to all Data Assets that are made available by the Data Connector.

The key/value pairs that are used for sampling data from a Data Asset are:

  • sampling_method: A string that names the method that will be used to sample data from returned Batches.
  • sampling_kwargs: A dictionary containing keyword arguments to pass to sampling_method if sampling_method has been defined.
tip

Although this guide will not use sampling in its examples, there is a list of the available sampling methods in the Sampling methods subsection of the Additional notes section at the end of this guide.
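
As a sketch of how these keys fit into a Data Connector, the entry below limits every Batch returned by the connector's Data Assets to its first 1,000 rows using sample_using_limit from the table in the Additional notes; the connector name and row count are illustrative choices, not values this guide depends on.

"inferred_data_connector_sampled_asset": {
    "class_name": "InferredAssetSqlDataConnector",
    "sampling_method": "sample_using_limit",
    "sampling_kwargs": {
        "n": 1000,  # keep at most the first 1,000 rows of each Batch
    },
},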

9. Test your configuration with .test_yaml_config(...)

Now that you have a full Datasource configuration, you can confirm that it is valid by testing it with the .test_yaml_config(...) method. To do this, execute the Python code:

data_context.test_yaml_config(yaml.dump(datasource_config))

When executed, test_yaml_config will instantiate the component described by the YAML configuration that is passed in and then run a self-check procedure to verify that the component works as expected.

For a Datasource, this includes:

  • confirming that the connection works
  • gathering a list of available Data Assets
  • verifying that at least one Batch can be fetched from the Datasource

For more information on the .test_yaml_config(...) method, please see our guide on how to configure DataContext components using test_yaml_config.

10. (Optional) Add more Data Connectors to your configuration

The data_connectors dictionary in your datasource_config can contain multiple entries. If you want to add additional Data Connectors, just go through the process starting at step 7 again.

11. Add your new Datasource to your Data Context

Now that you have verified that you have a valid configuration, you can add your new Datasource to your Data Context with the command:

data_context.add_datasource(**datasource_config)
caution

If the value of datasource_config["name"] corresponds to a Datasource that is already defined in your Data Context, then using the above command will overwrite the existing Datasource.

tip

If you want to ensure that you only add a Datasource when it won't overwrite an existing one, you can use the following code instead:

# add_datasource only if it doesn't already exist in your Data Context
try:
    data_context.get_datasource(datasource_config["name"])
except ValueError:
    data_context.add_datasource(**datasource_config)
else:
    print(
        f"The datasource {datasource_config['name']} already exists in your Data Context!"
    )

Next steps

Congratulations! You have fully configured a Datasource and verified that it can be used in future workflows to provide a Batch or Batches of data.

tip

For more information on using Batch Requests to retrieve data, please see our guide on how to get one or more Batches of data from a configured Datasource.

You can now move forward and create Expectations for your Datasource.

Additional notes

Splitting methods

NOTES ON SPLITTER METHODS
  • The names of splitter_method values can be specified with or without a preceding underscore.

Available methods for splitter_method values and their configuration parameters:

| splitter_method | splitter_kwargs | Returned Batch Data |
| --- | --- | --- |
| split_on_whole_table | N/A | Identical to the original table |
| split_on_column_value | column_name='col' | Rows where the value of column_name is the same |
| split_on_year | column_name='col' | Rows where the year of a datetime column is the same |
| split_on_year_and_month | column_name='col' | Rows where the year and month of a datetime column are the same |
| split_on_year_and_month_and_day | column_name='col' | Rows where the year, month, and day of a datetime column are the same |
| split_on_date_parts | column_name='col', date_parts='<list[DatePart]>' | Rows where the date parts of a datetime column are the same. Date parts can be specified as DatePart objects or as their string equivalents, e.g. "year", "month", "week", "day", "hour", "minute", or "second" |
| split_on_divided_integer | column_name='col', divisor=<int> | Rows where the value of column_name divided (using integer division) by the given divisor is the same |
| split_on_mod_integer | column_name='col', mod=<int> | Rows where the value of column_name taken modulo the given mod is the same |
| split_on_multi_column_values | column_names='<list[col]>' | Rows where the values of column_names are the same |
| split_on_converted_datetime | column_name='col', date_format_string=<'%Y-%m-%d'> | Rows where the value of column_name converted to a datetime using the given date_format_string is the same |
| split_on_hashed_column | column_name='col', hash_digits=<int> | Rows where the value of column_name hashed (using the "md5" hash function) and truncated to the given number of hash_digits is the same (experimental) |

Sampling methods

NOTES ON SAMPLING METHODS
  • The names of sampling_method values can be specified with or without a preceding underscore.

Available methods for sampling_method values and their configuration parameters:

| sampling_method | sampling_kwargs | Returned Batch Data |
| --- | --- | --- |
| sample_using_limit | n=num_rows | The first (up to) n rows of the Batch, where n is the limit parameter |
| sample_using_random | p=fraction | Rows selected at random, amounting to the given fraction of the total number of rows in the Batch |
| sample_using_mod | column_name='col', mod=<int> | Take the mod of the named column, and only keep rows that match the given value |
| sample_using_a_list | column_name='col', value_list=<list[val]> | Match the values in the named column against value_list, and only keep the matches |
| sample_using_hash | column_name='col', hash_digits=<int>, hash_value=<str> | Hash the values in the named column (using the "md5" hash function), and only keep rows that match the given hash_value |

Introspection directives

Valid keys for the introspection directives dictionary include:

  • schema_name: A string naming the schema to introspect (default is None). If this is provided, the introspection will be limited to the specified schema.
  • ignore_information_schemas_and_system_tables: A boolean value (default=True) which determines whether to ignore information schemas and system tables when introspecting the database.
  • system_tables: An optional list of strings that define system_tables for your database.
  • information_schemas: An optional list of strings that define information_schemas for your database.
  • include_views: A boolean value (default=True) which determines whether to include views when introspecting the database.