How to configure a Spark Datasource
This guide will walk you through the process of configuring a Spark Datasource from scratch, verifying that your configuration is valid, and adding it to your Data Context. By the end of this guide you will have a Spark Datasource which you can use in future workflows for creating Expectations and Validating data.
Steps
1. Import necessary modules and initialize your Data Context
from ruamel import yaml
import great_expectations as gx
data_context: gx.DataContext = gx.get_context()
The great_expectations module will give you access to your Data Context, which is the entry point for working with a Great Expectations project.
The yaml module from ruamel will be used in validating your Datasource's configuration. Great Expectations will use a Python dictionary representation of your Datasource configuration when you add your Datasource to your Data Context. However, Great Expectations saves configurations as yaml files, so when you validate your configuration you will need to convert it from a Python dictionary to a yaml string first.
The Data Context that is initialized by get_context() will be the Data Context defined in your current working directory. It will provide you with convenience methods that we will use to validate your Datasource configuration and add your Datasource to your Great Expectations project once you have configured it.
2. Create a new Datasource configuration.
A new Datasource can be configured in Python as a dictionary with a specific set of keys. We will build our Datasource configuration from scratch in this guide, although you can just as easily modify an existing one.
To start, create an empty dictionary. You will be populating it with keys as you go forward.
At this point, the configuration for your Datasource is merely:
datasource_config: dict = {}
However, from this humble beginning you will be able to build a full Datasource configuration.
The keys needed for your Datasource configuration
At the top level, your Datasource's configuration will need the following keys:
- name: The name of the Datasource, which will be used to reference the Datasource in Batch Requests.
- class_name: The name of the Python class instantiated by the Datasource. Typically, this will be the Datasource class.
- module_name: The name of the module that contains the Class definition indicated by class_name.
- execution_engine: A dictionary containing the class_name and module_name of the Execution Engine instantiated by the Datasource.
- data_connectors: The configurations for any Data Connectors and their associated Data Assets that you want to have available when utilizing the Datasource.
In the following steps we will add those keys and their corresponding values to your currently empty Datasource configuration dictionary.
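For orientation, here is a rough sketch of how these keys fit together; every value below is a placeholder that the following steps will define properly, so treat this only as a preview of the structure you are about to build.
# A skeleton of the top level keys described above; all values are placeholders.
datasource_config: dict = {
    "name": "",               # Step 3: a name of your choosing
    "class_name": "",         # Step 4: typically "Datasource"
    "module_name": "",        # Step 4: typically "great_expectations.datasource"
    "execution_engine": {},   # Step 5: the Spark Execution Engine
    "data_connectors": {},    # Steps 6 through 9: one entry per Data Connector
}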
3. Name your Datasource
The first key that you will need to define for your new Datasource is its name. You will use this to reference the Datasource in future workflows. It can be anything you want it to be, but ideally you will name it something relevant to the data that it interacts with.
For the purposes of this example, we will name this Datasource:
"name": "my_datasource_name", # Preferably name it something relevant
You should, however, name your Datasource something more relevant to your data.
At this point, your configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
}
4. Specify the Datasource class and module
The class_name and module_name for your Datasource will almost always indicate the Datasource class found at great_expectations.datasource. You may replace this with a specialized subclass, or a custom class, but for almost all regular purposes these two default values will suffice. For the purposes of this guide, add those two values to their corresponding keys.
"class_name": "Datasource",
"module_name": "great_expectations.datasource"
Your full configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
}
5. Add the Spark Execution Engine to your Datasource configuration
Your Execution Engine is where you will specify that you want this Datasource to use Spark in the backend. As with the Datasource's top level configuration, you will need to provide the class_name and module_name that indicate the class definition and containing module for the Execution Engine that you will use.
For the purposes of this guide, these will consist of the SparkDFExecutionEngine found at great_expectations.execution_engine. The execution_engine key and its corresponding value will therefore look like this:
"execution_engine": {
"class_name": "SparkDFExecutionEngine",
"module_name": "great_expectations.execution_engine",
}
After adding the above snippet to your Datasource configuration, your full configuration dictionary should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SparkDFExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
}
6. Add a dictionary as the value of the data_connectors key
The data_connectors key should have a dictionary as its value. Each key/value pair in this dictionary will correspond to a Data Connector's name and configuration, respectively.
The keys in the data_connectors dictionary will be the names of the Data Connectors, which you will use to indicate which Data Connector to use in future workflows. As with the value of your Datasource's name key, you can use any value you want for a Data Connector's name. Ideally, you will use something relevant to the data that each particular Data Connector will provide; the only significant difference is that for Data Connectors the name is its key in the data_connectors dictionary.
The values for each of your data_connectors keys will be the Data Connector configurations that correspond to each Data Connector's name. You may define multiple Data Connectors in the data_connectors dictionary by including multiple key/value pairs.
For now, start by adding an empty dictionary as the value of the data_connectors key. We will begin populating it with Data Connector configurations in the next step.
Your current configuration should look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SparkDFExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {},
}
7. Configure your individual Data Connectors
For each Data Connector configuration, you will need to specify which type of Data Connector you will be using. When using Spark to work with data in a file system, the most likely ones will be the InferredAssetFilesystemDataConnector, the ConfiguredAssetFilesystemDataConnector, and the RuntimeDataConnector.
If you are working with Spark but not working with a file system, please see our cloud specific guides for more information.
If you are uncertain which Data Connector best suits your needs, please refer to our guide on how to choose which Data Connector to use.
Data Connector example configurations:
- InferredAssetFilesystemDataConnector
- ConfiguredAssetDataConnector
- RuntimeDataConnector
The InferredAssetDataConnector is ideal for:
- quickly setting up a Datasource and getting access to data
- diving straight in to working with Great Expectations
- initial data discovery and introspection
However, the InferredAssetDataConnector allows less control over the definitions of your Data Assets than the ConfiguredAssetDataConnector provides. If you are at the point of building a repeatable workflow, we encourage using the ConfiguredAssetDataConnector instead.
Remember, the key that you provide for each Data Connector configuration dictionary will be used as the name of the Data Connector. For this example, we will use the name name_of_my_inferred_data_connector, but you may have it be anything you like.
At this point, your configuration should look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SparkDFExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {"name_of_my_inferred_data_connector": {}},
}
When defining an InferredAssetFilesystemDataConnector you will need to provide values for four keys in the Data Connector's configuration dictionary (the currently empty dictionary that corresponds to "name_of_my_inferred_data_connector" in the example above). These key/value pairs consist of:
- class_name: The name of the Class that will be instantiated for this DataConnector.
- base_directory: The string representation of the directory that contains your filesystem data.
- default_regex: A dictionary that describes how the data should be grouped into Batches.
- batch_spec_passthrough: A dictionary of values that are passed to the Execution Engine's backend.
Additionally, you may optionally choose to define:
- glob_directive: A glob pattern that can be used to access source data files contained in subfolders of your base_directory. If this is not defined, the default value of * will cause your Data Connector to only look at files in the base_directory itself.
For this example, you will be using the InferredAssetFilesystemDataConnector as your class_name. This is a subclass of the InferredAssetDataConnector that is specialized to support filesystem Execution Engines, such as the SparkDFExecutionEngine. This key/value entry will therefore look like:
"class_name": "InferredAssetFilesystemDataConnector",
Because we are using one of Great Expectations' built-in Data Connectors, an entry for module_name along with a default value will be provided when this Data Connector is initialized. However, if you want to use a custom Data Connector, you will need to explicitly add a module_name key alongside the class_name key. The value for module_name would then be set as the import path for the module containing your custom Data Connector, in the same fashion as you would provide class_name and module_name for a custom Datasource or Execution Engine.
For the base directory, you will want to put the relative path of your data from the folder that contains your Data Context. In this example we will use the same path that was used in the Getting Started Tutorial, Step 2: Connect to Data. Since we are manually entering this value rather than letting the CLI generate it, the key/value pair will look like:
"base_directory": "../data",
With these values added, along with blank dictionaries for default_regex (we will define it in the next step) and batch_spec_passthrough, your full configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SparkDFExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_inferred_data_connector": {
"class_name": "InferredAssetFilesystemDataConnector",
"base_directory": "../data",
"default_regex": {},
"batch_spec_passthrough": {},
}
},
}
glob_directive
The glob_directive parameter is provided to give the DataConnector information about the directory structure to expect when identifying source data files to check against each Data Asset's default_regex. If you do not specify a value for glob_directive, a default value of "*" will be used. This will cause your Data Asset to check all files in the folder specified by base_directory to determine which should be returned as Batches for the Data Asset, but will ignore any files in subdirectories.
Overriding the glob_directive by providing your own value will allow your Data Connector to traverse subdirectories or otherwise alter which source data files are compared against your Data Connector's default_regex.
For example, assume your source data is in files contained by subdirectories of your base_directory, like so:
- 2019/yellow_taxidata_2019_01.csv
- 2020/yellow_taxidata_2020_01.csv
- 2021/yellow_taxidata_2021_01.csv
- 2022/yellow_taxidata_2022_01.csv
To include all of these files, you would need to tell the Data Connector to look for files that are nested one level deeper than the base_directory itself. You would do this by setting the glob_directive key in your Data Connector config to a value of "*/*". This value will cause the Data Connector to look for regex matches against the file names for all files found in any subfolder of your base_directory. Such an entry would look like:
"glob_directive": "*.*"
The glob_directive parameter uses glob-style matching rather than full regular expressions. You can also use it to limit the files that will be compared against the Data Connector's default_regex for a match. For example, to only permit .csv files to be checked for a match, you could specify the glob_directive as "*.csv". To only check for matches against the .csv files in subdirectories, you would use the value "*/*.csv", and so forth.
In this guide's examples, all of our data is assumed to be in the base_directory folder. Therefore, you will not need to add an entry for glob_directive to your configuration. However, if you were to include the example glob_directive from above, your full configuration would currently look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SparkDFExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_inferred_data_connector": {
"class_name": "InferredAssetFilesystemDataConnector",
"base_directory": "../data",
"glob_directive": "*/*",
"default_regex": {},
"batch_spec_passthrough": {},
}
},
}
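If you want to preview which files a given glob_directive will pick up before committing to it, Python's built-in pathlib performs roughly the same glob-style matching. This is only a local sanity check, not part of the Datasource configuration, and it assumes the example base_directory and year subfolders shown above.
from pathlib import Path

# Files that a glob_directive of "*/*" would consider, relative to the
# Data Connector's base_directory (assumed here to be "../data").
base_directory = Path("../data")
print(sorted(base_directory.glob("*/*")))

# Narrowing the directive to "*/*.csv" restricts matching to .csv files
# that sit one level below the base_directory.
print(sorted(base_directory.glob("*/*.csv")))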
A ConfiguredAssetDataConnector enables the most fine-tuning, allowing you to easily work with multiple Batches. It also requires an explicit listing of each Data Asset you connect to and how Batches are defined within that Data Asset, which makes it very clear what Data Assets are being provided when you reference it in Profilers, Batch Requests, or Checkpoints.
Remember, the key that you provide for each Data Connector configuration dictionary will be used as the name of the Data Connector. For this example, we will use the name name_of_my_configured_data_connector, but you may have it be anything you like.
At this point, your configuration should look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SparkDFExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {"name_of_my_configured_data_connector": {}},
}
When defining a ConfiguredAssetFilesystemDataConnector you will need to provide values for four keys in the Data Connector's configuration dictionary (the currently empty dictionary that corresponds to "name_of_my_configured_data_connector" in the example above). These key/value pairs consist of:
- class_name: The name of the Class that will be instantiated for this DataConnector.
- base_directory: The string representation of the directory that contains your filesystem data.
- assets: A dictionary in which each entry explicitly defines a Data Asset and describes how its data should be grouped into Batches.
- batch_spec_passthrough: A dictionary of values that are passed to the Execution Engine's backend.
For this example, you will be using the ConfiguredAssetFilesystemDataConnector as your class_name. This is a subclass of the ConfiguredAssetDataConnector that is specialized to support filesystem Execution Engines, such as the SparkDFExecutionEngine. This key/value entry will therefore look like:
"class_name": "ConfiguredAssetFilesystemDataConnector",
Because we are using one of Great Expectations' built-in Data Connectors, an entry for module_name along with a default value will be provided when this Data Connector is initialized. However, if you want to use a custom Data Connector, you will need to explicitly add a module_name key alongside the class_name key. The value for module_name would then be set as the import path for the module containing your custom Data Connector, in the same fashion as you would provide class_name and module_name for a custom Datasource or Execution Engine.
For the base directory, you will want to put the relative path of your data from the folder that contains your Data Context. In this example we will use the same path that was used in the Getting Started Tutorial, Step 2: Connect to Data. Since we are manually entering this value rather than letting the CLI generate it, the key/value pair will look like:
"base_directory": "../data",
With these values added, along with blank dictionaries for assets and batch_spec_passthrough, your full configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SparkDFExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_configured_data_connector": {
"class_name": "ConfiguredAssetFilesystemDataConnector",
"base_directory": "../data",
"assets": {},
"batch_spec_passthrough": {},
}
},
}
A RuntimeDataConnector is used to connect to an in-memory dataframe or path. The dataframe or path used for a RuntimeDataConnector is therefore passed to the RuntimeDataConnector as part of a Batch Request, rather than being a static part of the RuntimeDataConnector's configuration.
A Runtime Data Connector will only ever return one Batch of data: the current data that was passed in or specified as part of a Batch Request. This means that a RuntimeDataConnector does not define Data Assets like an InferredAssetDataConnector or a ConfiguredAssetDataConnector would.
Instead, a Runtime Data Connector's configuration provides a way for you to attach identifying values to a returned Batch of data so that the data as it was at the time it was returned can be referred to again in the future.
For more information on configuring a Batch Request for a Runtime Data Connector, please see our guide on how to create a Batch of data from an in-memory Spark or Pandas dataframe or path.
Remember, the key that you provide for each Data Connector configuration dictionary will be used as the name of the Data Connector. For this example, we will use the name name_of_my_runtime_data_connector, but you may have it be anything you like.
At this point, your configuration should look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SparkDFExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {"name_of_my_runtime_data_connector": {}},
}
When defining a RuntimeDataConnector you will need to provide values for two keys in the Data Connector's configuration dictionary (the currently empty dictionary that corresponds to "name_of_my_runtime_data_connector" in the example above). These key/value pairs consist of:
- class_name: The name of the Class that will be instantiated for this DataConnector.
- batch_identifiers: A list of strings that will be used as keys for identifying metadata that the user provides for the returned Batch.
For this example, you will be using the RuntimeDataConnector as your class_name. This key/value entry will therefore look like:
"class_name": "RuntimeDataConnector",
After including an empty list for your batch_identifiers and an empty dictionary for batch_spec_passthrough, your full configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SparkDFExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_runtime_data_connector": {
"class_name": "RuntimeDataConnector",
"batch_spec_passthrough": {},
"batch_identifiers": [],
}
},
}
Because we are using one of Great Expectations' built-in Data Connectors, an entry for module_name along with a default value will be provided when this Data Connector is initialized. However, if you want to use a custom Data Connector, you will need to explicitly add a module_name key alongside the class_name key. The value for module_name would then be set as the import path for the module containing your custom Data Connector, in the same fashion as you would provide class_name and module_name for a custom Datasource or Execution Engine.
8. Configure the values for batch_spec_passthrough
The parameter batch_spec_passthrough is used to access some native capabilities of your Execution Engine. If you do not specify it, your Execution Engine will attempt to determine the values based off of file extensions and defaults. If you do define it, it will contain two keys: reader_method and reader_options. These will correspond to a string and a dictionary, respectively.
"batch_spec_passthrough": {
"reader_method": "",
"reader_options": {},
Configuring your reader_method:
The reader_method is used to specify which of Spark's spark.read.* methods will be used to read your data. For our example, we are using .csv files as our source data, so we will specify the csv method of spark.read as our reader_method, like so:
"reader_method": "csv",
Configuring your reader_options:
Start by adding a blank dictionary as the value of the reader_options parameter. This dictionary will hold two key/value pairs: header and inferSchema.
"reader_options": {
"header": "",
"inferSchema": "",
},
The first key is header, and the value should be either True or False. This will indicate to the Data Connector whether or not the first row of each source data file is a header row. For our example, we will set this to True.
"header": True,
The second key to include is inferSchema. Again, the value should be either True or False. This will indicate to the Data Connector whether or not the Execution Engine should attempt to infer the data type contained by each column in the source data files. Again, we will set this to True for the purpose of this guide's example.
"inferSchema": True,
- inferSchema will read datetime columns in as text columns.
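For context, these reader settings are ultimately handed to Spark's own reader. A rough, hypothetical equivalent of what the Execution Engine ends up doing is shown below; the file path is purely illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Roughly what a reader_method of "csv" with these reader_options translates to:
df = spark.read.options(header=True, inferSchema=True).csv(
    "../data/yellow_tripdata_sample_2020-01.csv"  # illustrative path
)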
At this point, your batch_spec_passthrough configuration should look like:
"batch_spec_passthrough": {
"reader_method": "csv",
"reader_options": {
"header": True,
"inferSchema": True,
},
},
And your full configuration will look like:
- InferredAssetFilesystemDataConnector
- ConfiguredAssetDataConnector
- RuntimeDataConnector
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SparkDFExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_inferred_data_connector": {
"class_name": "InferredAssetFilesystemDataConnector",
"base_directory": "../data",
"default_regex": {},
"batch_spec_passthrough": {
"reader_method": "csv",
"reader_options": {
"header": True,
"inferSchema": True,
},
},
}
},
}
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SparkDFExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_configured_data_connector": {
"class_name": "ConfiguredAssetFilesystemDataConnector",
"base_directory": "../data",
"assets": {},
"batch_spec_passthrough": {
"reader_method": "csv",
"reader_options": {
"header": True,
"inferSchema": True,
},
},
}
},
}
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SparkDFExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_runtime_data_connector": {
"class_name": "RuntimeDataConnector",
"batch_spec_passthrough": {
"reader_method": "csv",
"reader_options": {
"header": True,
"inferSchema": True,
},
},
"batch_identifiers": [],
}
},
}
9. Configure your Data Connector's Data Assets
- InferredAssetFilesystemDataConnector
- ConfiguredAssetDataConnector
- RuntimeDataConnector
In an Inferred Asset Data Connector for filesystem data, a regular expression is used to group the files into Batches for a Data Asset. This is done with the value we will define for the Data Connector's default_regex key. The value for this key will consist of a dictionary that contains two values:
- pattern: This is the regex pattern that will define your Data Asset's potential Batch or Batches.
- group_names: This is a list of names that correspond to the groups you defined in pattern's regular expression.
The pattern in default_regex will be matched against the files in your base_directory, and everything that matches against the first group in your regex will become a Batch in a Data Asset that possesses the name of the matching text. Any files that have a matching string for the first group will become Batches in the same Data Asset.
This means that when configuring your Data Connector's regular expression, you have the option to implement it so that the Data Connector is only capable of returning a single Batch per Data Asset, or so that it is capable of returning multiple Batches grouped into individual Data Assets. Each type of configuration is useful in certain cases, so we will provide examples of both.
If you are uncertain as to which type of configuration is best for your use case, please refer to our guide on how to choose between working with a single or multiple Batches of data.
- Single Batch Configuration
- Multi-Batch Configuration
Because of the simple regex matching that groups files into Batches for a given Data Asset, it is actually quite straightforward to create a Data Connector which has Data Assets that are only capable of providing a single Batch. All you need to do is define a regular expression that consists of a single group corresponding to a portion of your data files' names that is unique for each file.
The simplest way to do this is to define a group that consists of the entire file name.
For this example, let's assume we have the following files in our data directory:
- yellow_tripdata_sample_2020-01.csv
- yellow_tripdata_sample_2020-02.csv
- yellow_tripdata_sample_2020-03.csv
In this case you could define the pattern key as follows:
"pattern": "(.*)\\.csv",
This regex will match the full name of any file that has the .csv extension, and will put everything prior to the .csv extension into a group. Since each .csv file will necessarily have a unique name preceding its extension, the content that matches this pattern will be unique for each file. This will ensure that only one file is included as a Batch for each Data Asset.
To correspond to the single group that was defined in your regex, you will define a single entry in the list for the group_names key. Since the first group in an Inferred Asset Data Connector is used to generate names for the inferred Data Assets, you should name that group as follows:
"group_names": ["data_asset_name"],
Looking back at our sample files, this regex will result in the InferredAssetFilesystemDataConnector providing three Data Assets, which can be accessed by the portion of the file name that matches the first group in our regex. In future workflows you will be able to refer to one of these Data Assets in a Batch Request by providing one of the following data_asset_names:
- yellow_tripdata_sample_2020-01
- yellow_tripdata_sample_2020-02
- yellow_tripdata_sample_2020-03
Since we did not include .csv in the first group of the regex we defined, the .csv portion of the filename will be dropped from the value that is recognized as a valid data_asset_name.
With all of these values put together into a single dictionary, your Data Connector configuration will look like this:
"name_of_my_inferred_data_connector": {
"class_name": "InferredAssetFilesystemDataConnector",
"base_directory": "../data",
"default_regex": {
"pattern": "(.*)\\.csv",
"group_names": ["data_asset_name"],
},
}
And the full configuration for your Datasource should look like:
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SparkDFExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_inferred_data_connector": {
"class_name": "InferredAssetFilesystemDataConnector",
"base_directory": "../data",
"default_regex": {
"pattern": "(.*)\\.csv",
"group_names": ["data_asset_name"],
},
"batch_spec_passthrough": {
"reader_method": "csv",
"reader_options": {
"header": True,
"inferSchema": True,
},
},
}
},
}
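As a brief illustration of how one of these inferred Data Assets could be referenced later, here is a hedged sketch of a Batch Request; the BatchRequest import reflects the Batch Request API at the time of writing, and all of the names used are the example names from this guide.
from great_expectations.core.batch import BatchRequest

# A sketch of a Batch Request that targets one of the inferred Data Assets.
batch_request = BatchRequest(
    datasource_name="my_datasource_name",
    data_connector_name="name_of_my_inferred_data_connector",
    data_asset_name="yellow_tripdata_sample_2020-01",
)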
Configuring an InferredAssetFilesystemDataConnector so that its Data Assets are capable of returning more than one Batch is just a matter of defining an appropriate regular expression. For this kind of configuration, the regular expression you define should have two or more groups.
The first group will be treated as the Data Asset's name. It should be a portion of your file names that occurs in more than one file. The files that match this portion of the regular expression will be grouped together as a single Data Asset.
Any additional groups that you include in your regular expression will be used to identify specific Batches among those that are grouped together in each Data Asset.
For this example, let's assume you have the following files in your data directory:
- yellow_tripdata_sample_2020-01.csv
- yellow_tripdata_sample_2020-02.csv
- yellow_tripdata_sample_2020-03.csv
You can configure a Data Asset that groups these files together and differentiates each Batch by month by defining a pattern in the dictionary for the default_regex key:
"pattern": "(yellow_tripdata_sample_2020)-(\\d.*)\\.csv",
This regex will group together all files that match the content of the first group as a single Data Asset. Since the first group does not include any special regex characters, this means that all of the .csv files that start with yellow_tripdata_sample_2020 will be combined into one Data Asset, and that all other files will be ignored.
The second defined group consists of the numeric characters after the last dash in a file name and prior to the .csv extension. Specifying a value for that group in your future Batch Requests will allow you to request a specific Batch from the Data Asset.
Since you have defined two groups in your regex, you will need to provide two corresponding group names in your group_names key. Since the first group in an Inferred Asset Data Connector is used to generate the names for the inferred Data Assets provided by the Data Connector, and the second group you defined corresponds to the month of data that each file contains, you should name those groups as follows:
"group_names": ["data_asset_name", "month"],
Looking back at our sample files, this regex will result in the InferredAssetFilesystemDataConnector providing a single Data Asset, which will contain three Batches. In future workflows you will be able to refer to a specific Batch in this Data Asset in a Batch Request by providing the data_asset_name of "yellow_tripdata_sample_2020" and one of the following month values:
- 01
- 02
- 03
Any characters that are not included in a group when you define your regex will still be checked for when determining if a file name "matches" the regular expression. However, those characters will not be included in any of the Batch Identifiers, which is why the - and .csv portions of the filenames are not found in either the data_asset_name or month values.
For more information on the special characters and mechanics of matching and grouping strings with regular expressions, please see the Python documentation on the re module.
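If you would like to sanity check how a candidate pattern groups your file names before adding it to the configuration, the re module mentioned above is all you need. The snippet below is only a local experiment using the example file names from this guide; it is not part of the Datasource configuration.
import re

pattern = r"(yellow_tripdata_sample_2020)-(\d.*)\.csv"
filenames = [
    "yellow_tripdata_sample_2020-01.csv",
    "yellow_tripdata_sample_2020-02.csv",
    "yellow_tripdata_sample_2020-03.csv",
]

for filename in filenames:
    match = re.match(pattern, filename)
    if match:
        # group(1) becomes the data_asset_name; group(2) becomes the "month" identifier.
        print(match.group(1), match.group(2))
# yellow_tripdata_sample_2020 01
# yellow_tripdata_sample_2020 02
# yellow_tripdata_sample_2020 03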
With all of these values put together into a single dictionary, your Data Connector configuration will look like this:
"name_of_my_inferred_data_connector": {
"class_name": "InferredAssetFilesystemDataConnector",
"base_directory": "../data",
"default_regex": {
"pattern": "(yellow_tripdata_sample_2020)-(\\d.*)\\.csv",
"group_names": ["data_asset_name", "month"],
},
}
And the full configuration for your Datasource should look like:
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SparkDFExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_inferred_data_connector": {
"class_name": "InferredAssetFilesystemDataConnector",
"base_directory": "../data",
"default_regex": {
"pattern": "(yellow_tripdata_sample_2020)-(\\d.*)\\.csv",
"group_names": ["data_asset_name", "month"],
},
"batch_spec_passthrough": {
"reader_method": "csv",
"reader_options": {
"header": True,
"inferSchema": True,
},
},
}
},
}
In a Configured Asset Data Connector for filesystem data, each entry in the assets dictionary will correspond to an explicitly defined Data Asset. The key provided will be used as the name of the Data Asset, while the value will be a dictionary that contains two additional keys:
- pattern: This is the regex pattern that will define your Data Asset's potential Batch or Batches.
- group_names: This is a list of names that correspond to the groups you defined in pattern's regular expression.
The pattern in each assets entry will be matched against the files in your base_directory, and everything that matches against the pattern's value will become a Batch in a Data Asset with a name matching the key for this entry in the assets dictionary.
This means that when configuring your Data Connector's regular expression, you have the option to implement it so that the Data Connector is only capable of returning a single Batch per Data Asset, or so that it is capable of returning multiple Batches grouped into individual Data Assets. Each type of configuration is useful in certain cases, so we will provide examples of both.
If you are uncertain as to which type of configuration is best for your use case, please refer to our guide on how to choose between working with a single or multiple Batches of data.
- Single Batch Configuration
- Multi-Batch Configuration
Because you are explicitly defining each Data Asset in a ConfiguredAssetDataConnector, it is very easy to define one that will only have one Batch. The simplest way to do this is to define a Data Asset with a pattern value that does not contain any regex special characters which would match on more than one value.
For this example, let's assume we have the following files in our data directory:
- yellow_tripdata_sample_2020-01.csv
- yellow_tripdata_sample_2020-02.csv
- yellow_tripdata_sample_2020-03.csv
In this case, we want to define a single Data Asset for each month. To do so, we will need an entry in the assets dictionary for each month: one for each Data Asset we want to create.
Let's walk through the creation of the Data Asset for January's data.
First, you need to add an empty dictionary entry into the assets dictionary. Since the key you associate with this entry will be treated as the Data Asset's name, go ahead and name it yellow_tripdata_jan.
At this point, your entry in the assets dictionary will look like:
"yellow_tripdata_jan": {}
Next, you will need to define the pattern value and group_names value for this Data Asset.
Since you want this Data Asset to only match the file yellow_tripdata_sample_2020-01.csv, the value for the pattern key should be one that does not contain any regex special characters that can match on more than one value. An example follows:
"pattern": "yellow_tripdata_sample_2020-(01)\\.csv",
The pattern we defined contains a regex group, even though we logically don't need a group to identify the desired Batch in a Data Asset that can only return one Batch. This is because Great Expectations currently does not permit pattern to be defined without also having group_names defined. Thus, in the example above you are creating a group that corresponds to 01 so that there is a valid group to associate a group_names entry with.
Since none of the characters in this regex can possibly match more than one value, the only file that can possibly be matched is the one you want it to match: yellow_tripdata_sample_2020-01.csv. This Batch will also be associated with the Batch Identifier 01, but you won't need to use that to specify the Batch in a Batch Request as it is the only Batch that this Data Asset is capable of returning.
To correspond to the single group that was defined in your regex, you will define a single entry in the list for the group_names key. Since the assets dictionary key is used for this Data Asset's name, you can give this group a name relevant to what it is matching on:
"group_names": ["month"],
Put entirely together, your assets entry will look like:
"yellow_tripdata_jan": {
"pattern": "yellow_tripdata_sample_2020-(01)\\.csv",
"group_names": ["month"],
}
Looking back at our sample files, this entry will result in the ConfiguredAssetFilesystemDataConnector providing one Data Asset, which can be accessed by the name yellow_tripdata_jan. In future workflows you will be able to refer to this Data Asset and its single corresponding Batch by providing that name.
With all of these values put together into a single dictionary, your Data Connector configuration will look like this:
"name_of_my_configured_data_connector": {
"class_name": "ConfiguredAssetFilesystemDataConnector",
"base_directory": "../data",
"assets": {
"yellow_tripdata_jan": {
"pattern": "yellow_tripdata_sample_2020-(01)\\.csv",
"group_names": ["month"],
}
},
}
And the full configuration for your Datasource should look like:
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SparkDFExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_configured_data_connector": {
"class_name": "ConfiguredAssetFilesystemDataConnector",
"base_directory": "../data",
"assets": {
"yellow_tripdata_jan": {
"pattern": "yellow_tripdata_sample_2020-(01)\\.csv",
"group_names": ["month"],
},
},
"batch_spec_passthrough": {
"reader_method": "csv",
"reader_options": {
"header": True,
"inferSchema": True,
},
},
}
},
}
Because Configured Data Assets require that you explicitly define each Data Asset they provide access to, you will have to add assets entries for February and March if you also want to access yellow_tripdata_sample_2020-02.csv and yellow_tripdata_sample_2020-03.csv in the same way, as sketched below.
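Those additional entries would follow the same single-Batch shape as the January example; a hypothetical sketch of the expanded assets dictionary:
"assets": {
    "yellow_tripdata_jan": {
        "pattern": "yellow_tripdata_sample_2020-(01)\\.csv",
        "group_names": ["month"],
    },
    # Hypothetical additional single-Batch Data Assets for the remaining months:
    "yellow_tripdata_feb": {
        "pattern": "yellow_tripdata_sample_2020-(02)\\.csv",
        "group_names": ["month"],
    },
    "yellow_tripdata_mar": {
        "pattern": "yellow_tripdata_sample_2020-(03)\\.csv",
        "group_names": ["month"],
    },
},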
Configuring a ConfiguredAssetFilesystemDataConnector so that its Data Assets are capable of returning more than one Batch is just a matter of defining an appropriate regular expression. For this kind of configuration, the regular expression you define should include at least one group that contains regular expression special characters capable of matching more than one value.
For this example, let's assume we have the following files in our data directory:
- yellow_tripdata_sample_2020-01.csv
- yellow_tripdata_sample_2020-02.csv
- yellow_tripdata_sample_2020-03.csv
In this case, we want to define a Data Asset that contains all of our data for the year 2020.
First, you need to add an empty dictionary entry into the assets dictionary. Since the key you associate with this entry will be treated as the Data Asset's name, go ahead and name it yellow_tripdata_2020.
At this point, your entry in the assets dictionary will look like:
"yellow_tripdata_2020": {}
Next, you will need to define the pattern value and group_names value for this Data Asset.
Since you want this Data Asset to include all of the 2020 files, the value for pattern needs to be a regular expression that is capable of matching all of the files. To do this, we will need to use regular expression special characters that are capable of matching more than one value.
Looking back at the files in our data directory, you can see that each file differs from the others only in the digits indicating the month of the file. Therefore, the regular expression we create will separate those specific characters into a group, and will define the content of that group using special characters capable of matching on any value, like so:
"pattern": "yellow_tripdata_sample_2020-(.*)\\.csv",
To correspond to the single group that was defined in your pattern, you will define a single entry in the list for the group_names key. Since the assets dictionary key is used for this Data Asset's name, you can give this group a name relevant to what it is matching on:
"group_names": ["month"],
Since the group in the above regular expression will match on any characters, this regex will successfully match on each of the file names in our data directory, and will associate each file with the month identifier that corresponds to the file's grouped characters:
- yellow_tripdata_sample_2020-01.csv will be the Batch identified by a month value of 01
- yellow_tripdata_sample_2020-02.csv will be the Batch identified by a month value of 02
- yellow_tripdata_sample_2020-03.csv will be the Batch identified by a month value of 03
Put entirely together, your assets entry will look like:
"yellow_tripdata_2020": {
"pattern": "yellow_tripdata_sample_2020-(.*)\\.csv",
"group_names": ["month"],
}
Looking back at our sample files, this entry will result in the ConfiguredAssetFilesystemDataConnector providing one Data Asset, which can be accessed by the name yellow_tripdata_2020. In future workflows you will be able to refer to this Data Asset by providing that name, and refer to a specific Batch in this Data Asset by providing your Batch Request with a batch_identifier entry using the key month and the value corresponding to the month portion of the filename of the file that corresponds to the Batch in question.
For more information on the special characters and mechanics of matching and grouping strings with regular expressions, please see the Python documentation on the re module.
With all of these values put together into a single dictionary, your Data Connector configuration will look like this:
"name_of_my_configured_data_connector": {
"class_name": "ConfiguredAssetFilesystemDataConnector",
"base_directory": "../data",
"assets": {
"yellow_tripdata_2020": {
"pattern": "yellow_tripdata_sample_2020-(.*)\\.csv",
"group_names": ["month"],
}
},
}
And the full configuration for your Datasource should look like:
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SparkDFExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_configured_data_connector": {
"class_name": "ConfiguredAssetFilesystemDataConnector",
"base_directory": "../data",
"assets": {
"yellow_tripdata_2020": {
"pattern": "yellow_tripdata_sample_2020-(.*)\\.csv",
"group_names": ["month"],
},
},
"batch_spec_passthrough": {
"reader_method": "csv",
"reader_options": {
"header": True,
"inferSchema": True,
},
},
}
},
}
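As a hedged sketch of how a specific month could later be requested from this multi-Batch Data Asset: the data_connector_query and batch_filter_parameters arguments below reflect the Batch Request API as we understand it, so verify the exact argument names against your installed version.
from great_expectations.core.batch import BatchRequest

# A sketch of a Batch Request for the February Batch of the
# "yellow_tripdata_2020" Data Asset defined above.
batch_request = BatchRequest(
    datasource_name="my_datasource_name",
    data_connector_name="name_of_my_configured_data_connector",
    data_asset_name="yellow_tripdata_2020",
    data_connector_query={"batch_filter_parameters": {"month": "02"}},
)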
Remember that when you are working with a Configured Asset Data Connector you need to explicitly define each of your Data Assets. So, if you want to add additional Data Assets, go ahead and repeat the process of defining an entry in your configuration's assets dictionary to do so.
Runtime Data Connectors put a wrapper around a single Batch of data, and therefore do not support Data Asset configurations that permit the return of more than one Batch of data. In fact, since you will use a Batch Request to pass in or specify the data that a Runtime Data Connector uses, there is no need to specify a Data Asset configuration at all.
Instead, you will provide a batch_identifiers list which will be used to attach identifying information to a returned Batch so that you can reference the same data again in the future.
For this example, let's assume we have the following files in our data directory:
- yellow_tripdata_sample_2020-01.csv
- yellow_tripdata_sample_2020-02.csv
- yellow_tripdata_sample_2020-03.csv
With a Runtime Data Connector you won't actually refer to them in your configuration! As mentioned above, you will provide the path or dataframe for one of those files to the Data Connector as part of a Batch Request.
Therefore, the file names are inconsequential to your Runtime Data Connector's configuration. In fact, the batch_identifiers that you define in your Runtime Data Connector's configuration can be completely arbitrary. However, it is advised that you name them after something meaningful regarding your data or the circumstances under which you will be accessing your data.
For instance, let's assume you are getting a daily update to your data, and so you are running daily validations. You could then choose to identify your Runtime Data Connector's Batches by the timestamp at which they are requested.
To do this, you would simply add a batch_timestamp entry in your batch_identifiers list. This would look like:
"batch_identifiers": ["batch_timestamp"]
Then, when you create your Batch Request you would populate the batch_timestamp value in its batch_identifiers dictionary with the value of the current date and time. This will attach the current date and time to the returned Batch, allowing you to reference the Batch again in the future even if the current data (the data that would be provided by the Runtime Data Connector if you requested a new Batch) had changed.
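As a hedged sketch of what such a Batch Request might look like: the RuntimeBatchRequest import reflects the runtime Batch Request API at the time of writing, and the path and data_asset_name below are purely illustrative assumptions.
import datetime

from great_expectations.core.batch import RuntimeBatchRequest

# A sketch of a runtime Batch Request that attaches the current timestamp
# as the "batch_timestamp" identifier; the path and asset name are illustrative.
batch_request = RuntimeBatchRequest(
    datasource_name="my_datasource_name",
    data_connector_name="name_of_my_runtime_data_connector",
    data_asset_name="my_runtime_asset_name",  # an arbitrary name of your choosing
    runtime_parameters={"path": "../data/yellow_tripdata_sample_2020-01.csv"},
    batch_identifiers={"batch_timestamp": datetime.datetime.now().isoformat()},
)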
The full configuration for your Datasource should now look like:
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SparkDFExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_runtime_data_connector": {
"class_name": "RuntimeDataConnector",
"batch_spec_passthrough": {
"reader_method": "csv",
"reader_options": {
"header": True,
"inferSchema": True,
},
},
"batch_identifiers": ["batch_timestamp"],
}
},
}
We stated above that the names that you use for your batch_identifiers in a Runtime Data Connector's configuration can be completely arbitrary, and will be used as keys for the batch_identifiers dictionary in future Batch Requests. However, the same holds true for the values you pass in for each key in your Batch Request's batch_identifiers! Always make sure that your Batch Requests utilizing Runtime Data Connectors are providing meaningful identifying information, consistent with the keys that are derived from the batch_identifiers you have defined in your Runtime Data Connector's configuration.
10. Test your configuration with .test_yaml_config(...)
Now that you have a full Datasource configuration, you can confirm that it is valid by testing it with the .test_yaml_config(...) method. To do this, execute the Python code:
data_context.test_yaml_config(yaml.dump(datasource_config))
When executed, test_yaml_config will instantiate the component described by the yaml configuration that is passed in and then run a self check procedure to verify that the component works as expected.
For a Datasource, this includes:
- confirming that the connection works
- gathering a list of available Data Assets
- verifying that at least one Batch can be fetched from the Datasource
For more information on the .test_yaml_config(...) method, please see our guide on how to configure DataContext components using test_yaml_config.
11. (Optional) Add more Data Connectors to your configuration
The data_connectors dictionary in your datasource_config can contain multiple entries. If you want to add additional Data Connectors, just go through the process starting at step 7 again.
12. Add your new Datasource to your Data Context
Now that you have verified that you have a valid configuration you can add your new Datasource to your Data Context with the command:
data_context.add_datasource(**datasource_config)
If the value of datasource_config["name"] corresponds to a Datasource that is already defined in your Data Context, then using the above command will overwrite the existing Datasource.
If you want to ensure that you only add a Datasource when it won't overwrite an existing one, you can use the following code instead:
# add_datasource only if it doesn't already exist in your Data Context
try:
data_context.get_datasource(datasource_config["name"])
except ValueError:
data_context.add_datasource(**datasource_config)
else:
print(
f"The datasource {datasource_config['name']} already exists in your Data Context!"
)
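If you would like to confirm that the Datasource is now registered, a quick check is to list the Datasources known to your Data Context; the exact shape of the returned entries may vary by version.
# Confirm that the new Datasource now appears in your Data Context.
print([datasource["name"] for datasource in data_context.list_datasources()])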
Next Steps
Congratulations! You have fully configured a Datasource and verified that it can be used in future workflows to provide a Batch or Batches of data.
For more information on using Batch Requests to retrieve data, please see our guide on how to get one or more Batches of data from a configured Datasource.
You can now move forward and create Expectations for your Datasource.