How to configure a DataConnector to introspect and partition a file system or blob store
This guide will help you introspect and partition any file-type data store (e.g., a filesystem or cloud blob storage) using a Data Connector.
For background on connecting to different backends, please see the Datasource-specific guides in the "Connecting to your data" section.
File-based introspection and partitioning are useful for:
- Exploring the types, subdirectory location, and filepath naming structures of the files in your dataset, and
- Organizing the discovered files into Data Assets according to the identified structures.
Partitioning enables you to select the desired subsets of your dataset for Validation.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- Configured and loaded a Data Context
- Configured a Datasource and Data Connector
We will use the "Yellow Taxi" dataset to walk you through the configuration of Data Connectors. Starting with the bare-bones version of either an Inferred Asset Data Connector or a Configured Asset Data Connector, we gradually build out the configuration until your files are introspected with semantics consistent with your goals.
To learn more about Datasources, Data Connectors, and Batches, please see our Datasources Core Concepts Guide in the Core Concepts reference guide.
Preliminary Steps
1. Instantiate your project's DataContext
Import Great Expectations, along with a YAML parser used later to load the Datasource configuration (the yaml.load() call further below assumes ruamel's yaml, which these guides commonly use):
import great_expectations as ge
from ruamel import yaml
2. Obtain DataContext
Load your DataContext into memory using the get_context() method.
context = ge.get_context()
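To confirm that the intended project was loaded, you can print the context's root directory (a quick sanity check; root_directory is a property of the file-backed DataContext and points at the directory containing great_expectations.yml):
# Verify that the DataContext was loaded from the expected project directory.
print(context.root_directory)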
Configuring Inferred Asset Data Connector and Configured Asset Data Connector
The steps below are shown first for an Inferred Asset Data Connector and then for a Configured Asset Data Connector.
Inferred Asset Data Connector
1. Configure your Datasource
Start with an elementary Datasource configuration, containing just one general Inferred Asset Data Connector component:
datasource_yaml = f"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: PandasExecutionEngine
data_connectors:
  default_inferred_data_connector_name:
    class_name: InferredAssetFilesystemDataConnector
    base_directory: <PATH_TO_YOUR_DATA_HERE>
    glob_directive: "*.csv"
    default_regex:
      pattern: (.*)
      group_names:
        - data_asset_name
"""
Using the above example configuration, add in the path to a directory that contains your data. Then run this code to test your configuration:
context.test_yaml_config(datasource_yaml)
Given that the glob_directive in the example configuration is *.csv, if you specified a directory containing CSV files, then you will see them listed as Available data_asset_names in the output of test_yaml_config(). Feel free to adjust your configuration and re-run test_yaml_config() to experiment as pertinent to your case.
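If the files you expect do not appear, it can help to verify the glob_directive outside of Great Expectations first. The following sketch uses only the Python standard library to list the files that the same "*.csv" glob would match (the directory path is a placeholder you must fill in, just as in the YAML above):
from pathlib import Path

# Replace the placeholder with the same path used for base_directory above.
base_directory = Path("<PATH_TO_YOUR_DATA_HERE>")
# List the files that the "*.csv" glob_directive would match.
for csv_file in sorted(base_directory.glob("*.csv")):
    print(csv_file.name)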
An integral part of the recommended approach, illustrated as part of this exercise, is the use of the internal Great Expectations utility
context.test_yaml_config(
    yaml_string, pretty_print: bool = True,
    return_mode: str = "instantiated_class",
    shorten_tracebacks: bool = False,
)
to ensure the correctness of the proposed YAML configuration prior to incorporating it and trying to use it.
For instance, try the following erroneous DataConnector configuration as part of your Datasource (you can paste it directly underneath -- or instead of -- the default_inferred_data_connector_name configuration section):
buggy_data_connector_yaml = f"""
buggy_inferred_data_connector_name:
  class_name: InferredAssetFilesystemDataConnector
  base_directory: <PATH_TO_YOUR_DATA_HERE>
  glob_directive: "*.csv"
  default_regex:
    pattern: (.*)
    group_names: # the required "data_asset_name" reserved group name for "InferredAssetFilePathDataConnector" is absent
      - nonexistent_group_name
"""
Then add in the path to a directory that contains your data, and again run this code to test your configuration:
context.test_yaml_config(datasource_yaml)
Notice that the output reports only one data_asset_name, called DEFAULT_ASSET_NAME, signaling a misconfiguration.
Now try another erroneous DataConnector configuration as part of your Datasource (you can paste it directly underneath -- or instead of -- your existing DataConnector configuration sections):
another_buggy_data_connector_yaml = f"""
buggy_inferred_data_connector_name:
  class_name: InferredAssetFilesystemDataConnector
  base_directory: <PATH_TO_BAD_DATA_DIRECTORY_HERE>
  glob_directive: "*.csv"
  default_regex:
    pattern: (.*)
    group_names:
      - data_asset_name
"""
where you would add in the path to a directory that does not exist; then run this code again to test your configuration:
context.test_yaml_config(datasource_yaml)
You will see that the list of Data Assets is empty. Feel free to experiment with the arguments to
context.test_yaml_config(
    yaml_string, pretty_print: bool = True,
    return_mode: str = "instantiated_class",
    shorten_tracebacks: bool = False,
)
For instance, running
context.test_yaml_config(yaml_string, return_mode="report_object")
will return the same information that appears in standard output, converted to a Python dictionary.
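A minimal sketch of using the report programmatically (the exact keys of the report dictionary can vary across Great Expectations versions, so inspect its structure before relying on specific entries):
# Obtain the report as a Python dictionary rather than pretty-printed output.
report = context.test_yaml_config(datasource_yaml, return_mode="report_object")
# Inspect the top-level keys before depending on any particular entry.
print(report.keys())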
Any structural errors (e.g., indentation, typos in class and configuration key names, etc.) will result in an exception being raised and sent to standard error. Running
context.test_yaml_config(yaml_string, shorten_tracebacks=True)
shortens the traceback to show the line numbers where the exception occurred -- most likely caused by the failure of the required class (in this case InferredAssetFilesystemDataConnector) to be successfully instantiated.
2. Save the Datasource configuration to your DataContext
Once the basic Datasource configuration is error-free and satisfies your requirements, save it into your DataContext by using the add_datasource() function.
context.add_datasource(**yaml.load(datasource_yaml))
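As a quick sanity check, you can confirm that the Datasource was persisted; list_datasources() returns the configurations of all saved Datasources:
# Confirm that "taxi_datasource" now appears among the saved Datasources.
print([datasource["name"] for datasource in context.list_datasources()])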
3. Get names of available Data Assets
Getting the names of available Data Assets using an Inferred Asset Data Connector affords you visibility into the types and naming structures of the files in your filesystem or blob storage:
available_data_asset_names = context.datasources[
"taxi_datasource"
].get_available_data_asset_names(
data_connector_names="default_inferred_data_connector_name"
)[
"default_inferred_data_connector_name"
]
assert len(available_data_asset_names) == 36
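The assertion above reflects the 36 files in the example taxi dataset; for your own data the count will differ. To eyeball the discovered names, you could print a few of them:
# Print a handful of the discovered Data Asset names for inspection.
print(sorted(available_data_asset_names)[:5])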
Configured Asset Data Connector
1. Add Configured Asset Data Connector to your Datasource
Set up the bare-bones Configured Asset Data Connector to gradually apply structure to the discovered assets and partition them according to this structure. To begin, add the following configured_data_connector_name section to your Datasource configuration (please feel free to change the name as you deem appropriate for your use case):
datasource_yaml = f"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: PandasExecutionEngine
data_connectors:
  configured_data_connector_name:
    class_name: ConfiguredAssetFilesystemDataConnector
    base_directory: <PATH_TO_YOUR_DATA_HERE>
    glob_directive: "*.csv"
    default_regex:
      pattern: (.*)
      group_names:
        - data_asset_name
    assets: {{}}
"""
Now run this code to test your configuration:
context.test_yaml_config(datasource_yaml)
The message Available data_asset_names (0 of 0), corresponding to the configured_data_connector_name Data Connector, should appear in standard output, correctly reflecting the fact that the assets section of the configuration is empty.
2. Add a Data Asset for Configured Asset Data Connector to partition only by file name and type
You can employ a Data Asset that reflects a relatively general file structure (e.g., taxi_data_flat in the example configuration) to represent files in a directory whose names begin with a certain prefix (e.g., yellow_tripdata_sample_) and whose contents are of the desired type (e.g., CSV).
configured_data_connector_yaml = f"""
configured_data_connector_name:
  class_name: ConfiguredAssetFilesystemDataConnector
  base_directory: <PATH_TO_YOUR_DATA_HERE>
  glob_directive: "*.csv"
  default_regex:
    pattern: (.*)
    group_names:
      - data_asset_name
  assets:
    taxi_data_flat:
      base_directory: samples_2020
      pattern: (yellow_tripdata_sample_.+)\\.csv
      group_names:
        - filename
"""
After incorporating this Data Connector section into your datasource_yaml, run test_yaml_config() again as part of evolving and testing the components of your Great Expectations YAML configuration:
context.test_yaml_config(datasource_yaml)
Verify that exactly one Data Asset is reported for the configured_data_connector_name Data Connector and that the structure of the file names corresponding to the identified Data Asset, taxi_data_flat, is consistent with the regular expression pattern specified in the configuration for this Data Asset.
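To convince yourself that a pattern captures what you intend before wiring it into the YAML, you can test it against a sample file name with Python's re module (a standalone sketch; the file name below is illustrative):
import re

# The taxi_data_flat pattern, unescaped for direct use in Python.
pattern = r"(yellow_tripdata_sample_.+)\.csv"
match = re.match(pattern, "yellow_tripdata_sample_2020-01.csv")
# The single capture group supplies the "filename" component of the Batch identity.
assert match is not None and match.group(1) == "yellow_tripdata_sample_2020-01"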
3. Add a Data Asset for Configured Asset Data Connector to partition by year and month
In recognition of a finer observed file path structure, you can refine the partitioning strategy. For instance, the taxi_data_year_month asset in the following example configuration identifies three parts of a file path: name (as in "company name"), year, and month:
configured_data_connector_yaml = f"""
configured_data_connector_name:
  class_name: ConfiguredAssetFilesystemDataConnector
  base_directory: <PATH_TO_YOUR_DATA_HERE>
  glob_directive: "*.csv"
  default_regex:
    pattern: (.*)
    group_names:
      - data_asset_name
  assets:
    taxi_data_flat:
      base_directory: samples_2020
      pattern: (yellow_tripdata_sample_.+)\\.csv
      group_names:
        - filename
    taxi_data_year_month:
      base_directory: samples_2020
      pattern: ([\\w]+)_tripdata_sample_(\\d{{4}})-(\\d{{2}})\\.csv
      group_names:
        - name
        - year
        - month
"""
and run
context.test_yaml_config(datasource_yaml)
Verify that two Data Assets (taxi_data_flat and taxi_data_year_month) are now reported for the configured_data_connector_name Data Connector and that the structures of the file names corresponding to the two identified Data Assets are consistent with the regular expression patterns specified in the configuration for these Data Assets.
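The same kind of standalone check applies to the finer-grained pattern (again a sketch with an illustrative file name; note that the doubled braces in the YAML above are only f-string escapes, so the actual regex uses single braces):
import re

# The taxi_data_year_month pattern as an ordinary Python regex.
pattern = r"([\w]+)_tripdata_sample_(\d{4})-(\d{2})\.csv"
match = re.match(pattern, "yellow_tripdata_sample_2020-01.csv")
# The three capture groups map to the "name", "year", and "month" group_names.
assert match is not None and match.groups() == ("yellow", "2020", "01")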
This partitioning affords a rich set of filtering capabilities ranging from specifying the exact values of the file name structure's components to allowing Python functions for implementing custom criteria.
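For example, a Batch Request can use a data_connector_query to select only the Batches whose year and month have given values (a sketch based on the Batch Request API; adjust the names to match your configuration):
from great_expectations.core.batch import BatchRequest

# Request only the January 2020 Batch of the taxi_data_year_month Data Asset.
batch_request = BatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="configured_data_connector_name",
    data_asset_name="taxi_data_year_month",
    data_connector_query={"batch_filter_parameters": {"year": "2020", "month": "01"}},
)
batch_list = context.get_batch_list(batch_request=batch_request)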
Finally, once your Data Connector configuration satisfies your requirements, save the enclosing Datasource into your DataContext using
context.add_datasource(**yaml.load(datasource_yaml))
Consult the How to get a Batch of data from a configured Datasource guide for examples of the considerable flexibility in querying Batch objects along the different dimensions materialized as a result of partitioning the dataset as specified by the taxi_data_flat and taxi_data_year_month Data Assets.
To view the full scripts used in this page, see them on GitHub.