How to configure a DataConnector for splitting and sampling a file system or blob store
This guide will help you configure Splitting and Sampling for any file-based data store (e.g., filesystem, cloud blob storage) using a Configured Asset Data Connector. The same Splitting and Sampling configuration options can be readily applied to an Inferred Asset Data Connector.

The Splitting and Sampling mechanisms provided by Great Expectations serve as additional tools for partitioning your data at various levels of granularity:

- Splitting provides a means of focusing the Batch (a selection of records from a Data Asset) on the values of certain dimensions of interest in the data.
- Sampling provides a means of reducing the amount of data in the retrieved Batch to facilitate data analysis.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- Configured and loaded a Data Context
- Configured a Datasource and Data Connector
- Reviewed How to configure a DataConnector to introspect and partition a file system or blob store
This guide uses, as an example, the Data Connector named configured_data_connector_name from the following Datasource configuration. (A Data Connector provides the configuration details, based on the source data system, that a Datasource needs in order to define Data Assets.)
datasource_yaml = f"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: PandasExecutionEngine
data_connectors:
  default_inferred_data_connector_name:
    class_name: InferredAssetFilesystemDataConnector
    base_directory: <PATH_TO_YOUR_DATA_HERE>
    glob_directive: "*.csv"
    default_regex:
      pattern: (.*)
      group_names:
        - data_asset_name
  configured_data_connector_name:
    class_name: ConfiguredAssetFilesystemDataConnector
    base_directory: <PATH_TO_YOUR_DATA_HERE>
    glob_directive: "*.csv"
    default_regex:
      pattern: (.*)
      group_names:
        - data_asset_name
    assets:
      taxi_data_flat:
        base_directory: samples_2020
        pattern: (yellow_tripdata_sample_.+)\\.csv
        group_names:
          - filename
      taxi_data_year_month:
        base_directory: samples_2020
        pattern: ([\\w]+)_tripdata_sample_(\\d{{4}})-(\\d{{2}})\\.csv
        group_names:
          - name
          - year
          - month
"""
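As a quick sanity check outside of Great Expectations, the taxi_data_year_month pattern can be exercised with Python's re module; the regex capture groups correspond, in order, to the configured group_names. (Note that the doubled braces in the f-string above render down to single braces in the final YAML.)

```python
import re

# Pattern from the taxi_data_year_month asset, with the f-string's
# doubled braces rendered down to single braces:
pattern = r"([\w]+)_tripdata_sample_(\d{4})-(\d{2})\.csv"

# A filename of the form the samples_2020 directory is expected to contain:
match = re.fullmatch(pattern, "yellow_tripdata_sample_2020-01.csv")
name, year, month = match.groups()
# name == "yellow", year == "2020", month == "01"
```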
Preliminary Steps
1. Instantiate your project's DataContext
Import these necessary packages and modules.
import os
from ruamel import yaml
import great_expectations as ge
from great_expectations.core.batch import BatchRequest
Load your DataContext into memory using the get_context() method.
context = ge.get_context()
2. Configure your Datasource
Using the above example configuration, add in the path to a directory that contains your data. Then run this code to test your configuration:
context.test_yaml_config(datasource_yaml)
Given that the glob_directive in the example configuration is *.csv, if you specified a directory containing CSV files, you will see them listed as Available data_asset_names in the output of test_yaml_config(). Feel free to adjust your configuration and re-run test_yaml_config() as needed.
3. Save the Datasource configuration to your DataContext
Save the configuration into your DataContext using the add_datasource() function.
context.add_datasource(**yaml.load(datasource_yaml))
Splitting and Sampling

To configure Splitting, specify a dimension (i.e., column_name or column_names), the method of Splitting, and the parameters to be used by the specified Splitting method.

To configure Sampling, specify the method of Sampling and the parameters to be used by the specified Sampling method.
batch_request = BatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="configured_data_connector_name",
    data_asset_name="<YOUR_DATA_ASSET_NAME>",
    data_connector_query={
        "batch_filter_parameters": {
            "<YOUR_BATCH_FILTER_PARAMETER_KEY>": "<YOUR_BATCH_FILTER_PARAMETER_VALUE>",
        }
    },
    batch_spec_passthrough={
        "splitter_method": "<YOUR_SPLITTER_METHOD>",
        "splitter_kwargs": {
            "<YOUR_SPLITTER_OBJECTIVE_NAME>": "<YOUR_SPLITTER_OBJECTIVE_KEYS>",
            "batch_identifiers": {
                "<YOUR_SPLITTER_OBJECTIVE_0_KEY>": "<YOUR_SPLITTER_OBJECTIVE_0_VALUE>",
                "<YOUR_SPLITTER_OBJECTIVE_1_KEY>": "<YOUR_SPLITTER_OBJECTIVE_1_VALUE>",
                "<YOUR_SPLITTER_OBJECTIVE_2_KEY>": "<YOUR_SPLITTER_OBJECTIVE_2_VALUE>",
                # ...
            },
        },
        "sampling_method": "<YOUR_SAMPLING_METHOD>",
        "sampling_kwargs": {
            "<YOUR_SAMPLING_ARGUMENT_0_NAME>": "<YOUR_SAMPLING_ARGUMENT_0_VALUE>",
            "<YOUR_SAMPLING_ARGUMENT_1_NAME>": "<YOUR_SAMPLING_ARGUMENT_1_VALUE>",
            "<YOUR_SAMPLING_ARGUMENT_2_NAME>": "<YOUR_SAMPLING_ARGUMENT_2_VALUE>",
            # ...
        },
    },
)
Currently, the configuration of Splitting and Sampling as part of the YAML configuration is not supported; it must be done using batch_spec_passthrough as illustrated above.

To customize the configuration for the present example, first specify the data_connector_query to select the Batch at the Partitioning level of granularity.
batch_request.data_connector_query["batch_filter_parameters"] = {
    "year": "2020",
    "month": "01",
}
Next, specify the Splitting and Sampling directives. For the present example, we can split according to the "passenger_count" column, focusing on two-passenger rides:
batch_request.batch_spec_passthrough["splitter_method"] = "_split_on_column_value"
batch_request.batch_spec_passthrough["splitter_kwargs"] = {
    "column_name": "passenger_count",
    "batch_identifiers": {"passenger_count": 2},
}
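To see what this Splitting directive does to the data, here is a minimal pandas sketch of the same filter, using a hypothetical miniature DataFrame standing in for a taxi-data batch (the column values are invented for illustration):

```python
import pandas as pd

# Hypothetical miniature batch standing in for the taxi data.
df = pd.DataFrame(
    {
        "passenger_count": [1, 2, 2, 3, 2],
        "fare_amount": [5.0, 7.5, 9.0, 12.0, 6.5],
    }
)

# Splitting on column value with batch_identifiers={"passenger_count": 2}
# keeps only the rows whose passenger_count equals 2:
two_passenger_rides = df[df["passenger_count"] == 2]
# 3 of the 5 rows remain
```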
We can then obtain a random 10% of the rows in the batch:
batch_request.batch_spec_passthrough["sampling_method"] = "_sample_using_random"
batch_request.batch_spec_passthrough["sampling_kwargs"] = {"p": 1.0e-1}
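Conceptually, sampling at fraction p keeps each row independently with probability p. The following sketch illustrates this with plain pandas on an invented 1000-row DataFrame (the seed is fixed only to make the sketch reproducible):

```python
import random

import pandas as pd

# Hypothetical stand-in for a batch of taxi data.
df = pd.DataFrame({"trip_id": range(1000)})

p = 0.1  # keep roughly 10% of the rows, as in sampling_kwargs={"p": 1.0e-1}
random.seed(0)  # fixed seed only so the sketch is reproducible

# Keep each row independently with probability p:
mask = [random.random() < p for _ in range(len(df))]
sampled = df[mask]
# len(sampled) is close to 100, i.e. about 10% of the 1000 rows
```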
Finally, confirm that the expected number of batches was retrieved and that the batch size was reduced (due to Sampling):
batch_list = context.get_batch_list(batch_request=batch_request)
assert len(batch_list) == 1
assert batch_list[0].data.dataframe.shape[0] < 200
Additional Notes
Available Splitting methods and their configuration parameters:

| Method | Parameters | Returned Batch Data |
| --- | --- | --- |
| _split_on_whole_table | N/A | identical to original |
| _split_on_column_value | column_name='col', batch_identifiers={'col': value} | rows where the value of column_name equals the specified value |
| _split_on_converted_datetime | column_name='col', date_format_string=<'%Y-%m-%d'>, batch_identifiers={'col': matching_string} | rows where the value of column_name, converted to a datetime using the given date_format_string, equals the matching string provided for the specified column_name |
| _split_on_divided_integer | column_name='col', divisor=<int>, batch_identifiers={'col': matching_divisor} | rows where the value of column_name, divided (using integer division) by the given divisor, equals the matching_divisor provided for the specified column_name |
| _split_on_mod_integer | column_name='col', mod=<int>, batch_identifiers={'col': matching_mod_value} | rows where the value of column_name, taken modulo the given mod, equals the matching_mod_value provided for the specified column_name |
| _split_on_multi_column_values | column_names='<list[col]>', batch_identifiers={'col_0': value_0, 'col_1': value_1, 'col_2': value_2, ...} | rows where the values of column_names equal the values given for the corresponding column names |
| _split_on_hashed_column | column_name='col', hash_digits=<int>, hash_function_name=<'md5'>, batch_identifiers={'hash_value': value} | rows where the value of column_name, hashed (using the specified hash_function_name) and truncated to the stated number of hash_digits, equals the hash_value provided for the specified column_name |
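As an illustration of one of these, here is a minimal pandas sketch of what _split_on_mod_integer selects, using an invented ten-row DataFrame:

```python
import pandas as pd

# Hypothetical ten-row stand-in for a batch.
df = pd.DataFrame({"vendor_id": range(10)})

# _split_on_mod_integer with mod=3 and batch_identifiers={"vendor_id": 1}
# keeps the rows where vendor_id % 3 == 1:
batch = df[df["vendor_id"] % 3 == 1]
# remaining vendor_id values: 1, 4, 7
```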
Available Sampling methods and their configuration parameters:

| Method | Parameters | Returned Batch Data |
| --- | --- | --- |
| _sample_using_random | p=fraction | rows selected at random, amounting to the selected fraction of the total number of rows in the batch |
| _sample_using_mod | column_name='col', mod=<int> | takes the named column modulo the given mod and keeps only the rows that match the given value |
| _sample_using_a_list | column_name='col', value_list=<list[val]> | matches the values in the named column against value_list and keeps only the matches |
| _sample_using_hash | column_name='col', hash_digits=<int>, hash_value=<str>, hash_function_name=<'md5'> | hashes the values in the named column (using the specified hash_function_name) and keeps only the rows that match the given hash_value |
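Similarly, the effect of _sample_using_a_list can be sketched in plain pandas with an invented DataFrame:

```python
import pandas as pd

# Hypothetical stand-in for a batch with a categorical column.
df = pd.DataFrame({"vendor": ["A", "B", "C", "A", "D"]})

# _sample_using_a_list with column_name="vendor", value_list=["A", "C"]
# keeps only the rows whose vendor value appears in the list:
kept = df[df["vendor"].isin(["A", "C"])]
# rows with vendor values "A", "C", "A" remain
```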
To view the full script used in this page, see it on GitHub: