How to get a Batch of data from a configured Datasource
This guide will help you load a Batch (a selection of records from a Data Asset) for introspection and validation using an active Data Connector (which provides the configuration details, based on the source data system, that a Datasource needs to define Data Assets). For guides on loading batches of data from specific Datasources (which provide a standard API for accessing and interacting with data from a wide variety of source systems) using a Data Connector, see the Datasource-specific guides in the "Connecting to your data" section.
What used to be called a “Batch” in the old API has been replaced with the Validator (used to run an Expectation Suite against data). A Validator knows how to Validate (that is, apply an Expectation Suite to) a particular Batch of data on a particular Execution Engine (a system capable of processing data to compute Metrics) against a particular Expectation Suite (a collection of verifiable assertions about data). In interactive mode, the Validator can store and update an Expectation Suite while conducting Data Discovery or Exploratory Data Analysis.
You can read more about the core classes that make Great Expectations run in our Core Concepts reference guide.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- Configured and loaded a Data Context
- Configured a Datasource and Data Connector
Steps: Loading a Batch of data
To load a Batch, the steps you will take are the same regardless of the type of Datasource or Data Connector you have set up. To learn more about Datasources, Data Connectors, and Batches, see the Datasources Core Concepts Guide in the Core Concepts reference guide.
1. Construct a BatchRequest
from great_expectations.core.batch import BatchRequest

# Here is an example BatchRequest for all batches associated with the specified DataAsset
batch_request = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
)
Since a BatchRequest can return multiple Batches, you can optionally provide additional parameters to filter the retrieved Batches. See the Datasources Core Concepts Guide for more information on filtering besides batch_filter_parameters and limit, including custom filter functions and sampling. The example BatchRequests below show several non-exhaustive possibilities.
# This BatchRequest adds a query and limit to retrieve only the first 5 batches from 2020
data_connector_query_2020 = {
    "batch_filter_parameters": {"param_1_from_your_data_connector_eg_year": "2020"}
}

batch_request_2020 = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_2020,
    limit=5,
)
# Here is an example `data_connector_query` filtering based on parameters from `group_names`
# previously defined in a regex pattern in your Data Connector:
data_connector_query_202001 = {
    "batch_filter_parameters": {
        "param_1_from_your_data_connector_eg_year": "2020",
        "param_2_from_your_data_connector_eg_month": "01",
    }
}

# This BatchRequest will use the above filter to retrieve only the batch from Jan 2020
batch_request_202001 = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_202001,
)
# Here is an example `data_connector_query` filtering based on an `index` which can be
# any valid python slice. The example here is retrieving the latest batch using `-1`:
data_connector_query_last_index = {
    "index": -1,
}

last_index_batch_request = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_last_index,
)
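Because `index` accepts any valid Python slice, you can also retrieve a contiguous range of Batches. Here is a minimal sketch, assuming the placeholder names above and that at least three Batches exist:
# This `data_connector_query` uses a slice to retrieve the last three batches
data_connector_query_last_three = {
    "index": slice(-3, None),
}

last_three_batch_request = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_last_three,
)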
You may also wish to list available batches to verify that your BatchRequest is retrieving the correct Batches, or to see which are available. You can use context.get_batch_list() for this purpose; it accepts a variety of flexible input types similar to a BatchRequest. Some examples are shown below:
# List all Batches associated with the DataAsset
batch_list_all_a = context.get_batch_list(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
)

# Alternatively you can use the previously created batch_request to achieve the same thing
batch_list_all_b = context.get_batch_list(batch_request=batch_request)

# You can use a query to filter the batch_list
batch_list_202001_query = context.get_batch_list(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_202001,
)

# Or limit to a specific number of batches
batch_list_all_limit_10 = context.get_batch_list(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    limit=10,
)
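To verify exactly which Batches a call retrieved, you can inspect each Batch's batch_definition, which carries the identifiers resolved from your Data Connector configuration. A minimal sketch, assuming the lists created above:
# Print the identifiers of each retrieved Batch to confirm the query behaved as expected
for batch in batch_list_all_b:
    print(batch.batch_definition)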
2. Get access to your Batch via a Validator
# First create an expectation suite to use with our validator
context.create_expectation_suite(
    expectation_suite_name="test_suite", overwrite_existing=True
)

# Now create our validator
validator = context.get_validator(
    batch_request=last_index_batch_request, expectation_suite_name="test_suite"
)
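Note that context.get_validator() accepts the same flexible inputs as context.get_batch_list(), so constructing an explicit BatchRequest first is optional. A sketch, assuming the placeholder names and query defined above:
# Equivalent to the call above, but passing the components directly
validator = context.get_validator(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_last_index,
    expectation_suite_name="test_suite",
)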
3. Check your data
You can check that the first few lines of the Batch you loaded into your Validator are what you expect by running:
print(validator.head())
Now that you have a Validator, you can use it to create Expectations or validate the data.
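For example, in interactive mode you can add an Expectation through the Validator, which evaluates it against the loaded Batch and updates the in-memory suite, and then persist the suite. A minimal sketch (the column name is a hypothetical placeholder):
# Adding an Expectation through the Validator also evaluates it against the loaded Batch
validator.expect_column_values_to_not_be_null(column="<YOUR_COLUMN_NAME>")

# Persist the updated suite, keeping any Expectations that failed during exploration
validator.save_expectation_suite(discard_failed_expectations=False)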
Additional Batch querying and loading examples
We will use the "Yellow Taxi" dataset example from How to configure a DataConnector to introspect and partition a file system or blob store to demonstrate the Batch querying possibilities enabled by the particular data partitioning strategy specified as part of the Data Connector configuration.
1. Partition only by file name and type
In this example, the Data Asset (a collection of records within a Datasource, usually named based on the underlying data system and sliced to correspond to a desired specification) that represents a relatively general naming structure (files in a directory, each with a certain file name prefix, e.g., yellow_tripdata_sample_, and contents of the desired type, e.g., CSV) is taxi_data_flat in the Data Connector configured_data_connector_name:
configured_data_connector_name:
  class_name: ConfiguredAssetFilesystemDataConnector
  base_directory: <PATH_TO_YOUR_DATA_HERE>
  glob_directive: "*.csv"
  default_regex:
    pattern: (.*)
    group_names:
      - data_asset_name
  assets:
    taxi_data_flat:
      base_directory: samples_2020
      pattern: (yellow_tripdata_sample_.+)\.csv
      group_names:
        - filename
To query for Batch objects, set data_asset_name to taxi_data_flat in the following BatchRequest specification. (Customize for your own data set, as appropriate.)
batch_request = BatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="configured_data_connector_name",
    data_asset_name="<YOUR_DATA_ASSET_NAME>",
)
Then perform the relevant checks: verify that the expected number of Batch objects was retrieved and confirm the size of a Batch. For example (be sure to adjust this code to match the specifics of your data and configuration):
batch_list = context.get_batch_list(batch_request=batch_request)
assert len(batch_list) == 12
assert batch_list[0].data.dataframe.shape[0] == 10000
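You can also confirm how individual files were mapped to Batches by inspecting each Batch's identifiers, which for this Data Asset are populated from the filename regex group. A sketch, assuming the batch_list retrieved above:
# Each Batch carries the `filename` identifier captured by the Data Asset's regex
for batch in batch_list:
    print(batch.batch_definition.batch_identifiers)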
2. Partition by year and month
Next, use the more detailed partitioning strategy represented by the Data Asset taxi_data_year_month in the Data Connector configured_data_connector_name:
configured_data_connector_name:
  class_name: ConfiguredAssetFilesystemDataConnector
  base_directory: <PATH_TO_YOUR_DATA_HERE>
  glob_directive: "*.csv"
  default_regex:
    pattern: (.*)
    group_names:
      - data_asset_name
  assets:
    taxi_data_flat:
      base_directory: samples_2020
      pattern: (yellow_tripdata_sample_.+)\.csv
      group_names:
        - filename
    taxi_data_year_month:
      base_directory: samples_2020
      pattern: ([\w]+)_tripdata_sample_(\d{4})-(\d{2})\.csv
      group_names:
        - name
        - year
        - month
The Data Asset taxi_data_year_month in the above example configuration identifies three parts of a file path: name (as in "company name"), year, and month. This partitioning affords a rich set of filtering capabilities, ranging from specifying the exact values of the file name structure's components to allowing Python functions for implementing custom criteria.
To perform experiments supported by this configuration, set data_asset_name to taxi_data_year_month in the following BatchRequest specification (customize for your own data set, as appropriate):
batch_request = BatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="configured_data_connector_name",
    data_asset_name="<YOUR_DATA_ASSET_NAME>",
    data_connector_query={"custom_filter_function": "<YOUR_CUSTOM_FILTER_FUNCTION>"},
)
To obtain the data for the nine months of February through October, apply the following custom filter:
batch_request.data_connector_query["custom_filter_function"] = (
    lambda batch_identifiers: batch_identifiers["name"] == "yellow"
    and 1 < int(batch_identifiers["month"]) < 11
)
Now, perform the relevant checks: verify that the expected number of Batch objects was retrieved and confirm the size of a Batch:
batch_list = context.get_batch_list(batch_request=batch_request)
assert len(batch_list) == 9
assert batch_list[0].data.dataframe.shape[0] == 10000
You can then identify a particular Batch (e.g., corresponding to the year and month of interest) and retrieve it for data analysis as follows:
batch_request = BatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="configured_data_connector_name",
    data_asset_name="<YOUR_DATA_ASSET_NAME>",
    data_connector_query={
        "batch_filter_parameters": {
            "<YOUR_BATCH_FILTER_PARAMETER_0_KEY>": "<YOUR_BATCH_FILTER_PARAMETER_0_VALUE>",
            "<YOUR_BATCH_FILTER_PARAMETER_1_KEY>": "<YOUR_BATCH_FILTER_PARAMETER_1_VALUE>",
            "<YOUR_BATCH_FILTER_PARAMETER_2_KEY>": "<YOUR_BATCH_FILTER_PARAMETER_2_VALUE>",
        }
    },
)
Note that in the present example, there can be up to three BATCH_FILTER_PARAMETER key-value pairs, because the regular expression for the data asset taxi_data_year_month defines three groups: name, year, and month.
batch_request.data_connector_query["batch_filter_parameters"] = {
    "year": "2020",
    "month": "01",
}
(Be sure to adjust the above code snippets to match the specifics of your data and configuration.)
Now, perform the relevant checks: verify that the expected number of Batch objects was retrieved and confirm the size of a Batch:
batch_list = context.get_batch_list(batch_request=batch_request)
assert len(batch_list) == 1
assert batch_list[0].data.dataframe.shape[0] == 10000
Omitting the batch_filter_parameters key from the data_connector_query is interpreted as the least restrictive (broadest) query, resulting in the largest number of Batch objects being returned.
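For instance, clearing the filter from the previous BatchRequest returns every available Batch for the Data Asset. A sketch, assuming the same twelve monthly sample files as above:
# Removing the filter yields the least restrictive query: all Batches are returned
batch_request.data_connector_query.pop("batch_filter_parameters", None)
batch_list = context.get_batch_list(batch_request=batch_request)
assert len(batch_list) == 12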
To view the full script used in this page, see it on GitHub: