How to get one or more Batches of data from a configured Datasource
This guide will help you load a Batch (a selection of records from a Data Asset) for validation using an active Data Connector (which provides the configuration details, based on the source data system, that a Datasource needs to define Data Assets). For guides on loading Batches of data from specific Datasources (which provide a standard API for accessing and interacting with data from a wide variety of source systems) using a Data Connector, see the Datasource-specific guides in the "Connecting to your data" section.
A Validator (used to run an Expectation Suite against data) knows how to Validate (apply an Expectation Suite to a Batch) a particular Batch of data on a particular Execution Engine (a system capable of processing data to compute Metrics) against a particular Expectation Suite (a collection of verifiable assertions about data). In interactive mode, the Validator can store and update an Expectation Suite while conducting Data Discovery or Exploratory Data Analysis.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- Configured and loaded a Data Context
- Configured a Datasource and Data Connector
Steps: Loading one or more Batches of data
To load one or more Batches, the steps you will take are the same regardless of the type of Datasource or Data Connector you have set up. To learn more about Datasources, Data Connectors, and Batches, see our Datasources Guide.
1. Construct a BatchRequest
As outlined in the Datasource and Data Connector docs mentioned above, this BatchRequest must reference a previously configured Datasource and Data Connector.
from great_expectations.core.batch import BatchRequest

# Here is an example BatchRequest for all Batches associated with the specified Data Asset
batch_request = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
)
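The three names in a BatchRequest refer back to your Datasource configuration. As a point of reference, here is a hypothetical 0.15.x-style Datasource YAML (every name is a placeholder, and your data connector class and regex will differ) showing where each value comes from:

```yaml
# Hypothetical Datasource configuration (all names are placeholders).
# datasource_name in the BatchRequest matches "name" here;
# data_connector_name matches a key under "data_connectors";
# data_asset_name is inferred from files via the regex group below.
name: insert_your_datasource_name_here
class_name: Datasource
execution_engine:
  class_name: PandasExecutionEngine
data_connectors:
  insert_your_data_connector_name_here:
    class_name: InferredAssetFilesystemDataConnector
    base_directory: data/
    default_regex:
      group_names:
        - data_asset_name
        - year
        - month
      pattern: (.*)_(\d{4})-(\d{2})\.csv
```

With a configuration like this, the `year` and `month` group names are what you would pass as `batch_filter_parameters` in the examples below.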
Since a BatchRequest can return multiple Batches, you can optionally provide additional parameters to filter the retrieved Batches. See the Datasources Guide for more information on filtering beyond batch_filter_parameters and limit, including custom filter functions and sampling. The example BatchRequests below show several non-exhaustive possibilities.
# Here is an example data_connector_query filtering based on an index, which can be
# any valid Python slice. The example here retrieves the latest Batch using -1:
data_connector_query_last_index = {
    "index": -1,
}
last_index_batch_request = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_last_index,
)
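Because index accepts any valid Python slice, you can reason about it with plain list slicing. This sketch uses ordinary Python (no Great Expectations involved) with made-up monthly identifiers to show what different index values would select from a sorted Batch list:

```python
# Stand-in for a sorted list of Batch identifiers, one per month (hypothetical).
sorted_batches = [f"2020-{month:02d}" for month in range(1, 13)]

# "index": -1 keeps only the latest Batch:
latest = sorted_batches[-1:]
print(latest)  # ['2020-12']

# A slice covering the last three positions keeps the three most recent Batches:
last_three = sorted_batches[slice(-3, None)]
print(last_three)  # ['2020-10', '2020-11', '2020-12']
```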
# This BatchRequest adds a query to retrieve only the twelve Batches from 2020
data_connector_query_2020 = {
    "batch_filter_parameters": {"group_name_from_your_data_connector_eg_year": "2020"}
}
batch_request_2020 = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_2020,
)
# This BatchRequest adds a query and limit to retrieve only the first 5 Batches from 2020.
# Note that the limit is applied after the data_connector_query filtering. This behavior is
# different than using an index, which is applied before the other query filters.
data_connector_query_2020 = {
    "batch_filter_parameters": {
        "group_name_from_your_data_connector_eg_year": "2020",
    }
}
batch_request_2020 = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_2020,
    limit=5,
)
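To make the ordering concrete, here is a plain-Python sketch (again, not Great Expectations code) of the difference described in the comment above, using hypothetical monthly identifiers that span two years:

```python
# Hypothetical sorted Batch identifiers: all of 2020 plus one 2021 Batch.
all_batches = [f"2020-{m:02d}" for m in range(1, 13)] + ["2021-01"]

# limit is applied AFTER batch_filter_parameters: filter to 2020 first, then cap at 5.
filtered_2020 = [b for b in all_batches if b.startswith("2020")]
first_five_of_2020 = filtered_2020[:5]
print(first_five_of_2020)  # ['2020-01', '2020-02', '2020-03', '2020-04', '2020-05']

# An index, by contrast, is applied BEFORE the other query filters: taking the
# last Batch first would select 2021-01, which a "2020" filter would then drop.
last_overall = all_batches[-1:]
print(last_overall)  # ['2021-01']
```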
# Here is an example data_connector_query filtering based on parameters from group_names
# previously defined in a regex pattern in your Data Connector:
data_connector_query_202001 = {
    "batch_filter_parameters": {
        "group_name_from_your_data_connector_eg_year": "2020",
        "group_name_from_your_data_connector_eg_month": "01",
    }
}
batch_request_202001 = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_202001,
)
You may also wish to list available Batches to verify that your BatchRequest is retrieving the correct Batches, or to see which are available. You can use context.get_batch_list() for this purpose by passing it your BatchRequest:
batch_list = context.get_batch_list(batch_request=batch_request)
2. Get access to your Batches via a Validator
# Now we can review a sample of data using a Validator
context.add_or_update_expectation_suite(expectation_suite_name="test_suite")
validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name="test_suite"
)
3. Check your data
You can check that the Batches loaded into your Validator are what you expect by running:
print(validator.batches)
You can also check that the first few lines of the Batches you loaded into your Validator are what you expect by running:
print(validator.head())
Now that you have a Validator, you can use it to create Expectations or validate the data.
To view the full script used in this page, see it on GitHub: