How to get one or more Batches of data from a configured Datasource
This guide will help you load a Batch for validation using an active Data Connector. For guides on loading Batches of data from specific Datasources using a Data Connector, see the Datasource-specific guides in the "Connecting to your data" section.
A Validator knows how to Validate a particular Batch of data on a particular Execution Engine against a particular Expectation Suite. In interactive mode, the Validator can store and update an Expectation Suite while conducting Data Discovery or Exploratory Data Analysis.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- Configured and loaded a Data Context
- Configured a Datasource and Data Connector
Steps: Loading one or more Batches of data
To load one or more Batches, the steps you will take are the same regardless of the type of Datasource or Data Connector you have set up. To learn more about Datasources, Data Connectors, and Batches, see our Datasources Guide.
1. Construct a BatchRequest
As outlined in the Datasource and Data Connector docs mentioned above, this BatchRequest must reference a previously configured Datasource and Data Connector.
from great_expectations.core.batch import BatchRequest

# Here is an example BatchRequest for all Batches associated with the specified Data Asset
batch_request = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
)
Since a BatchRequest can return multiple Batches, you can optionally provide additional parameters to filter the retrieved Batches. See the Datasources Guide for more information on filtering besides batch_filter_parameters and limit, including custom filter functions and sampling. The example BatchRequests below show several non-exhaustive possibilities.
# Here is an example data_connector_query filtering based on an index which can be
# any valid python slice. The example here is retrieving the latest batch using -1:
data_connector_query_last_index = {
    "index": -1,
}
last_index_batch_request = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_last_index,
)
# This BatchRequest adds a query to retrieve only the twelve batches from 2020
data_connector_query_2020 = {
    "batch_filter_parameters": {"group_name_from_your_data_connector_eg_year": "2020"}
}
batch_request_2020 = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_2020,
)
# This BatchRequest adds a query and limit to retrieve only the first 5 batches from 2020.
# Note that the limit is applied after the data_connector_query filtering. This behavior is
# different than using an index, which is applied before the other query filters.
data_connector_query_2020 = {
    "batch_filter_parameters": {
        "group_name_from_your_data_connector_eg_year": "2020",
    }
}
batch_request_2020 = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_2020,
    limit=5,
)
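To make the ordering note concrete, here is a small plain-Python sketch of the two behaviors; the dicts simply stand in for batch_identifiers, and no Great Expectations objects are involved:

```python
# Plain-Python model of the ordering described above: each Batch is represented
# here only by its batch_identifiers dict.
batches = [
    {"year": "2019", "month": "12"},
    {"year": "2020", "month": "01"},
    {"year": "2020", "month": "02"},
    {"year": "2020", "month": "03"},
]

def filter_then_limit(batches, predicate, limit):
    # "limit" is applied AFTER the batch_filter_parameters-style filtering
    return [b for b in batches if predicate(b)][:limit]

def index_then_filter(batches, index, predicate):
    # an "index" slice is applied BEFORE the other query filters
    return [b for b in batches[index] if predicate(b)]

is_2020 = lambda b: b["year"] == "2020"
print(filter_then_limit(batches, is_2020, 2))   # the first two 2020 batches
print(index_then_filter(batches, slice(-1, None), is_2020))  # filter applied only to the last batch
```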
# Here is an example data_connector_query filtering based on parameters from group_names
# previously defined in a regex pattern in your Data Connector:
data_connector_query_202001 = {
    "batch_filter_parameters": {
        "group_name_from_your_data_connector_eg_year": "2020",
        "group_name_from_your_data_connector_eg_month": "01",
    }
}
batch_request_202001 = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_202001,
)
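The custom filter functions mentioned above work by examining each Batch's batch_identifiers. Here is a hedged sketch; the group names are placeholders matching the examples above, and the date range chosen is arbitrary:

```python
# A custom_filter_function receives each Batch's batch_identifiers dict and
# returns True for Batches that should be kept.
def keep_first_quarter(batch_identifiers: dict) -> bool:
    # Keep only Batches from January through March of 2020
    return (
        batch_identifiers["group_name_from_your_data_connector_eg_year"] == "2020"
        and batch_identifiers["group_name_from_your_data_connector_eg_month"]
        in ("01", "02", "03")
    )

data_connector_query_q1 = {"custom_filter_function": keep_first_quarter}
# Pass this as data_connector_query= to a BatchRequest, exactly like the
# batch_filter_parameters queries in the examples above.
```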
You may also wish to list available Batches to verify that your BatchRequest is retrieving the correct Batches, or to see which are available. You can use context.get_batch_list() for this purpose by passing it your BatchRequest:
batch_list = context.get_batch_list(batch_request=batch_request)
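If you want something more readable than a raw print of the list, a small helper can pull out each Batch's identifying parameters. This is a sketch, assuming each Batch exposes them via batch.batch_definition.batch_identifiers:

```python
# Collect the identifying parameters of every retrieved Batch so the result
# of a BatchRequest can be inspected at a glance.
def summarize_batch_list(batch_list):
    return [dict(batch.batch_definition.batch_identifiers) for batch in batch_list]

# e.g. print(summarize_batch_list(batch_list))
```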
2. Get access to your Batches via a Validator
# Now we can review a sample of data using a Validator
context.add_or_update_expectation_suite(expectation_suite_name="test_suite")
validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name="test_suite"
)
3. Check your data
You can check that the Batches that were loaded into your Validator are what you expect by running:
print(validator.batches)
You can also check that the first few lines of the Batches you loaded into your Validator are what you expect by running:
print(validator.head())
Now that you have a Validator, you can use it to create Expectations or validate the data.
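As a sketch of that interactive workflow: each expect_* call both validates the loaded Batches and records the Expectation in the Validator's in-memory suite. The column name "passenger_count" below is a placeholder; substitute one of your own columns.

```python
# Interactively build up a few basic Expectations against the loaded Batches,
# then persist them back to the Expectation Suite.
def add_basic_expectations(validator, column_name):
    validator.expect_column_to_exist(column=column_name)
    validator.expect_column_values_to_not_be_null(column=column_name)
    # Save the accumulated Expectations, keeping even those that failed
    # against this particular data.
    validator.save_expectation_suite(discard_failed_expectations=False)

# add_basic_expectations(validator, "passenger_count")
```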
To view the full script used in this page, see it on GitHub: