How to get a Batch of data from a configured Datasource
This guide will help you load a Batch (a selection of records from a Data Asset) for introspection and validation using an active Data Connector (which provides the configuration details, based on the source data system, that a Datasource needs to define Data Assets). For guides on loading batches of data from specific Datasources (which provide a standard API for accessing and interacting with data from a wide variety of source systems) using a Data Connector, see the Datasource-specific guides in the "Connecting to your data" section.
What used to be called a “Batch” in the old API has been replaced with the Validator (used to run an Expectation Suite against data). A Validator knows how to Validate (that is, apply an Expectation Suite to) a particular Batch of data on a particular Execution Engine (a system capable of processing data to compute Metrics) against a particular Expectation Suite (a collection of verifiable assertions about data). In interactive mode, the Validator can store and update an Expectation Suite while conducting Data Discovery or Exploratory Data Analysis.
You can read more about the core classes that make Great Expectations run in our Core Concepts reference guide.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- Configured and loaded a Data Context
- Configured a Datasource and Data Connector
Steps: Loading a Batch of data
To load a Batch, the steps you will take are the same regardless of the type of Datasource or Data Connector you have set up. To learn more about Datasources, Data Connectors, and Batches, see the Datasources Core Concepts Guide in the Core Concepts reference guide.
1. Construct a BatchRequest
from great_expectations.core.batch import BatchRequest

# Here is an example BatchRequest for all batches associated with the specified DataAsset
batch_request = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
)
Since a BatchRequest can return multiple Batches, you can optionally provide additional parameters to filter the retrieved Batches. See the Datasources Core Concepts Guide for more information on filtering besides batch_filter_parameters and limit, including custom filter functions and sampling. The example BatchRequests below show several non-exhaustive possibilities.
# This BatchRequest adds a query and limit to retrieve only the first 5 batches from 2020
data_connector_query_2020 = {
    "batch_filter_parameters": {"param_1_from_your_data_connector_eg_year": "2020"}
}

batch_request_2020 = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_2020,
    limit=5,
)
# Here is an example `data_connector_query` filtering based on parameters from `group_names`
# previously defined in a regex pattern in your Data Connector:
data_connector_query_202001 = {
    "batch_filter_parameters": {
        "param_1_from_your_data_connector_eg_year": "2020",
        "param_2_from_your_data_connector_eg_month": "01",
    }
}

# This BatchRequest will use the above filter to retrieve only the batch from Jan 2020
batch_request_202001 = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_202001,
)
# Here is an example `data_connector_query` filtering based on an `index` which can be
# any valid python slice. The example here is retrieving the latest batch using `-1`:
data_connector_query_last_index = {
    "index": -1,
}

last_index_batch_request = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_last_index,
)
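Because `index` accepts any valid Python slice, you can also retrieve a contiguous range of Batches. Here is a minimal sketch, assuming the placeholder names above and that at least three Batches exist:
# This `data_connector_query` uses a slice to retrieve the last three batches
data_connector_query_last_three = {
    "index": slice(-3, None),
}

last_three_batch_request = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_last_three,
)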
You may also wish to list available batches to verify that your BatchRequest is retrieving the correct Batches, or to see which are available. You can use context.get_batch_list() for this purpose; it accepts a variety of flexible input types similar to a BatchRequest. Some examples are shown below:
# List all Batches associated with the DataAsset
batch_list_all_a = context.get_batch_list(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
)

# Alternatively you can use the previously created batch_request to achieve the same thing
batch_list_all_b = context.get_batch_list(batch_request=batch_request)

# You can use a query to filter the batch_list
batch_list_202001_query = context.get_batch_list(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_202001,
)

# Or limit to a specific number of batches
batch_list_all_limit_10 = context.get_batch_list(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    limit=10,
)
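To verify exactly which Batches a call retrieved, you can inspect each Batch's batch_definition, which carries the identifiers resolved from your Data Connector configuration. A minimal sketch, assuming the lists created above:
# Print the identifiers of each retrieved Batch to confirm the query behaved as expected
for batch in batch_list_all_b:
    print(batch.batch_definition)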
2. Get access to your Batch via a Validator
# First create an expectation suite to use with our validator
context.create_expectation_suite(
    expectation_suite_name="test_suite", overwrite_existing=True
)

# Now create our validator
validator = context.get_validator(
    batch_request=last_index_batch_request, expectation_suite_name="test_suite"
)
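Note that context.get_validator() accepts the same flexible inputs as context.get_batch_list(), so constructing an explicit BatchRequest first is optional. A sketch, assuming the placeholder names and query defined above:
# Equivalent to the call above, but passing the components directly
validator = context.get_validator(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_last_index,
    expectation_suite_name="test_suite",
)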
3. Check your data
You can check that the first few lines of the Batch you loaded into your Validator are what you expect by running:
print(validator.head())
Now that you have a Validator, you can use it to create Expectations or validate the data.
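For example, in interactive mode you can add an Expectation through the Validator, which evaluates it against the loaded Batch and updates the in-memory suite, and then persist the suite. A minimal sketch (the column name is a hypothetical placeholder):
# Adding an Expectation through the Validator also evaluates it against the loaded Batch
validator.expect_column_values_to_not_be_null(column="<YOUR_COLUMN_NAME>")

# Persist the updated suite, keeping any Expectations that failed during exploration
validator.save_expectation_suite(discard_failed_expectations=False)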
Additional Batch querying and loading examples
We will use the "Yellow Taxi" dataset example from How to configure a DataConnector to introspect and partition a file system or blob store to demonstrate the Batch querying possibilities enabled by the particular data partitioning strategy specified as part of the Data Connector configuration.
1. Partition only by file name and type
In this example, the Data Asset (a collection of records within a Datasource, usually named based on the underlying data system and sliced to correspond to a desired specification) that represents a relatively general naming structure (files in a directory, each with a certain file name prefix, e.g., yellow_tripdata_sample_, and contents of the desired type, e.g., CSV) is taxi_data_flat in the Data Connector configured_data_connector_name:
configured_data_connector_name:
  class_name: ConfiguredAssetFilesystemDataConnector
  base_directory: <PATH_TO_YOUR_DATA_HERE>
  glob_directive: "*.csv"
  default_regex:
    pattern: (.*)
    group_names:
      - data_asset_name
  assets:
    taxi_data_flat:
      base_directory: samples_2020
      pattern: (yellow_tripdata_sample_.+)\.csv
      group_names:
        - filename
To query for Batch objects, set data_asset_name to taxi_data_flat in the following BatchRequest specification. (Customize for your own data set, as appropriate.)
batch_request = BatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="configured_data_connector_name",
    data_asset_name="<YOUR_DATA_ASSET_NAME>",
)
Then perform the relevant checks: verify that the expected number of Batch objects was retrieved and confirm the size of a Batch. For example (be sure to adjust this code to match the specifics of your data and configuration):
batch_list = context.get_batch_list(batch_request=batch_request)
assert len(batch_list) == 12
assert batch_list[0].data.dataframe.shape[0] == 10000
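You can also confirm how individual files were mapped to Batches by inspecting each Batch's identifiers, which for this Data Asset are populated from the filename regex group. A sketch, assuming the batch_list retrieved above:
# Each Batch carries the `filename` identifier captured by the Data Asset's regex
for batch in batch_list:
    print(batch.batch_definition.batch_identifiers)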
2. Partition by year and month
Next, use the more detailed partitioning strategy represented by the Data Asset taxi_data_year_month in the Data Connector configured_data_connector_name:
configured_data_connector_name:
  class_name: ConfiguredAssetFilesystemDataConnector
  base_directory: <PATH_TO_YOUR_DATA_HERE>
  glob_directive: "*.csv"
  default_regex:
    pattern: (.*)
    group_names:
      - data_asset_name
  assets:
    taxi_data_flat:
      base_directory: samples_2020
      pattern: (yellow_tripdata_sample_.+)\.csv
      group_names:
        - filename
    taxi_data_year_month:
      base_directory: samples_2020
      pattern: ([\w]+)_tripdata_sample_(\d{4})-(\d{2})\.csv
      group_names:
        - name
        - year
        - month
The Data Asset taxi_data_year_month in the above example configuration identifies three parts of a file path: name (as in "company name"), year, and month. This partitioning affords a rich set of filtering capabilities, ranging from specifying the exact values of the file name structure's components to allowing Python functions for implementing custom criteria.
To perform experiments supported by this configuration, set data_asset_name to taxi_data_year_month in the following BatchRequest specification (customize for your own data set, as appropriate):
batch_request = BatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="configured_data_connector_name",
    data_asset_name="<YOUR_DATA_ASSET_NAME>",
    data_connector_query={"custom_filter_function": "<YOUR_CUSTOM_FILTER_FUNCTION>"},
)
To obtain the data for the nine months of February through October, apply the following custom filter:
batch_request.data_connector_query["custom_filter_function"] = (
    lambda batch_identifiers: batch_identifiers["name"] == "yellow"
    and 1 < int(batch_identifiers["month"]) < 11
)
Now, perform the relevant checks: verify that the expected number of Batch objects was retrieved and confirm the size of a Batch:
batch_list = context.get_batch_list(batch_request=batch_request)
assert len(batch_list) == 9
assert batch_list[0].data.dataframe.shape[0] == 10000
You can then identify a particular Batch (e.g., corresponding to the year and month of interest) and retrieve it for data analysis as follows:
batch_request = BatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="configured_data_connector_name",
    data_asset_name="<YOUR_DATA_ASSET_NAME>",
    data_connector_query={
        "batch_filter_parameters": {
            "<YOUR_BATCH_FILTER_PARAMETER_0_KEY>": "<YOUR_BATCH_FILTER_PARAMETER_0_VALUE>",
            "<YOUR_BATCH_FILTER_PARAMETER_1_KEY>": "<YOUR_BATCH_FILTER_PARAMETER_1_VALUE>",
            "<YOUR_BATCH_FILTER_PARAMETER_2_KEY>": "<YOUR_BATCH_FILTER_PARAMETER_2_VALUE>",
        }
    },
)
Note that in the present example, there can be up to three BATCH_FILTER_PARAMETER key-value pairs, because the regular expression for the data asset taxi_data_year_month defines three groups: name, year, and month.
batch_request.data_connector_query["batch_filter_parameters"] = {
    "year": "2020",
    "month": "01",
}
(Be sure to adjust the above code snippets to match the specifics of your data and configuration.)
Now, perform the relevant checks: verify that the expected number of Batch objects was retrieved and confirm the size of a Batch:
batch_list = context.get_batch_list(batch_request=batch_request)
assert len(batch_list) == 1
assert batch_list[0].data.dataframe.shape[0] == 10000
Omitting the batch_filter_parameters key from the data_connector_query is interpreted as the least restrictive (broadest) query, resulting in the largest number of Batch objects being returned.
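For instance, clearing the filter from the previous BatchRequest returns every available Batch for the Data Asset. A sketch, assuming the same twelve monthly sample files as above:
# Removing the filter yields the least restrictive query: all Batches are returned
batch_request.data_connector_query.pop("batch_filter_parameters", None)
batch_list = context.get_batch_list(batch_request=batch_request)
assert len(batch_list) == 12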
To view the full script used in this page, see it on GitHub: