Version: 0.17.23

Organize Batches in a file-based Data Asset

This guide demonstrates how to organize Batches in a file-based Data Asset. This includes how to use a regular expression to indicate which files should be returned as Batches and how to add Batch Sorters to a Data Asset to specify the order in which Batches are returned.

Datasources defined with the block-config method

If you are using a Data Source that was created with the advanced block-config method, see the following resources:

Prerequisites

A working installation of Great Expectations
A Data Source that connects to a location with source data files

Import GX and instantiate a Data Context

Run the following Python code to import GX and instantiate a Data Context:

import great_expectations as gx

context = gx.get_context()

Retrieve a file-based Data Source

For this guide, we will use a previously defined Data Source named "my_datasource". For the purposes of this demonstration, this Data Source is a Pandas Filesystem Data Source that uses a folder named "data" as its base_directory.

To retrieve this Data Source, we will supply the get_datasource(...) method of our Data Context with the name of the Data Source we wish to retrieve:

my_datasource = context.get_datasource("my_datasource")

Create a `batching_regex`

In a file-based Data Asset, any file that matches a provided regular expression (the batching_regex parameter) will be included as a Batch in the Data Asset. Therefore, to organize multiple files into Batches in a single Data Asset we must define a regular expression that will match one or more of our source data files.

For this example, our Data Source points to a folder that contains the following files:

"yellow_tripdata_sample_2019-03.csv"
"yellow_tripdata_sample_2020-07.csv"
"yellow_tripdata_sample_2021-02.csv"

To create a batching_regex that matches multiple files, we will include two named groups, "year" and "month", in our regular expression:

my_batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"

In the above example, the named group "year" will match any four numeric characters in a file name, and the named group "month" will match any two numeric characters. As a result, each of our three source data files matches the regular expression.
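To check a regular expression before using it as a batching_regex, you can try it against your file names with Python's standard re module. The sketch below is independent of GX and assumes the whole file name must match the pattern. The "README.txt" entry is a hypothetical extra file, added only to show a non-match:

```python
import re

# Same pattern as in the guide.
my_batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"

file_names = [
    "yellow_tripdata_sample_2019-03.csv",
    "yellow_tripdata_sample_2020-07.csv",
    "yellow_tripdata_sample_2021-02.csv",
    "README.txt",  # hypothetical extra file, not part of the guide's data
]

pattern = re.compile(my_batching_regex)

# Map each matching file name to the values captured by its named groups.
matched = {}
for name in file_names:
    m = pattern.fullmatch(name)
    if m:
        matched[name] = m.groupdict()

for name, groups in matched.items():
    print(name, groups)
```

Each of the three csv files appears in matched with its "year" and "month" values, while the hypothetical "README.txt" is skipped.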

Why use named groups?

By naming the group in your batching_regex you make it something you can reference in the future. When requesting data from this Data Asset, you can use the names of your regular expression groups to limit the Batches that are returned.

For more information, please see: How to request data from a Data Asset
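To illustrate the idea of limiting results by group name, here is a minimal sketch in plain Python. It does not call GX. The select_files helper is hypothetical and only mimics the effect of keeping the files whose named-group values equal the requested ones:

```python
import re

pattern = re.compile(
    r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
)

file_names = [
    "yellow_tripdata_sample_2019-03.csv",
    "yellow_tripdata_sample_2020-07.csv",
    "yellow_tripdata_sample_2021-02.csv",
]


def select_files(names, **wanted):
    """Hypothetical helper: keep names whose named groups equal every requested value."""
    selected = []
    for name in names:
        m = pattern.fullmatch(name)
        if m is None:
            continue
        groups = m.groupdict()
        if all(groups.get(key) == value for key, value in wanted.items()):
            selected.append(name)
    return selected


only_2020 = select_files(file_names, year="2020")
print(only_2020)
```

Group values are captured as strings, so the requested values in this sketch are strings too ("2020", not 2020).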

What if my source data files are split into different folders?

You can access files that are nested in folders under your Data Source's base_directory!

If your source data files are split into multiple folders, you can use the folder that contains those folders as your base_directory. When you define a Data Asset for your Data Source, you can then include the folder path (relative to your base_directory) in the regular expression that indicates which files to connect to.

For more information on how to format regular expressions, we recommend referencing Python's official how-to guide for working with regular expressions.
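As a rough illustration of including a folder path in the regular expression, the sketch below assumes a hypothetical layout with one sub-folder per year under base_directory, and paths written with forward slashes. Neither the layout nor the nested_regex name comes from this guide:

```python
import re

# Hypothetical paths, relative to base_directory (one sub-folder per year).
relative_paths = [
    "2019/yellow_tripdata_sample_2019-03.csv",
    "2020/yellow_tripdata_sample_2020-07.csv",
    "2021/yellow_tripdata_sample_2021-02.csv",
]

# The folder name supplies the "year" group; the year repeated in the
# file name is matched but left unnamed, because a group name can only
# be defined once in a pattern.
nested_regex = r"(?P<year>\d{4})/yellow_tripdata_sample_\d{4}-(?P<month>\d{2})\.csv"

captured = []
for path in relative_paths:
    m = re.fullmatch(nested_regex, path)
    if m:
        captured.append(m.groupdict())

print(captured)
```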

Add a Data Asset using the `batching_regex`

Now that we have put together a regular expression that will match one or more of the files in our Data Source's base_directory, we can use it to create our Data Asset. Since the files in this particular Data Source's base_directory are csv files, we will use the add_csv_asset(...) method of our Data Source to create the new Data Asset:

my_asset = my_datasource.add_csv_asset(
    name="my_taxi_data_asset", batching_regex=my_batching_regex
)

What if I don't provide a batching_regex?

If you choose to omit the batching_regex parameter, your Data Asset will automatically use the regular expression ".*" to match all files.

Add Batch Sorters to the Data Asset (Optional)

We will now add a Batch Sorter to our Data Asset. This will allow us to explicitly state the order in which our Batches are returned when we request data from the Data Asset. To do this, we will pass a list of sorters to the add_sorters(...) method of our Data Asset.

The items in our list of sorters correspond to the names of the groups in our batching_regex that we want to sort our Batches on. Each name is prefixed with a + or a - depending on whether we want to sort our Batches in ascending or descending order based on the given group.

When there are multiple named groups, we can include multiple items in our sorter list and our Batches will be returned in the order specified by the list: sorted first according to the first item, then the second, and so forth.

In this example we have two named groups, "year" and "month", so our list of sorters can have up to two elements. We will add an ascending sorter based on the contents of the regex group "year" and a descending sorter based on the contents of the regex group "month":

my_asset = my_asset.add_sorters(["+year", "-month"])
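To make the resulting order concrete, the following plain-Python sketch imitates the sorter list ["+year", "-month"] on a list of file names. It is not GX's implementation: the sort_by_groups helper is hypothetical, group values are compared as strings, and two file names are added as hypothetical data so that some years contain more than one month:

```python
import re

pattern = re.compile(
    r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
)

file_names = [
    "yellow_tripdata_sample_2019-03.csv",
    "yellow_tripdata_sample_2020-07.csv",
    "yellow_tripdata_sample_2021-02.csv",
    "yellow_tripdata_sample_2019-11.csv",  # hypothetical extra file
    "yellow_tripdata_sample_2020-01.csv",  # hypothetical extra file
]


def sort_by_groups(names, sorters):
    """Hypothetical helper: order names by named-group values, first sorter most significant."""
    records = [(name, pattern.fullmatch(name).groupdict()) for name in names]
    # list.sort is stable, so applying the least significant sorter first
    # and the most significant sorter last yields a multi-key ordering.
    for sorter in reversed(sorters):
        descending = sorter.startswith("-")
        group_name = sorter.lstrip("+-")
        records.sort(key=lambda record: record[1][group_name], reverse=descending)
    return [name for name, _ in records]


ordered = sort_by_groups(file_names, ["+year", "-month"])
for name in ordered:
    print(name)
```

In this sketch the years appear in ascending order, and within each year the months appear in descending order.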

Use a Batch Request to verify the Data Asset works as desired

To verify that our Data Asset will return the desired files as Batches, we will define a quick Batch Request that includes all the Batches available in the Data Asset. Then we will use that Batch Request to get a list of the returned Batches.

my_batch_request = my_asset.build_batch_request()
batches = my_asset.get_batch_list_from_batch_request(my_batch_request)

Because a Batch List contains a lot of metadata, it will be easiest to verify which files were included in the returned Batches if we only look at the batch_spec of each returned Batch:

for batch in batches:
    print(batch.batch_spec)
