Version: 0.16.16

How to organize Batches in a file-based Data Asset

In this guide we will demonstrate the ways in which Batches can be organized in a file-based Data Asset. We will discuss how to use a regular expression to indicate which files should be returned as Batches. We will also show how to add Batch Sorters to a Data Asset in order to specify the order in which Batches are returned.

Prerequisites

  • A working installation of Great Expectations
  • A Datasource that connects to a location with source data files

If you still need to set up and install GX...

If you still need to connect a Datasource to the location of your source data files...

Datasources defined with the block-config method

If you are using a Datasource that was created with the advanced block-config method please follow the appropriate guide from:

Steps

1. Import GX and instantiate a Data Context

The code to import Great Expectations and instantiate a Data Context is:

import great_expectations as gx

context = gx.get_context()

2. Retrieve a file-based Datasource

For this guide, we will use a previously defined Datasource named "my_datasource". For the purposes of our demonstration, this Datasource is a Pandas Filesystem Datasource that uses a folder named "data" as its base_directory.

To retrieve this Datasource, we will supply the get_datasource(...) method of our Data Context with the name of the Datasource we wish to retrieve:

my_datasource = context.get_datasource("my_datasource")

3. Create a batching_regex

In a file-based Data Asset, any file that matches a provided regular expression (the batching_regex parameter) will be included as a Batch in the Data Asset. Therefore, to organize multiple files into Batches in a single Data Asset we must define a regular expression that will match one or more of our source data files.

For this example, our Datasource points to a folder that contains the following files:

  • "yellow_tripdata_sample_2019-03.csv"
  • "yellow_tripdata_sample_2020-07.csv"
  • "yellow_tripdata_sample_2021-02.csv"

To create a batching_regex that matches multiple files, we will include a named group in our regular expression:

my_batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"

In the above example, the named group "year" will match any four numeric characters in a file name, and the named group "month" will match any two numeric characters. As a result, each of our source data files matches the regular expression.
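
To see what the named groups capture, the pattern can be checked outside of GX with Python's standard re module. This is only a sketch: it assumes that matching a file name behaves like re.match, which may not be exactly how GX applies the batching_regex internally.

```python
import re

# The pattern from this guide: two named groups, "year" and "month".
my_batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"

file_names = [
    "yellow_tripdata_sample_2019-03.csv",
    "yellow_tripdata_sample_2020-07.csv",
    "yellow_tripdata_sample_2021-02.csv",
]

pattern = re.compile(my_batching_regex)

# Map each matching file name to the values captured by its named groups.
matches = {}
for file_name in file_names:
    match = pattern.match(file_name)
    if match is not None:
        matches[file_name] = match.groupdict()

for file_name, groups in matches.items():
    print(file_name, groups)
```

Each of the three file names matches, and each match carries its own "year" and "month" values.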

Why use named groups?

By naming the group in your batching_regex you make it something you can reference in the future. When requesting data from this Data Asset, you can use the names of your regular expression groups to limit the Batches that are returned.

For more information, please see: How to request data from a Data Asset
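
As a rough illustration of the idea, the sketch below uses only Python's re module to narrow a list of file names by the value of a named group. The helper select_files is hypothetical and is not part of GX. It only mirrors the concept of limiting Batches by group name.

```python
import re

my_batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"

file_names = [
    "yellow_tripdata_sample_2019-03.csv",
    "yellow_tripdata_sample_2020-07.csv",
    "yellow_tripdata_sample_2021-02.csv",
]


def select_files(file_names, regex, **wanted):
    """Hypothetical helper: keep file names whose named groups equal `wanted`."""
    pattern = re.compile(regex)
    selected = []
    for file_name in file_names:
        match = pattern.match(file_name)
        if match is None:
            continue  # the file does not belong to the asset at all
        groups = match.groupdict()
        if all(groups.get(name) == value for name, value in wanted.items()):
            selected.append(file_name)
    return selected


# Only files whose "year" group captured "2020" are kept.
only_2020 = select_files(file_names, my_batching_regex, year="2020")
print(only_2020)
```

Without named groups there would be no stable name such as "year" to filter on, which is the practical reason to name them.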

What if my source data files are split into different folders?

You can access files that are nested in folders under your Datasource's base_directory!

If your source data files are split into multiple folders, you can use the folder that contains those folders as your base_directory. When you define a Data Asset for your Datasource, you can then include the folder path (relative to your base_directory) in the regular expression that indicates which files to connect to.
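
For example, suppose the files were stored in one folder per year. This layout is hypothetical and is not the one used in the rest of this guide. A regular expression that includes the folder name could then look like the sketch below, which assumes that paths are written relative to the base_directory with forward slashes.

```python
import re

# Hypothetical pattern: the folder name supplies the "year" group.
nested_batching_regex = (
    r"(?P<year>\d{4})/yellow_tripdata_sample_\d{4}-(?P<month>\d{2})\.csv"
)

# Hypothetical paths, relative to the base_directory.
relative_paths = [
    "2019/yellow_tripdata_sample_2019-03.csv",
    "2020/yellow_tripdata_sample_2020-07.csv",
    "2021/yellow_tripdata_sample_2021-02.csv",
    "yellow_tripdata_sample_2021-02.csv",  # not inside a year folder
]

pattern = re.compile(nested_batching_regex)

# Only paths that start with a four-digit folder name match.
matched_paths = [path for path in relative_paths if pattern.match(path)]
print(matched_paths)
```

The last path is skipped because it has no leading folder for the "year" group to match.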

For more information on how to format regular expressions, we recommend referencing Python's official how-to guide for working with regular expressions.

4. Add a Data Asset using the batching_regex

Now that we have put together a regular expression that will match one or more of the files in our Datasource's base_directory, we can use it to create our Data Asset. Since the files in this particular Datasource's base_directory are csv files, we will use the add_csv_asset(...) method of our Datasource to create the new Data Asset:

my_asset = my_datasource.add_csv_asset(
    name="my_taxi_data_asset", batching_regex=my_batching_regex
)

What if I don't provide a batching_regex?

If you choose to omit the batching_regex parameter, your Data Asset will automatically use the regular expression ".*" to match all files.
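
One consequence worth keeping in mind is that ".*" matches any string at all. The sketch below checks this with Python's re module. The README.txt entry is a hypothetical file added for illustration, and the sketch assumes matching behaves like re.match.

```python
import re

default_regex = r".*"

folder_contents = [
    "yellow_tripdata_sample_2019-03.csv",
    "README.txt",  # hypothetical non-CSV file
]

# Every name matches ".*", including the non-CSV one.
matched = [name for name in folder_contents if re.match(default_regex, name)]
print(matched)
```

If a folder holds files that are not source data, an explicit batching_regex is likely the safer choice.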

5. (Optional) Add Batch Sorters to the Data Asset

We will now add a Batch Sorter to our Data Asset. This will allow us to explicitly state the order in which our Batches are returned when we request data from the Data Asset. To do this, we will pass a list of sorters to the add_sorters(...) method of our Data Asset.

The items in our list of sorters will correspond to the names of the groups in our batching_regex that we want to sort our Batches on. Each name is prefixed with a + or a - depending on whether we want to sort our Batches in ascending or descending order based on the given group.

When there are multiple named groups, we can include multiple items in our sorter list and our Batches will be returned in the order specified by the list: sorted first according to the first item, then the second, and so forth.

In this example we have two named groups, "year" and "month", so our list of sorters can have up to two elements. We will add an ascending sorter based on the contents of the regex group "year" and a descending sorter based on the contents of the regex group "month":

my_asset = my_asset.add_sorters(["+year", "-month"])
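
The ordering that such a sorter list describes can be sketched in plain Python. This is not GX's implementation. The helper sort_file_names is hypothetical, and two extra 2020 file names are invented here so that the descending "month" sorter has something to reorder.

```python
import re

my_batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"

file_names = [
    "yellow_tripdata_sample_2019-03.csv",
    "yellow_tripdata_sample_2020-07.csv",
    "yellow_tripdata_sample_2021-02.csv",
    "yellow_tripdata_sample_2020-01.csv",  # hypothetical extra file
    "yellow_tripdata_sample_2020-12.csv",  # hypothetical extra file
]


def sort_file_names(file_names, regex, sorters):
    """Hypothetical helper: order file names by "+group" / "-group" sorters."""
    pattern = re.compile(regex)
    matched = [(name, pattern.match(name)) for name in file_names]
    ordered = [(name, m.groupdict()) for name, m in matched if m is not None]
    # Python's sort is stable, so applying the sorters from last to first
    # leaves the list ordered by the first sorter, then the second, and so on.
    for sorter in reversed(sorters):
        descending = sorter.startswith("-")
        group_name = sorter.lstrip("+-")
        # Zero-padded digit strings compare correctly as plain strings.
        ordered.sort(key=lambda item: item[1][group_name], reverse=descending)
    return [name for name, _ in ordered]


sorted_names = sort_file_names(file_names, my_batching_regex, ["+year", "-month"])
for name in sorted_names:
    print(name)
```

With these inputs the years appear in ascending order, and within 2020 the months run 12, 07, 01.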

6. Use a Batch Request to verify the Data Asset works as desired

To verify that our Data Asset will return the desired files as Batches, we will define a quick Batch Request that includes all the Batches available in the Data Asset. Then we will use that Batch Request to get a list of the returned Batches.

my_batch_request = my_asset.build_batch_request()
batches = my_asset.get_batch_list_from_batch_request(my_batch_request)

Because a Batch List contains a lot of metadata, it will be easiest to verify which files were included in the returned Batches if we only look at the batch_spec of each returned Batch:

for batch in batches:
    print(batch.batch_spec)

Next steps

Now that you have further configured a file-based Data Asset, you may want to look into:

Requesting Data from a Data Asset

Using Data Assets to create Expectations