Skip to main content
Version: 0.16.16

How to organize Batches in a SQL-based Data Asset

In this guide we will demonstrate the ways in which Batches can be organized in a SQL-based Data Asset. We will discuss how to use Splitters to divide the data in a table or query based on the contents of a provided field. We will also show how to add Batch Sorters to a Data Asset in order to specify the order in which Batches are returned.

Prerequisites

  • A working installation of Great Expectations
  • A Data Asset in a SQL-based Datasource
Datasources defined with the block-config method

If you're using a Datasource that was created with the advanced block-config method, see How to configure a SQL Datasource with the block-config method.

Steps

1. Import GX and instantiate a Data Context

The code to import Great Expectations and instantiate a Data Context is:

import great_expectations as gx

context = gx.get_context()

2. Retrieve a SQL Datasource and Data Asset

For this guide, we will use a previously defined SQL Datasource named "my_datasource" with a Table Data Asset called "my_asset" which points to a table with taxi data.

To retrieve this Datasource, we will supply the get_datasource(...) method of our Data Context with the name of the Datasource we wish to retrieve:

my_datasource = context.get_datasource("my_datasource")
my_table_asset = my_datasource.get_asset(asset_name="version-0.16.16 my_table_asset")

3. Add a Splitter to the Data Asset

Our table has a datetime column called "pickup_datetime" which we will use to split our TableAsset into Batches.

my_table_asset.add_splitter_year_and_month(column_name="version-0.16.16 pickup_datetime")

4. (Optional) Add Batch Sorters to the Data Asset

We will now add Batch Sorters to our Data Asset. This will allow us to explicitly state the order in which our Batches are returned when we request data from the Data Asset. To do this, we will pass a list of sorters to the add_sorters(...) method of our Data Asset.

In this example we split "pickup_datetime" column on "year" and "month", so our list of sorters can have up to two elements. We will add an ascending sorter based on the contents of the splitter group "year" and a descending sorter based on the contents of the splitter group "month":

my_asset = my_table_asset.add_sorters(["+year", "-month"])

5. Use a Batch Request to verify the Data Asset works as desired

To verify that our Data Asset will return the desired files as Batches, we will define a quick Batch Request that will include all the Batches available in the Data asset. Then we will use that Batch Request to get a list of the returned Batches.

my_batch_request = my_table_asset.build_batch_request()
batches = my_table_asset.get_batch_list_from_batch_request(my_batch_request)

Because a Batch List contains a lot of metadata, it will be easiest to verify which files were included in the returned Batches if we only look at the batch_spec of each returned Batch:

for batch in batches:
print(batch.batch_spec)

Next steps

Now that you have further configured a file-based Data Asset, you may want to look into:

Requesting Data from a Data Asset

Using Data Assets to create Expectations