How to organize Batches in a SQL-based Data Asset
In this guide we will demonstrate the ways in which Batches can be organized in a SQL-based Data Asset. We will discuss how to use Splitters to divide the data in a table or query based on the contents of a provided field. We will also show how to add Batch Sorters to a Data Asset in order to specify the order in which Batches are returned.
Prerequisites
- A working installation of Great Expectations
- A Data Asset in a SQL-based Datasource
If you're using a Datasource that was created with the advanced block-config method, see How to configure a SQL Datasource with the block-config method.
Steps
1. Import GX and instantiate a Data Context
The code to import Great Expectations and instantiate a Data Context is:
import great_expectations as gx
context = gx.get_context()
2. Retrieve a SQL Datasource and Data Asset
For this guide, we will use a previously defined SQL
Datasource named
"my_datasource"
with a Table
Data Asset called
"my_asset"
which points to a
table with taxi data.
To retrieve this Datasource, we will supply the
get_datasource(...)
method of our Data
Context with the name of the Datasource we wish to
retrieve:
my_datasource = context.get_datasource("my_datasource")
my_table_asset = my_datasource.get_asset(asset_name="version-0.16.16 my_table_asset")
3. Add a Splitter to the Data Asset
Our table has a datetime column called
"pickup_datetime
" which we will
use to split our TableAsset into Batches.
my_table_asset.add_splitter_year_and_month(column_name="version-0.16.16 pickup_datetime")
4. (Optional) Add Batch Sorters to the Data Asset
We will now add Batch Sorters to our Data Asset. This
will allow us to explicitly state the order in which
our Batches are returned when we request data from the
Data Asset. To do this, we will pass a list of sorters
to the add_sorters(...)
method of our
Data Asset.
In this example we split
"pickup_datetime"
column on
"year"
and
"month"
, so our list of sorters
can have up to two elements. We will add an ascending
sorter based on the contents of the splitter group
"year"
and a descending sorter
based on the contents of the splitter group
"month"
:
my_asset = my_table_asset.add_sorters(["+year", "-month"])
5. Use a Batch Request to verify the Data Asset works as desired
To verify that our Data Asset will return the desired files as Batches, we will define a quick Batch Request that will include all the Batches available in the Data asset. Then we will use that Batch Request to get a list of the returned Batches.
my_batch_request = my_table_asset.build_batch_request()
batches = my_table_asset.get_batch_list_from_batch_request(my_batch_request)
Because a Batch List contains a lot of metadata, it
will be easiest to verify which files were included in
the returned Batches if we only look at the
batch_spec
of each returned Batch:
for batch in batches:
print(batch.batch_spec)
Next steps
Now that you have further configured a file-based Data Asset, you may want to look into: