How to organize Batches in a file-based Data Asset
In this guide we will demonstrate the ways in which Batches can be organized in a file-based Data Asset. We will discuss how to use a regular expression to indicate which files should be returned as Batches. We will also show how to add Batch Sorters to a Data Asset in order to specify the order in which Batches are returned.
Prerequisites
- A working installation of Great Expectations
- A Datasource that connects to a location with source data files

If you still need to set up and install GX, please reference the appropriate setup guide.

If you still need to connect a Datasource to the location of your source data files, please reference the appropriate guide for your environment:
- Local Filesystems
- Google Cloud Storage
- Azure Blob Storage
  - How to connect to data in Azure Blob Storage using Pandas
  - How to connect to data in Azure Blob Storage using Spark
- Amazon Web Services S3

If you are using a Datasource that was created with the advanced block-config method, please follow the appropriate guide for that method instead.
Steps
1. Import GX and instantiate a Data Context
The code to import Great Expectations and instantiate a Data Context is:
import great_expectations as gx
context = gx.get_context()
2. Retrieve a file-based Datasource
For this guide, we will use a previously defined Datasource named "my_datasource". For purposes of our demonstration, this Datasource is a Pandas Filesystem Datasource that uses a folder named "data" as its base_directory.

To retrieve this Datasource, we will supply the get_datasource(...) method of our Data Context with the name of the Datasource we wish to retrieve:

my_datasource = context.get_datasource("my_datasource")
3. Create a batching_regex
In a file-based Data Asset, any file that matches a provided regular expression (the batching_regex parameter) will be included as a Batch in the Data Asset. Therefore, to organize multiple files into Batches in a single Data Asset, we must define a regular expression that matches one or more of our source data files.
For this example, our Datasource points to a folder that contains the following files:
- "yellow_tripdata_sample_2019-03.csv"
- "yellow_tripdata_sample_2020-07.csv"
- "yellow_tripdata_sample_2021-02.csv"
To create a batching_regex that matches multiple files, we will include named groups in our regular expression:

my_batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
In the above example, the named group "year" will match any four numeric characters in a file name, and the named group "month" will match any two. This will result in each of our source data files matching the regular expression.
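As a quick standalone check (this uses only Python's built-in re module and does not require GX), we can verify that the pattern captures both named groups for each of the sample files listed above:

```python
import re

# The same pattern used for the batching_regex above.
pattern = re.compile(r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv")

filenames = [
    "yellow_tripdata_sample_2019-03.csv",
    "yellow_tripdata_sample_2020-07.csv",
    "yellow_tripdata_sample_2021-02.csv",
]

for name in filenames:
    match = pattern.match(name)
    # Every sample file matches, and the named groups expose its year and month.
    print(name, "->", match.group("year"), match.group("month"))
```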
By naming the groups in your batching_regex, you make them something you can reference in the future. When requesting data from this Data Asset, you can use the names of your regular expression groups to limit the Batches that are returned.

For more information, please see: How to request data from a Data Asset
You can access files that are nested in folders under your Datasource's base_directory!

If your source data files are split into multiple folders, you can use the folder that contains those folders as your base_directory. When you define a Data Asset for your Datasource, you can then include the folder path (relative to your base_directory) in the regular expression that indicates which files to connect to.
For more information on how to format regular expressions, we recommend referencing Python's official how-to guide for working with regular expressions.
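For instance, assuming a hypothetical layout in which the files are split into one folder per year under the base_directory (e.g. "2019/yellow_tripdata_sample_2019-03.csv"), the relative folder path is simply included at the start of the pattern. Again, this sketch uses only Python's standard re module:

```python
import re

# Hypothetical layout: one folder per year under base_directory.
# The folder path (relative to base_directory) is part of the pattern.
nested_pattern = re.compile(
    r"(?P<year>\d{4})/yellow_tripdata_sample_\d{4}-(?P<month>\d{2})\.csv"
)

m = nested_pattern.match("2019/yellow_tripdata_sample_2019-03.csv")
print(m.group("year"), m.group("month"))
```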
4. Add a Data Asset using the batching_regex
Now that we have put together a regular expression that will match one or more of the files in our Datasource's base_directory, we can use it to create our Data Asset. Since the files in this particular Datasource's base_directory are csv files, we will use the add_csv_asset(...) method of our Datasource to create the new Data Asset:

my_asset = my_datasource.add_csv_asset(
    name="my_taxi_data_asset", batching_regex=my_batching_regex
)
What if you don't provide a batching_regex?

If you choose to omit the batching_regex parameter, your Data Asset will automatically use the regular expression ".*" to match all files.
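This fallback behavior can be sketched with the standard re module alone: the pattern ".*" matches every file name, so every file under the base_directory would become a Batch:

```python
import re

# The default pattern used when no batching_regex is supplied.
default_pattern = re.compile(".*")

filenames = [
    "yellow_tripdata_sample_2019-03.csv",
    "some_other_file.txt",
]

# ".*" matches any file name, so no file is excluded.
print(all(default_pattern.match(name) for name in filenames))
```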
5. (Optional) Add Batch Sorters to the Data Asset
We will now add a Batch Sorter to our Data Asset. This will allow us to explicitly state the order in which our Batches are returned when we request data from the Data Asset. To do this, we will pass a list of sorters to the add_sorters(...) method of our Data Asset.

The items in our list of sorters will correspond to the names of the groups in our batching_regex that we want to sort our Batches on. The names are prefixed with a + or a - depending on whether we want to sort our Batches in ascending or descending order based on the given group.
When there are multiple named groups, we can include multiple items in our sorter list and our Batches will be returned in the order specified by the list: sorted first according to the first item, then the second, and so forth.
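The sorter semantics can be sketched in plain Python (standard library only, no GX required): "+year" followed by "-month" is equivalent to sorting ascending by year and, within each year, descending by month. The (year, month) tuples below are illustrative stand-ins for the regex-group values of each Batch:

```python
# Each tuple stands in for one Batch's regex-group values: (year, month).
batch_options = [("2019", "03"), ("2020", "07"), ("2019", "12"), ("2021", "02")]

# ["+year", "-month"]: ascending year, then descending month within each year.
ordered = sorted(batch_options, key=lambda b: (int(b[0]), -int(b[1])))
print(ordered)
```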
In this example we have two named groups, "year" and "month", so our list of sorters can have up to two elements. We will add an ascending sorter based on the contents of the regex group "year" and a descending sorter based on the contents of the regex group "month":

my_asset = my_asset.add_sorters(["+year", "-month"])
6. Use a Batch Request to verify the Data Asset works as desired
To verify that our Data Asset will return the desired files as Batches, we will define a quick Batch Request that will include all the Batches available in the Data Asset. Then we will use that Batch Request to get a list of the returned Batches.
my_batch_request = my_asset.build_batch_request()
batches = my_asset.get_batch_list_from_batch_request(my_batch_request)
Because a Batch List contains a lot of metadata, it will be easiest to verify which files were included in the returned Batches if we only look at the batch_spec of each returned Batch:
for batch in batches:
    print(batch.batch_spec)
Next steps
Now that you have further configured a file-based Data Asset, you may want to look into: