# How to connect to one or more files using Pandas

In this guide we will demonstrate how to use Pandas to connect to data stored in files on a filesystem. In this example we will specifically be connecting to data in `.csv` format. However, GX supports most read methods available through Pandas.
## Prerequisites

- A Great Expectations instance. See Install Great Expectations locally.
- A Data Context.
- Access to source data stored in a filesystem.
## Steps

### 1. Import the Great Expectations module and instantiate a Data Context

The code to import Great Expectations and instantiate a Data Context is:

```python
import great_expectations as gx

context = gx.get_context()
```
### 2. Create a Datasource

A Filesystem Datasource can be created with two pieces of information:

- `name`: The name by which the Datasource will be referenced in the future
- `base_directory`: The path to the folder containing the files the Datasource will connect to

In our example, we will define these in advance by storing them in the Python variables `datasource_name` and `path_to_folder_containing_csv_files`:

```python
datasource_name = "my_new_datasource"
path_to_folder_containing_csv_files = "<insert_path_to_files_here>"
```
> **Using a relative path as the `base_directory` of a Filesystem Datasource**
>
> If you are using a Filesystem Data Context, you can provide a path for `base_directory` that is relative to the folder containing your Data Context. However, an in-memory Ephemeral Data Context doesn't exist in a folder. Therefore, when using an Ephemeral Data Context, relative paths will instead be resolved relative to the folder your Python code is being executed in.
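As a quick illustration of the second case: with an Ephemeral Data Context, a relative `base_directory` resolves against the current working directory of the Python process, not a Data Context folder. A minimal sketch using only the standard library (the `data/taxi_csvs` folder name is hypothetical):

```python
from pathlib import Path

# A relative base_directory such as "data/taxi_csvs"...
relative_base = Path("data/taxi_csvs")

# ...resolves against the folder the Python process is running in,
# i.e. the current working directory.
resolved = Path.cwd() / relative_base

assert not relative_base.is_absolute()
assert resolved.is_absolute()
```

If your code may run from different working directories, passing an absolute path avoids this ambiguity entirely.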
Once we have determined our `name` and `base_directory`, we pass them in as parameters when we create our Datasource:

```python
datasource = context.sources.add_pandas_filesystem(
    name=datasource_name, base_directory=path_to_folder_containing_csv_files
)
```
> **You can access files that are nested in folders under your Datasource's `base_directory`!**
>
> If your source data files are split into multiple folders, you can use the folder that contains those folders as your `base_directory`. When you define a Data Asset for your Datasource, you can then include the folder path (relative to your `base_directory`) in the regular expression that indicates which files to connect to.
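For instance, a pattern that includes a folder segment could look like the following. This is a standard-library sketch of the matching behavior, not GX API code; the folder layout and file names are hypothetical:

```python
import re

# Suppose base_directory contains one folder per year, each holding monthly files.
files = [
    "2020/yellow_tripdata_sample_2020-07.csv",
    "2021/yellow_tripdata_sample_2021-02.csv",
    "2021/notes.txt",
]

# The folder path (relative to base_directory) is simply part of the pattern.
batching_regex = re.compile(
    r"(?P<year>\d{4})/yellow_tripdata_sample_\d{4}-(?P<month>\d{2})\.csv"
)

matches = [m.groupdict() for f in files if (m := batching_regex.match(f))]
# matches == [{"year": "2020", "month": "07"}, {"year": "2021", "month": "02"}]
```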
### 3. Add a Data Asset to the Datasource

A Data Asset requires two pieces of information to be defined:

- `name`: The name by which you will reference the Data Asset (for when you have defined multiple Data Assets in the same Datasource)
- `batching_regex`: A regular expression that matches the files to be included in the Data Asset

> **What if the `batching_regex` matches multiple files?**
>
> Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become an individual Batch inside your Data Asset.
For example, let's say that you have a Filesystem Datasource pointing to a base folder that contains the following files:

- "yellow_tripdata_sample_2019-03.csv"
- "yellow_tripdata_sample_2020-07.csv"
- "yellow_tripdata_sample_2021-02.csv"

If you define a Data Asset using the full file name with no regex groups, such as `"yellow_tripdata_sample_2019-03.csv"`, your Data Asset will contain only one Batch, which will correspond to that file.

However, if you define a partial file name with a regex group, such as `r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"`, your Data Asset can be organized ("partitioned") into Batches according to the two dimensions defined by the group names, `"year"` and `"month"`. When you send a Batch Request query featuring this Data Asset in the future, you can use these group names with their respective values as options to control which Batches will be returned. For example, you could return all Batches in the year of 2021, or the one Batch for July of 2020.
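The partitioning described above can be sketched with nothing but the standard library's `re` module; GX performs the equivalent matching internally when it builds Batches:

```python
import re

files = [
    "yellow_tripdata_sample_2019-03.csv",
    "yellow_tripdata_sample_2020-07.csv",
    "yellow_tripdata_sample_2021-02.csv",
]

batching_regex = re.compile(
    r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
)

# Each matching file becomes one Batch; the named groups become the
# options you can later use to select Batches.
batches = {f: batching_regex.match(f).groupdict() for f in files}

# Selecting "the one Batch for July of 2020" amounts to filtering on the groups:
july_2020 = [f for f, g in batches.items() if g == {"year": "2020", "month": "07"}]
# july_2020 == ["yellow_tripdata_sample_2020-07.csv"]
```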
For this example, we will define these two values in advance by storing them in the Python variables `asset_name` and (since we are connecting to NYC taxi data in this example) `batching_regex`:

```python
asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
```
Once we have determined those two values, we will pass them in as parameters when we create our Data Asset:

```python
datasource.add_csv_asset(name=asset_name, batching_regex=batching_regex)
```
In this example, we are connecting to a `.csv` file. However, Great Expectations supports connecting to most types of files that Pandas has `read_*` methods for.

Because you will be using Pandas to connect to these files, the specific `add_*_asset` methods that will be available to you will be determined by your currently installed version of Pandas.

For more information on which Pandas `read_*` methods are available to you as `add_*_asset` methods, please reference the official Pandas Input/Output documentation for the version of Pandas that you have installed.

In the GX Python API, `add_*_asset` methods will require the same parameters as the corresponding Pandas `read_*` method, with one caveat: in Great Expectations, you will also be required to provide a value for a `name` parameter that identifies the Data Asset.
### 4. Repeat step 3 as needed to add additional files as Data Assets

Your Datasource can contain multiple Data Assets. If you have additional files to connect to, you can provide different `name` and `batching_regex` parameters to create additional Data Assets for those files in your Datasource. You can even include the same files in multiple Data Assets, if a given file matches the `batching_regex` of more than one Data Asset.
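As an illustration of that overlap, one file can satisfy the `batching_regex` of two different Data Assets. The sketch below uses only the standard library; the second, year-scoped pattern is a hypothetical example, not one from this guide:

```python
import re

file_name = "yellow_tripdata_sample_2021-02.csv"

# One asset partitioned by year and month...
monthly_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
# ...and a second, hypothetical asset covering only 2021 files.
year_2021_regex = r"yellow_tripdata_sample_2021-\d{2}\.csv"

# The same file matches both patterns, so it would become a Batch
# in both Data Assets.
in_both = bool(re.match(monthly_regex, file_name)) and bool(
    re.match(year_2021_regex, file_name)
)
# in_both == True
```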
## Next steps

- How to organize Batches in a file-based Data Asset
- How to request Data from a Data Asset
- How to create Expectations while interactively evaluating a set of data
- How to use the Onboarding Data Assistant to evaluate data and create Expectations
## Additional information

### External APIs

For more information on Pandas `read_*` methods, please reference the official Pandas Input/Output documentation.