# How to connect to one or more files using Pandas

In this guide we will demonstrate how to use Pandas to connect to data stored in files on a filesystem. In this example we will specifically be connecting to data in `.csv` format. However, GX supports most read methods available through Pandas.
## Prerequisites

- A Great Expectations instance. See Install Great Expectations locally.
- A Data Context.
- Access to source data stored in a filesystem.
## Steps

### 1. Import the Great Expectations module and instantiate a Data Context

The code to import Great Expectations and instantiate a Data Context is:

```python
import great_expectations as gx

context = gx.get_context()
```
### 2. Create a Datasource

A Filesystem Datasource can be created with two pieces of information:

- `name`: The name by which the Datasource will be referenced in the future
- `base_directory`: The path to the folder containing the files the Datasource will connect to

In our example, we will define these in advance by storing them in the Python variables `datasource_name` and `path_to_folder_containing_csv_files`:

```python
datasource_name = "my_new_datasource"
path_to_folder_containing_csv_files = "<insert_path_to_files_here>"
```
> **Using a relative path as the `base_directory` of a Filesystem Datasource**
>
> If you are using a Filesystem Data Context, you can provide a path for `base_directory` that is relative to the folder containing your Data Context. However, an in-memory Ephemeral Data Context doesn't exist in a folder. Therefore, when using an Ephemeral Data Context, relative paths will instead be resolved relative to the folder your Python code is being executed in.
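As a quick illustration of the second case: with an Ephemeral Data Context, a relative `base_directory` resolves against the current working directory of the Python process, not a Data Context folder. A minimal sketch using only the standard library (the `data/taxi_csvs` folder name is hypothetical):

```python
from pathlib import Path

# A relative base_directory such as "data/taxi_csvs"...
relative_base = Path("data/taxi_csvs")

# ...resolves against the folder the Python process is running in,
# i.e. the current working directory.
resolved = Path.cwd() / relative_base

assert not relative_base.is_absolute()
assert resolved.is_absolute()
```

If your code may run from different working directories, passing an absolute path avoids this ambiguity entirely.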
Once we have determined our `name` and `base_directory`, we pass them in as parameters when we create our Datasource:

```python
datasource = context.sources.add_pandas_filesystem(
    name=datasource_name, base_directory=path_to_folder_containing_csv_files
)
```
> **You can access files that are nested in folders under your Datasource's `base_directory`!**
>
> If your source data files are split into multiple folders, you can use the folder that contains those folders as your `base_directory`. When you define a Data Asset for your Datasource, you can then include the folder path (relative to your `base_directory`) in the regular expression that indicates which files to connect to.
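For instance, a pattern that includes a folder segment could look like the following. This is a standard-library sketch of the matching behavior, not GX API code; the folder layout and file names are hypothetical:

```python
import re

# Suppose base_directory contains one folder per year, each holding monthly files.
files = [
    "2020/yellow_tripdata_sample_2020-07.csv",
    "2021/yellow_tripdata_sample_2021-02.csv",
    "2021/notes.txt",
]

# The folder path (relative to base_directory) is simply part of the pattern.
batching_regex = re.compile(
    r"(?P<year>\d{4})/yellow_tripdata_sample_\d{4}-(?P<month>\d{2})\.csv"
)

matches = [m.groupdict() for f in files if (m := batching_regex.match(f))]
# matches == [{"year": "2020", "month": "07"}, {"year": "2021", "month": "02"}]
```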
### 3. Add a Data Asset to the Datasource

A Data Asset requires two pieces of information to be defined:

- `name`: The name by which you will reference the Data Asset (for when you have defined multiple Data Assets in the same Datasource)
- `batching_regex`: A regular expression that matches the files to be included in the Data Asset

> **What if the `batching_regex` matches multiple files?**
>
> Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become an individual Batch inside your Data Asset.
For example, let's say that you have a Filesystem Datasource pointing to a base folder that contains the following files:

- "yellow_tripdata_sample_2019-03.csv"
- "yellow_tripdata_sample_2020-07.csv"
- "yellow_tripdata_sample_2021-02.csv"

If you define a Data Asset using the full file name with no regex groups, such as `"yellow_tripdata_sample_2019-03.csv"`, your Data Asset will contain only one Batch, which will correspond to that file.

However, if you define a partial file name with a regex group, such as `r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"`, your Data Asset can be organized ("partitioned") into Batches according to the two dimensions defined by the group names, `"year"` and `"month"`. When you send a Batch Request query featuring this Data Asset in the future, you can use these group names with their respective values as options to control which Batches will be returned. For example, you could return all Batches in the year of 2021, or the one Batch for July of 2020.
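The partitioning described above can be sketched with nothing but the standard library's `re` module; GX performs the equivalent matching internally when it builds Batches:

```python
import re

files = [
    "yellow_tripdata_sample_2019-03.csv",
    "yellow_tripdata_sample_2020-07.csv",
    "yellow_tripdata_sample_2021-02.csv",
]

batching_regex = re.compile(
    r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
)

# Each matching file becomes one Batch; the named groups become the
# options you can later use to select Batches.
batches = {f: batching_regex.match(f).groupdict() for f in files}

# Selecting "the one Batch for July of 2020" amounts to filtering on the groups:
july_2020 = [f for f, g in batches.items() if g == {"year": "2020", "month": "07"}]
# july_2020 == ["yellow_tripdata_sample_2020-07.csv"]
```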
For this example, we will define these two values in advance by storing them in the Python variables `asset_name` and (since we are connecting to NYC taxi data in this example) `batching_regex`:

```python
asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
```
Once we have determined those two values, we will pass them in as parameters when we create our Data Asset:

```python
datasource.add_csv_asset(name=asset_name, batching_regex=batching_regex)
```
In this example, we are connecting to a `.csv` file. However, Great Expectations supports connecting to most types of files that Pandas has `read_*` methods for.

Because you will be using Pandas to connect to these files, the specific `add_*_asset` methods that will be available to you will be determined by your currently installed version of Pandas.

For more information on which Pandas `read_*` methods are available to you as `add_*_asset` methods, please reference the official Pandas Input/Output documentation for the version of Pandas that you have installed.

In the GX Python API, `add_*_asset` methods will require the same parameters as the corresponding Pandas `read_*` method, with one caveat: in Great Expectations, you will also be required to provide a value for a `name` parameter that identifies the Data Asset.
### 4. Repeat step 3 as needed to add additional files as Data Assets

Your Datasource can contain multiple Data Assets. If you have additional files to connect to, you can provide different `name` and `batching_regex` parameters to create additional Data Assets for those files in your Datasource. You can even include the same files in multiple Data Assets, if a given file matches the `batching_regex` of more than one Data Asset.
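As an illustration of that overlap, one file can satisfy the `batching_regex` of two different Data Assets. The sketch below uses only the standard library; the second, year-scoped pattern is a hypothetical example, not one from this guide:

```python
import re

file_name = "yellow_tripdata_sample_2021-02.csv"

# One asset partitioned by year and month...
monthly_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
# ...and a second, hypothetical asset covering only 2021 files.
year_2021_regex = r"yellow_tripdata_sample_2021-\d{2}\.csv"

# The same file matches both patterns, so it would become a Batch
# in both Data Assets.
in_both = bool(re.match(monthly_regex, file_name)) and bool(
    re.match(year_2021_regex, file_name)
)
# in_both == True
```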
## Next steps

- How to organize Batches in a file-based Data Asset
- How to request Data from a Data Asset
- How to create Expectations while interactively evaluating a set of data
- How to use the Onboarding Data Assistant to evaluate data and create Expectations
## Additional information

### External APIs

For more information on Pandas `read_*` methods, please reference the official Pandas Input/Output documentation.