Version: 0.16.16

How to connect to one or more files using Spark

In this guide we will demonstrate how to use Spark to connect to data stored in files on a filesystem. Specifically, we will be connecting to data in .csv format.

Prerequisites

Steps

1. Import the Great Expectations module and instantiate a Data Context

The code to import Great Expectations and instantiate a Data Context is:

import great_expectations as gx

context = gx.get_context()

2. Create a Datasource

A Filesystem Datasource can be created with two pieces of information:

  • name: The name by which the Datasource will be referenced in the future
  • base_directory: The path to the folder containing the files the Datasource will be used to connect to

In our example, we will define these in advance by storing them in the Python variables datasource_name and path_to_folder_containing_csv_files:

datasource_name = "my_new_datasource"
path_to_folder_containing_csv_files = "<insert_path_to_files_here>"

Using relative paths as the base_directory of a Filesystem Datasource

If you are using a Filesystem Data Context you can provide a path for base_directory that is relative to the folder containing your Data Context.

However, an in-memory Ephemeral Data Context doesn't exist in a folder. Therefore, when using an Ephemeral Data Context, relative paths will instead be resolved from the folder in which your Python code is being executed.
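
For instance, here is a minimal sketch of a relative base_directory. It assumes a Filesystem Data Context and a hypothetical data folder inside the folder that contains your Data Context (neither name comes from this guide):

# Hypothetical layout: a "data" folder inside the folder containing the Data Context.
# With a Filesystem Data Context this relative path is resolved from that folder;
# with an Ephemeral Data Context it would be resolved from the folder in which
# this Python code is executed.
datasource = context.sources.add_spark_filesystem(
    name="my_relative_path_datasource", base_directory="./data"
)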

Once we have determined our name and base_directory, we pass them in as parameters when we create our Datasource:

datasource = context.sources.add_spark_filesystem(
    name=datasource_name, base_directory=path_to_folder_containing_csv_files
)

What if my source data files are split into different folders?

You can access files that are nested in folders under your Datasource's base_directory!

If your source data files are split into multiple folders, you can use the folder that contains those folders as your base_directory. When you define a Data Asset for your Datasource, you can then include the folder path (relative to your base_directory) in the regular expression that indicates which files to connect to.
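
For instance, here is a hedged sketch of that approach, using the add_csv_asset method covered in step 3 below. The per-year subfolders and file names are hypothetical, not part of this guide:

# Hypothetical layout under base_directory:
#   2019/yellow_tripdata_sample_2019-03.csv
#   2020/yellow_tripdata_sample_2020-07.csv
# The subfolder path (relative to base_directory) is included in the regular expression.
datasource.add_csv_asset(
    name="my_nested_folder_asset",
    batching_regex=r"(?P<year>\d{4})/yellow_tripdata_sample_\d{4}-(?P<month>\d{2})\.csv",
    header=True,
    infer_schema=True,
)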

3. Add a Data Asset to the Datasource

A Data Asset requires two pieces of information to be defined:

  • name: The name by which you will reference the Data Asset (for when you have defined multiple Data Assets in the same Datasource)
  • batching_regex: A regular expression that matches the files to be included in the Data Asset

What if the batching_regex matches multiple files?

Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become an individual Batch inside your Data Asset.

For example:

Let's say that you have a filesystem Datasource pointing to a base folder that contains the following files:

  • "yellow_tripdata_sample_2019-03.csv"
  • "yellow_tripdata_sample_2020-07.csv"
  • "yellow_tripdata_sample_2021-02.csv"

If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2019-03.csv", your Data Asset will contain only one Batch, which will correspond to that file.

However, if you define a partial file name with regex groups, such as r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset can be organized ("partitioned") into Batches according to the two dimensions defined by the group names "year" and "month". When you later send a Batch Request featuring this Data Asset, you can use these group names with their respective values as options to control which Batches are returned. For example, you could return all Batches from 2021, or only the Batch for July of 2020.
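
As a brief sketch of how those group names can be used later (this assumes asset refers to a Data Asset created with the regex above, as we will do in this step):

# Passing regex group names as options controls which Batches a Batch Request returns.
all_of_2021 = asset.build_batch_request(options={"year": "2021"})
july_2020 = asset.build_batch_request(options={"year": "2020", "month": "07"})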

For this example, we will define these two values in advance by storing them in the Python variables asset_name and (since we are connecting to NYC taxi data) batching_regex:

asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"

In addition, the argument header informs the Spark DataFrame reader that the files contain a header row, while the argument infer_schema instructs the Spark DataFrame reader to make a best effort to determine the schema of the columns automatically.

Once we have determined those two values as well as the optional header and infer_schema arguments, we will pass them in as parameters when we create our Data Asset:

datasource.add_csv_asset(
    name=asset_name, batching_regex=batching_regex, header=True, infer_schema=True
)
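
As an optional sanity check (a sketch, not part of the original steps), you can confirm that your batching_regex matched files by listing the Batches in the new Data Asset:

# Retrieve the Data Asset we just added and count the Batches its regex matched.
asset = datasource.get_asset(asset_name)
batches = asset.get_batch_list_from_batch_request(asset.build_batch_request())
print(f"Found {len(batches)} Batch(es) in '{asset_name}'.")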

4. Repeat step 3 as needed to add additional files as Data Assets

Your Datasource can contain multiple Data Assets. If you have additional files to connect to, you can provide different name and batching_regex parameters to create additional Data Assets for those files in your Datasource. You can even include the same files in multiple Data Assets, if a given file matches the batching_regex of more than one Data Asset.
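
For instance, assuming your base_directory also contains green taxi files (a hypothetical addition, not part of this guide), a second Data Asset could be added like this:

# Hypothetical second Data Asset in the same Datasource; the file pattern is an assumption.
datasource.add_csv_asset(
    name="my_green_taxi_data_asset",
    batching_regex=r"green_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv",
    header=True,
    infer_schema=True,
)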

Next steps