How to connect to one or more files using Spark
In this guide we will demonstrate how to use Spark to connect to data stored in files on a filesystem. In this example we will specifically be connecting to data in .csv format.
Prerequisites
- A Great Expectations instance. See Install Great Expectations locally.
- A Data Context.
- Access to source data stored in a filesystem.
Steps
1. Import the Great Expectations module and instantiate a Data Context
The code to import Great Expectations and instantiate a Data Context is:
import great_expectations as gx
context = gx.get_context()
2. Create a Datasource
A Filesystem Datasource can be created with two pieces of information:
- name: The name by which the Datasource will be referenced in the future.
- base_directory: The path to the folder containing the files the Datasource will connect to.
In our example, we will define these in advance by storing them in the Python variables datasource_name and path_to_folder_containing_csv_files:
datasource_name = "my_new_datasource"
path_to_folder_containing_csv_files = "<insert_path_to_files_here>"
The base_directory of a Filesystem Datasource
If you are using a Filesystem Data Context, you can provide a path for base_directory that is relative to the folder containing your Data Context.
However, an in-memory Ephemeral Data Context does not exist in a folder. When using an Ephemeral Data Context, relative paths will instead be resolved from the folder in which your Python code is being executed.
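As a small, hypothetical illustration (the folder names below are placeholders, not paths from this guide), the two cases could look like this:
# Relative to the folder containing a Filesystem Data Context:
relative_base_directory = "../data/taxi_csv_files"

# Absolute path, which resolves the same way for any type of Data Context:
absolute_base_directory = "/full/path/to/data/taxi_csv_files"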
Once we have determined our name and base_directory, we pass them in as parameters when we create our Datasource:
datasource = context.sources.add_spark_filesystem(
    name=datasource_name, base_directory=path_to_folder_containing_csv_files
)
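Because the Datasource is registered under the name you provided, you should later be able to retrieve it from your Data Context by that name. A minimal sketch, using the variables from this guide:
# Retrieve the previously created Datasource from the Data Context by its name.
retrieved_datasource = context.get_datasource(datasource_name)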
You can access files that are nested in folders under your Datasource's base_directory!
If your source data files are split into multiple folders, you can use the folder that contains those folders as your base_directory. When you define a Data Asset for your Datasource, you can then include the folder path (relative to your base_directory) in the regular expression that indicates which files to connect to.
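As a hedged sketch of this pattern (the folder layout, Datasource name, and Data Asset name below are hypothetical), the subfolder simply appears in the regular expression, using the same add_csv_asset call that step 3 demonstrates:
# Hypothetical layout under the base_directory:
#   2019/yellow_tripdata_sample_2019-03.csv
#   2020/yellow_tripdata_sample_2020-07.csv
nested_datasource = context.sources.add_spark_filesystem(
    name="my_nested_datasource", base_directory="<insert_path_to_parent_folder_here>"
)
nested_datasource.add_csv_asset(
    name="my_nested_taxi_data_asset",
    # The subfolder path (relative to base_directory) is part of the regex.
    batching_regex=r"(?P<year>\d{4})/yellow_tripdata_sample_\d{4}-(?P<month>\d{2})\.csv",
    header=True,
    infer_schema=True,
)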
3. Add a Data Asset to the Datasource
A Data Asset requires two pieces of information to be defined:
- name: The name by which you will reference the Data Asset (for when you have defined multiple Data Assets in the same Datasource).
- batching_regex: A regular expression that matches the files to be included in the Data Asset.
What if the batching_regex matches multiple files?
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become an individual Batch inside your Data Asset.
For example:
Let's say that you have a filesystem Datasource pointing to a base folder that contains the following files:
- "yellow_tripdata_sample_2019-03.csv"
- "yellow_tripdata_sample_2020-07.csv"
- "yellow_tripdata_sample_2021-02.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2019-03.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset can be organized ("partitioned") into Batches according to the two dimensions defined by the group names, "year" and "month". When you send a Batch Request featuring this Data Asset in the future, you can use these group names, with their respective values, as options to control which Batches will be returned. For example, you could return all Batches from the year 2021, or the single Batch for July of 2020.
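As a hedged illustration of those options (this sketch assumes the Data Asset that will be created at the end of this step, and that build_batch_request accepts an options dictionary keyed by the regex group names):
# Retrieve the Data Asset created later in this step, then request specific Batches.
asset = context.get_datasource(datasource_name).get_asset(asset_name)

# All Batches from the year 2021:
batch_request_2021 = asset.build_batch_request(options={"year": "2021"})

# Only the Batch for July of 2020:
batch_request_2020_07 = asset.build_batch_request(options={"year": "2020", "month": "07"})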
For this example, we will define these two values in advance by storing them in the Python variables asset_name and (since we are connecting to NYC taxi data in this example) batching_regex:
asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
In addition, the argument header informs the Spark DataFrame reader that the files contain a header row, while the argument infer_schema instructs the Spark DataFrame reader to make a best effort to determine the schema of the columns automatically.
Once we have determined those two values, as well as the optional header and infer_schema arguments, we will pass them in as parameters when we create our Data Asset:
datasource.add_csv_asset(
    name=asset_name, batching_regex=batching_regex, header=True, infer_schema=True
)
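To spot-check that the new Data Asset can read your files, one possible follow-up (a sketch, assuming a Validator obtained from the Data Context behaves as expected for your data) is to build a Batch Request and preview a few rows:
# Build a Batch Request covering every file matched by the batching_regex.
asset = datasource.get_asset(asset_name)
batch_request = asset.build_batch_request()

# Get a Validator for the Batch Request and preview the most recent Batch.
validator = context.get_validator(batch_request=batch_request)
print(validator.head())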
4. Repeat step 3 as needed to add additional files as Data Assets
Your Datasource can contain multiple Data Assets. If you have additional files to connect to, you can provide different name and batching_regex parameters to create additional Data Assets for those files in your Datasource. You can even include the same files in multiple Data Assets, if a given file matches the batching_regex of more than one Data Asset.
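For example (a sketch with hypothetical file names and a hypothetical Asset name), if the same base_directory also contained green taxi trip files, a second Data Asset could be added alongside the first:
# Hypothetical additional files in the same base_directory:
#   green_tripdata_sample_2019-03.csv, green_tripdata_sample_2020-07.csv, ...
datasource.add_csv_asset(
    name="my_green_taxi_data_asset",
    batching_regex=r"green_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv",
    header=True,
    infer_schema=True,
)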