How to connect to data on GCS using Spark
In this guide we will demonstrate how to use Spark to connect to data stored on Google Cloud Storage (GCS). In our examples, we will specifically be connecting to .csv files.
Prerequisites
- An installation of GX set up to work with GCS
- Access to data on a GCS bucket
Steps
1. Import GX and instantiate a Data Context
The code to import Great Expectations and instantiate a Data Context is:
import great_expectations as gx
context = gx.get_context()
2. Create a Datasource
We can define a GCS Datasource by providing three pieces of information:
- `name`: In our example, we will name our Datasource "my_gcs_datasource"
- `bucket_or_name`: In this example, we will provide a GCS bucket name
- `gcs_options`: We can provide various additional options here, but in this example we will leave this empty and use the default values.
datasource_name = "my_gcs_datasource"
bucket_or_name = "my_bucket"
gcs_options = {}
Once we have those three elements, we can define our Datasource like so:
datasource = context.sources.add_spark_gcs(
    name=datasource_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)
3. Add GCS data to the Datasource as a Data Asset
asset_name = "my_taxi_data_asset"
gcs_prefix = "data/taxi_yellow_tripdata_samples/"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
data_asset = datasource.add_csv_asset(
    name=asset_name,
    batching_regex=batching_regex,
    gcs_prefix=gcs_prefix,
    header=True,
    infer_schema=True,
)
`header` and `infer_schema`
In the above example there are two parameters that are optional, depending on the structure of your file. If the file does not have a header line, the `header` parameter can be left out: it will default to `False`. Likewise, if you do not want GX to infer the schema of your file, you can leave off the `infer_schema` parameter; it will also default to `False`.
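The effect of the `header` parameter can be illustrated with Python's standard `csv` module. This is only an analogy for how Spark treats the first line of a file, not the actual Spark implementation, and the column names and values below are made up for the example:

```python
import csv
import io

# A small hypothetical CSV whose first line is a header row.
raw = "vendor_id,total_amount\n1,9.95\n2,16.30\n"

# Analogous to header=True: the first line supplies the column names.
with_header = list(csv.DictReader(io.StringIO(raw)))
print(with_header[0])  # {'vendor_id': '1', 'total_amount': '9.95'}

# Analogous to header=False: every line, including the first, is data.
without_header = list(csv.reader(io.StringIO(raw)))
print(without_header[0])  # ['vendor_id', 'total_amount'] -- header row read as data
```

If your files lack a header row, leaving `header` at its default avoids silently losing the first row of data to column names.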
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your GCS bucket has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys `year` and `month` to indicate exactly which file you want to request from the available Batches.
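You can check this matching behavior locally with Python's `re` module before defining the asset. The sketch below uses the example filenames listed above and shows how the named groups become the keys that identify each Batch; it is an illustration of the regex semantics, not GX's internal implementation:

```python
import re

# The batching regex from the example above; the named groups
# ("year" and "month") become the keys used to identify Batches.
batching_regex = re.compile(
    r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
)

filenames = [
    "yellow_tripdata_sample_2021-11.csv",
    "yellow_tripdata_sample_2021-12.csv",
    "yellow_tripdata_sample_2023-01.csv",
]

# Each filename that fully matches the regex corresponds to one Batch.
batch_keys = [
    m.groupdict() for f in filenames if (m := batching_regex.fullmatch(f))
]
print(batch_keys)
# [{'year': '2021', 'month': '11'}, {'year': '2021', 'month': '12'},
#  {'year': '2023', 'month': '01'}]
```

All three files match, so the asset would contain three Batches, and a key pair such as `year="2021", month="11"` picks out exactly one of them.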
Next steps
- How to organize Batches in a file-based Data Asset
- How to request Data from a Data Asset
- How to create Expectations while interactively evaluating a set of data
- How to use the Onboarding Data Assistant to evaluate data and create Expectations
Additional information
External APIs
For more information on Google Cloud and authentication, please refer to Google Cloud's official documentation.
Related reading
For more details regarding storing credentials for use with GX, please see our guide: How to configure credentials