Version: 0.16.16

How to connect to data on GCS using Pandas

In this guide we will demonstrate how to use Pandas to connect to data stored on Google Cloud Storage (GCS). Our examples connect specifically to CSV files. However, Great Expectations (GX) supports most types of files that Pandas has read methods for.

Prerequisites

Steps

1. Import GX and instantiate a Data Context

The code to import Great Expectations and instantiate a Data Context is:

import great_expectations as gx

context = gx.get_context()

2. Create a Datasource

We can define a GCS datasource by providing three pieces of information:

  • name: In our example, we will name our Datasource "my_gcs_datasource"
  • bucket_or_name: In this example, we will provide a GCS bucket
  • gcs_options: We can provide various additional options here, but in this example we will leave this empty and use the default values.

datasource_name = "my_gcs_datasource"
bucket_or_name = "my_bucket"
gcs_options = {}

Once we have those three elements, we can define our Datasource like so:

datasource = context.sources.add_pandas_gcs(
    name=datasource_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)
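If the default credentials in your environment are not sufficient, gcs_options is where explicit credentials can be supplied. The following is a minimal sketch, assuming that a "filename" entry in gcs_options is used to load a service account key file. The path shown is hypothetical, so please check the credentials guide referenced under Additional information before relying on it.

```python
# Hypothetical path: replace with the location of your own service account key file.
service_account_key_path = "path/to/service_account_key.json"

# Assumption: a "filename" entry tells the Datasource to load service account
# credentials from that file instead of using the environment's defaults.
gcs_options = {"filename": service_account_key_path}
```

This dictionary would be passed as the gcs_options argument in place of the empty one used in this guide.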

3. Add GCS data to the Datasource as a Data Asset

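We can add a CSV Data Asset by providing a name for the asset, the GCS prefix under which the files are stored, and a batching regex that matches the file names:

asset_name = "my_taxi_data_asset"
gcs_prefix = "data/taxi_yellow_tripdata_samples/"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
data_asset = datasource.add_csv_asset(
    name=asset_name, batching_regex=batching_regex, gcs_prefix=gcs_prefix
)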

Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.

For example:

Let's say that your GCS bucket has the following files:

  • "yellow_tripdata_sample_2021-11.csv"
  • "yellow_tripdata_sample_2021-12.csv"
  • "yellow_tripdata_sample_2023-01.csv"

If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file.

However, if you define a partial file name with regex groups, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain three Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
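To make the matching behaviour concrete, the sketch below applies both patterns to the three example file names using only Python's built-in re module. It illustrates how the regex groups behave and is not GX's own matching code. The variable names and the use of fullmatch are assumptions made for this example.

```python
import re

# File names from the example bucket above.
file_names = [
    "yellow_tripdata_sample_2021-11.csv",
    "yellow_tripdata_sample_2021-12.csv",
    "yellow_tripdata_sample_2023-01.csv",
]

# A full file name with no regex groups matches exactly one file.
single_file_regex = re.compile(r"yellow_tripdata_sample_2023-01\.csv")
single_matches = [name for name in file_names if single_file_regex.fullmatch(name)]

# Named groups capture the year and month from every matching file name.
batching_regex = re.compile(
    r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
)
batch_keys = {}
for name in file_names:
    match = batching_regex.fullmatch(name)
    if match:
        batch_keys[name] = match.groupdict()

# Select the file whose captured keys equal the requested year and month.
requested = {"year": "2021", "month": "12"}
selected = [name for name, keys in batch_keys.items() if keys == requested]

print(single_matches)
print(len(batch_keys))
print(selected)
```

Note that the captured values are strings such as "2021" and "12", so the year and month used to pick a file are written as strings here as well.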

Next steps

Additional information

External APIs

For more information on Google Cloud and authentication, please visit the following:

For more details regarding storing credentials for use with GX, please see our guide: How to configure credentials