How to connect to data on Azure Blob Storage using Pandas
In this guide we will demonstrate how to use Pandas to connect to data stored in Azure Blob Storage. In this example we will specifically be connecting to data in CSV format; however, GX supports most of the read methods available through Pandas.
Prerequisites
- A working installation of Great Expectations with dependencies for Azure Blob Storage
- Access to data in Azure Blob Storage
Steps
1. Import GX and instantiate a Data Context
The code to import Great Expectations and instantiate a Data Context is:
```python
import great_expectations as gx

context = gx.get_context()
```
2. Create a Datasource
We can define an Azure Blob Storage Datasource by providing these pieces of information:
- `name`: In our example, we will name our Datasource `"my_datasource"`.
- `azure_options`: We provide our authentication settings here.
```python
datasource_name = "my_datasource"
azure_options = {
    "account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
    "credential": "${AZURE_CREDENTIAL}",
}
```
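If you prefer to authenticate with a single connection string, `azure_options` may alternatively accept a `conn_str` entry in place of `account_url` and `credential`. Treat this as an assumption to verify against your GX version; the `AZURE_STORAGE_CONNECTION_STRING` variable name below is illustrative:

```python
# Assumed alternative: authenticate with a single connection string.
# Provide either "account_url" or "conn_str", not both.
azure_options = {
    "conn_str": "${AZURE_STORAGE_CONNECTION_STRING}",
}
```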
We can create a Datasource that points to our Azure Blob Storage with the code:
```python
datasource = context.sources.add_pandas_abs(
    name=datasource_name, azure_options=azure_options
)
```
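As an optional sanity check (a minimal sketch, assuming the code above ran without error), you can confirm that the Datasource is now registered with your Data Context:

```python
# Optional: retrieve the new Datasource by name to confirm it was registered.
assert context.get_datasource(datasource_name).name == datasource_name
```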
In the above example, the value for `account_url` will be substituted with the contents of the `AZURE_STORAGE_ACCOUNT_URL` key you configured when you installed GX and set up your Azure Blob Storage dependencies.
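For reference, here is a minimal sketch of providing these values as environment variables from within Python. The values shown are placeholders only; storing them in `config_variables.yml` (see the credentials guide linked below) also works:

```python
import os

# Placeholders only: substitute your storage account's URL and credential.
# These must be set before the "${...}" references above are resolved.
os.environ["AZURE_STORAGE_ACCOUNT_URL"] = "https://<my_storage_account>.blob.core.windows.net"
os.environ["AZURE_CREDENTIAL"] = "<my_access_key_or_sas_token>"
```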
3. Add ABS data to the Datasource as a Data Asset
To specify the data to connect to, you will need the following elements:
- `name`: A name by which you can reference the Data Asset in the future.
- `batching_regex`: A regular expression that indicates which files to treat as Batches in your Data Asset and how to identify them.
- `abs_container`: The name of your Azure Blob Storage container.
- `abs_name_starts_with`: A string indicating what part of the `batching_regex` to truncate from the final batch names.
```python
asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
abs_container = "my_container"
abs_name_starts_with = "data/taxi_yellow_tripdata_samples/"
```
Once these values have been defined, we will create our Data Asset with the code:
```python
data_asset = datasource.add_csv_asset(
    name=asset_name,
    batching_regex=batching_regex,
    abs_container=abs_container,
    abs_name_starts_with=abs_name_starts_with,
)
```
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your Azure Blob Storage container has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as `"yellow_tripdata_sample_2023-01\.csv"`, your Data Asset will contain only one Batch, which will correspond to that file.

However, if you define a partial file name with a regex group, such as `"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"`, your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys `year` and `month` to indicate exactly which file you want to request from the available Batches.
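To make this concrete, here is a minimal sketch (assuming the three sample files above) of how the `year` and `month` keys are used when requesting Batches:

```python
# List every Batch that the batching_regex matched.
all_batches = data_asset.get_batch_list_from_batch_request(
    data_asset.build_batch_request()
)
print(len(all_batches))  # 3, one per matched file

# Request only the Batch for November 2021 by filling in the regex group names.
batch_request = data_asset.build_batch_request({"year": "2021", "month": "11"})
print(len(data_asset.get_batch_list_from_batch_request(batch_request)))  # 1
```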
Next steps
- How to organize Batches in a file-based Data Asset
- How to request Data from a Data Asset
- How to create Expectations while interactively evaluating a set of data
- How to use the Onboarding Data Assistant to evaluate data and create Expectations
Additional information
Related reading
For more details regarding storing credentials for use with GX, please see our guide: How to configure credentials