How to connect to data on Azure Blob Storage using Pandas
This guide will help you connect to your data stored on Microsoft Azure Blob Storage (ABS) using Pandas. This will allow you to Validate and explore your data.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- Access to data on an ABS container
Steps
1. Choose how to run the code in this guide
Get an environment to run the code in this guide. Please choose an option below.
- CLI + filesystem
- No CLI + filesystem
- No CLI + no filesystem
If you use the Great Expectations CLI (Command Line Interface), run this command to automatically generate a pre-configured Jupyter Notebook. Then you can follow along in the YAML-based workflow below:
great_expectations datasource new
If you use Great Expectations in an environment that has filesystem access, and prefer not to use the CLI (Command Line Interface), run the code in this guide in a notebook or other Python script.
If you use Great Expectations in an environment that has no filesystem (such as Databricks or AWS EMR), run the code in this guide in that system's preferred way.
2. Instantiate your project's DataContext
Import these necessary packages and modules.
from ruamel import yaml
import great_expectations as ge
from great_expectations.core.batch import Batch, BatchRequest
Load your DataContext into memory using the get_context() method.
context = ge.get_context()
3. Configure your Datasource
Great Expectations provides two types of Data Connector classes for connecting to ABS: InferredAssetAzureDataConnector and ConfiguredAssetAzureDataConnector.
- An InferredAssetAzureDataConnector uses regular expressions to infer data_asset_names by evaluating filename patterns that exist in your container. This DataConnector, along with a RuntimeDataConnector, is provided as a default when using our Jupyter Notebooks.
- A ConfiguredAssetAzureDataConnector requires an explicit listing of each Data Asset you want to connect to. This allows for more granularity and control than its Inferred counterpart but also requires a more complex setup.
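To make the inference step concrete, the sketch below mimics how a regular expression with one capture group could map blob names to data_asset_names. It uses only Python's re module rather than Great Expectations itself. The blob names, the pattern, and the helper infer_data_asset_names are hypothetical and chosen purely for illustration.

```python
import re
from collections import defaultdict

# Hypothetical blob names, as they might appear in a container.
blob_names = [
    "data/taxi_2019-01.csv",
    "data/taxi_2019-02.csv",
    "data/weather_2019-01.csv",
    "data/README.txt",
]

# Assumed pattern: the first capture group is treated as the data asset name.
pattern = re.compile(r"data/(.+)_\d{4}-\d{2}\.csv")


def infer_data_asset_names(names, regex):
    """Group blob names by the asset name captured by the regex."""
    assets = defaultdict(list)
    for name in names:
        match = regex.fullmatch(name)
        if match:  # blobs that do not match the pattern are ignored
            assets[match.group(1)].append(name)
    return dict(assets)


inferred = infer_data_asset_names(blob_names, pattern)
print(sorted(inferred))
```

Blobs that share a captured prefix end up under the same asset name, which is why a well-chosen pattern matters more than the number of files.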
As the InferredAssetDataConnectors have fewer options and are generally simpler to use, we recommend starting with them.
We've detailed example configurations for both options in the next section for your reference.
It is also important to note that the ABS DataConnectors for Pandas support two (mutually exclusive) methods of authentication. You should be aware of the following options when configuring your own environment:
- account_url key in the azure_options dictionary. This is the default option and what is used throughout this guide.
- conn_str key in the azure_options dictionary.
In all cases, the AZURE_CREDENTIAL environment variable is required.
The azure_options dictionary is also responsible for storing any **kwargs you wish to pass to the ABS BlobServiceClient connection object.
For more details regarding storing credentials for use with Great Expectations see: How to configure credentials
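As an illustration of the two mutually exclusive authentication keys, here is a minimal sketch of a helper that assembles an azure_options dictionary. The helper build_azure_options is hypothetical and not part of Great Expectations. It also assumes that the value of the AZURE_CREDENTIAL environment variable is what you place under the credential key, which you should confirm for your own setup.

```python
import os


def build_azure_options(account_url=None, conn_str=None, env=None, **client_kwargs):
    """Return an azure_options dict that uses exactly one authentication key."""
    env = os.environ if env is None else env
    # account_url and conn_str are mutually exclusive: require exactly one.
    if (account_url is None) == (conn_str is None):
        raise ValueError("Provide exactly one of account_url or conn_str.")
    # Assumption: the credential value is read from AZURE_CREDENTIAL.
    credential = env.get("AZURE_CREDENTIAL")
    if not credential:
        raise ValueError("The AZURE_CREDENTIAL environment variable is not set.")
    options = {"credential": credential}
    if account_url is not None:
        options["account_url"] = account_url
    else:
        options["conn_str"] = conn_str
    # Extra keyword arguments travel in the same dictionary.
    options.update(client_kwargs)
    return options


# Example with a stand-in environment mapping and placeholder values.
example = build_azure_options(
    account_url="<YOUR_ACCOUNT_URL>",
    env={"AZURE_CREDENTIAL": "<YOUR_CREDENTIAL>"},
)
print(example)
```

Passing the environment as a mapping keeps the helper easy to try out without changing real environment variables.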
For more details regarding authentication and access using Python, please visit the following:
Using these example configurations, add in your ABS container and path to a directory that contains some of your data:
- Inferred + Runtime (Default)
- Configured
The configuration below is representative of the default setup you'll see when preparing your own environment.
- YAML
- Python
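As a rough sketch of such a default setup, an Inferred + Runtime Datasource could be written as the Python dictionary below. The connector names match the BatchRequest used later in this guide. The name_starts_with, default_regex, and batch_identifiers entries, the <CONTAINER_PATH_TO_DATA> placeholder, and the variable name inferred_datasource_config are assumptions for illustration, so check them against your Great Expectations version.

```python
# A sketch only: entries marked "assumed" are not taken from this guide.
inferred_datasource_config = {
    "name": "my_azure_datasource",
    "class_name": "Datasource",
    "execution_engine": {
        "class_name": "PandasExecutionEngine",
        "azure_options": {
            "account_url": "<YOUR_ACCOUNT_URL>",  # or "conn_str"
            "credential": "<YOUR_CREDENTIAL>",
        },
    },
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],  # assumed
        },
        "default_inferred_data_connector_name": {
            "class_name": "InferredAssetAzureDataConnector",
            "azure_options": {
                "account_url": "<YOUR_ACCOUNT_URL>",  # or "conn_str"
                "credential": "<YOUR_CREDENTIAL>",
            },
            "container": "<YOUR_AZURE_CONTAINER_HERE>",
            "name_starts_with": "<CONTAINER_PATH_TO_DATA>",  # assumed key
            "default_regex": {  # assumed: one group captured as the asset name
                "pattern": r"(.*)\.csv",
                "group_names": ["data_asset_name"],
            },
        },
    },
}

print(sorted(inferred_datasource_config["data_connectors"]))
```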
The configuration below is highly tuned to the specific container and blobs relevant to this example. You'll have to fine-tune your own regular expressions and assets to fit your use case.
- YAML
- Python
context = ge.get_context()
datasource_yaml = f"""
name: my_azure_datasource
class_name: Datasource
execution_engine:
    class_name: PandasExecutionEngine
    azure_options:
        account_url: <YOUR_ACCOUNT_URL> # or `conn_str`
        credential: <YOUR_CREDENTIAL>   # if using a protected container
data_connectors:
    configured_data_connector_name:
        class_name: ConfiguredAssetAzureDataConnector
        azure_options:
            account_url: <YOUR_ACCOUNT_URL> # or `conn_str`
            credential: <YOUR_CREDENTIAL>   # if using a protected container
        container: <YOUR_AZURE_CONTAINER_HERE>
        # add the regular expression and asset settings for your data here
"""
Run this code to test your configuration.
context = ge.get_context()
datasource_config = {
    "name": "my_azure_datasource",
    "class_name": "Datasource",
    "execution_engine": {
        "class_name": "PandasExecutionEngine",
        "azure_options": {
            "account_url": "<YOUR_ACCOUNT_URL>",
            "credential": "<YOUR_CREDENTIAL>",
        },
    },
    "data_connectors": {
        "configured_data_connector_name": {
            "class_name": "ConfiguredAssetAzureDataConnector",
            "azure_options": {
                "account_url": "<YOUR_ACCOUNT_URL>",
                "credential": "<YOUR_CREDENTIAL>",
            },
            "container": "<YOUR_AZURE_CONTAINER_HERE>",
            # add the regular expression and asset settings for your data here
        },
    },
}
Run this code to test your configuration.
If you specified an ABS path containing CSV files you will see them listed as Available data_asset_names in the output of test_yaml_config().
Feel free to adjust your configuration and re-run test_yaml_config() as needed.
4. Save the Datasource configuration to your DataContext
Save the configuration into your DataContext by using the add_datasource() function.
- YAML
context.add_datasource(**yaml.load(datasource_yaml))
- Python
context.add_datasource(**datasource_config)
5. Test your new Datasource
Verify your new Datasource by loading data from it into a Validator using a Batch Request.
Add the name of the data asset to the data_asset_name in your BatchRequest.
batch_request = BatchRequest(
    datasource_name="my_azure_datasource",
    data_connector_name="default_inferred_data_connector_name",
    data_asset_name="<YOUR_DATA_ASSET_NAME>",
)
Then load data into the Validator.
context.create_expectation_suite(
    expectation_suite_name="test_suite", overwrite_existing=True
)
validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name="test_suite"
)
🚀🚀 Congratulations! 🚀🚀 You successfully connected Great Expectations with your data.
Additional Notes
If you are working with nonstandard CSVs, read one of these guides:
- How to work with headerless CSVs in pandas
- How to work with custom delimited CSVs in pandas
- How to work with parquet files in pandas
To view the full scripts used in this page, see them on GitHub:
- inferred_and_runtime_yaml_example.py
- inferred_and_runtime_python_example.py
- configured_yaml_example.py
- configured_python_example.py
To review the source code of these DataConnectors, also visit GitHub: