How to configure a RuntimeDataConnector
This guide demonstrates how to configure a RuntimeDataConnector and only applies to the V3 (Batch Request) API. A RuntimeDataConnector allows you to specify a Batch using a Runtime Batch Request, which is used to create a Validator. A Validator is the key object used to create Expectations and Validate datasets.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- An understanding of the basics of Datasources in the V3 (Batch Request) API
- Learned how to configure a Data Context using test_yaml_config
A RuntimeDataConnector is a special kind of Data Connector that enables you to use a RuntimeBatchRequest to provide a Batch's data directly at runtime. The RuntimeBatchRequest can wrap an in-memory dataframe, a filepath, or a SQL query, and must include batch identifiers that uniquely identify the data (e.g. a run_id from an Airflow DAG run). The batch identifiers that must be passed in at runtime are specified in the RuntimeDataConnector's configuration.
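The three kinds of runtime data correspond to the `path`, `batch_data`, and `query` keys of `runtime_parameters`, and a request carries exactly one of them. The following is a minimal sketch in plain Python, with no Great Expectations import, of a hypothetical helper that builds such a dictionary and rejects zero or multiple sources. `make_runtime_parameters` is an illustrative name, not part of the library.

```python
from typing import Any, Dict, Optional


def make_runtime_parameters(
    path: Optional[str] = None,
    batch_data: Optional[Any] = None,
    query: Optional[str] = None,
) -> Dict[str, Any]:
    """Return a runtime_parameters dict holding exactly one data source.

    Hypothetical helper for illustration; not a Great Expectations API.
    """
    candidates = {"path": path, "batch_data": batch_data, "query": query}
    # Keep only the sources the caller actually supplied.
    supplied = {key: value for key, value in candidates.items() if value is not None}
    if len(supplied) != 1:
        raise ValueError(
            "Supply exactly one of 'path', 'batch_data', or 'query'; "
            f"got {sorted(supplied) or 'none'}"
        )
    return supplied


runtime_parameters = make_runtime_parameters(path="PATH_TO_YOUR_DATA_HERE")
print(runtime_parameters)
```

The resulting dictionary is what would be passed as the `runtime_parameters` argument of a RuntimeBatchRequest.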
Steps
1. Instantiate your project's DataContext
Import these necessary packages and modules:
YAML:
from ruamel import yaml

import great_expectations as gx
from great_expectations.core.batch import RuntimeBatchRequest

context = gx.get_context()

Python:
from ruamel import yaml  # needed for yaml.dump when testing the configuration in step 2

import great_expectations as gx
from great_expectations.core.batch import RuntimeBatchRequest

context = gx.get_context()
2. Set up a Datasource
All of the examples below assume you’re testing configuration using something like:
YAML:
datasource_yaml = """
name: taxi_datasource
class_name: Datasource
execution_engine:
  class_name: PandasExecutionEngine
data_connectors:
  <DATACONNECTOR NAME GOES HERE>:
    <DATACONNECTOR CONFIGURATION GOES HERE>
"""
context.test_yaml_config(yaml_config=datasource_yaml)

Python:
datasource_config = {
    "name": "taxi_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "PandasExecutionEngine",
    },
    "data_connectors": {
        "<DATACONNECTOR NAME GOES HERE>": {
            "<DATACONNECTOR CONFIGURATION GOES HERE>"
        },
    },
}
context.test_yaml_config(yaml.dump(datasource_config))
If you’re not familiar with the test_yaml_config method, please check out: How to configure Data Context components using test_yaml_config
3. Add a RuntimeDataConnector to a Datasource configuration
This basic configuration can be used in multiple ways depending on how the RuntimeBatchRequest is configured:

YAML:
datasource_yaml = r"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: PandasExecutionEngine
data_connectors:
  default_runtime_data_connector_name:
    class_name: RuntimeDataConnector
    batch_identifiers:
      - default_identifier_name
"""

Python:
datasource_config = {
    "name": "taxi_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "PandasExecutionEngine",
    },
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        },
    },
}
Once the RuntimeDataConnector is configured, you can add your Datasource using:
context.add_datasource(**datasource_config)
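In the Python dictionary form of this configuration, the values most likely to differ between projects are the Datasource name, the Data Connector name, and the list of batch identifiers. As a sketch under that assumption, a small function can produce the same dictionary from those three inputs. `build_runtime_datasource_config` is a hypothetical helper, not a Great Expectations function, and it hard-codes the Pandas execution engine used on this page.

```python
from typing import Any, Dict, List


def build_runtime_datasource_config(
    datasource_name: str,
    data_connector_name: str,
    batch_identifiers: List[str],
) -> Dict[str, Any]:
    """Build a Pandas Datasource config with a single RuntimeDataConnector.

    Hypothetical helper for illustration; the keys mirror the dictionary
    shown in this guide.
    """
    return {
        "name": datasource_name,
        "class_name": "Datasource",
        "module_name": "great_expectations.datasource",
        "execution_engine": {
            "module_name": "great_expectations.execution_engine",
            "class_name": "PandasExecutionEngine",
        },
        "data_connectors": {
            data_connector_name: {
                "class_name": "RuntimeDataConnector",
                # Copy the list so later edits by the caller do not leak in.
                "batch_identifiers": list(batch_identifiers),
            },
        },
    }


datasource_config = build_runtime_datasource_config(
    datasource_name="taxi_datasource",
    data_connector_name="default_runtime_data_connector_name",
    batch_identifiers=["default_identifier_name"],
)
print(datasource_config["data_connectors"])
```

The returned dictionary has the same shape as the hand-written one, so it could be unpacked into `context.add_datasource` in the same way.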
Example 1: RuntimeDataConnector for access to file-system data:
At runtime, you would get a Validator from the Data Context by first defining a RuntimeBatchRequest with the path to your data defined in runtime_parameters:
batch_request = RuntimeBatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="YOUR_MEANINGFUL_NAME",  # This can be anything that identifies this data_asset for you
    runtime_parameters={"path": "PATH_TO_YOUR_DATA_HERE"},  # Add your path here.
    batch_identifiers={"default_identifier_name": "YOUR_MEANINGFUL_IDENTIFIER"},
)
Next, you would pass that request into context.get_validator:
validator = context.get_validator(
    batch_request=batch_request,
    create_expectation_suite_with_name="MY_EXPECTATION_SUITE_NAME",
)
print(validator.head())
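The keys of `batch_identifiers` in the request have to line up with the names listed under `batch_identifiers` in the Data Connector configuration, so a misspelled key is an easy mistake to make. A hypothetical pre-flight check, written in plain Python and not a description of how Great Expectations performs its own validation, might look like this:

```python
from typing import Dict, Iterable, List


def unknown_batch_identifiers(
    configured: Iterable[str], supplied: Dict[str, str]
) -> List[str]:
    """Return the supplied identifier names that the connector does not declare.

    Hypothetical helper for illustration; not a Great Expectations API.
    """
    allowed = set(configured)
    return sorted(name for name in supplied if name not in allowed)


# Names declared in the RuntimeDataConnector configuration on this page.
configured = ["default_identifier_name"]

ok = {"default_identifier_name": "YOUR_MEANINGFUL_IDENTIFIER"}
# Deliberate misspelling ("identifer") to show what the check reports.
typo = {"default_identifer_name": "YOUR_MEANINGFUL_IDENTIFIER"}

print(unknown_batch_identifiers(configured, ok))
print(unknown_batch_identifiers(configured, typo))
```

An empty list means every key in the request is declared by the connector; any name in the list should be corrected before building the RuntimeBatchRequest.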
Example 2: RuntimeDataConnector that uses an in-memory DataFrame
At runtime, you would get a Validator from the Data Context by first defining a RuntimeBatchRequest with the DataFrame passed into batch_data in runtime_parameters:
import pandas as pd

path = "PATH_TO_YOUR_DATA_HERE"
df = pd.read_csv(path)

batch_request = RuntimeBatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="YOUR_MEANINGFUL_NAME",  # This can be anything that identifies this data_asset for you
    runtime_parameters={"batch_data": df},  # Pass your DataFrame here.
    batch_identifiers={"default_identifier_name": "YOUR_MEANINGFUL_IDENTIFIER"},
)
Next, you would pass that request into context.get_validator:
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="MY_EXPECTATION_SUITE_NAME",
)
print(validator.head())
Additional Notes
To view the full script used in this page, see it on GitHub: