Batch
Overview
Definition
A Batch is a selection of records from a Data Asset (a collection of records within a Datasource, usually named based on the underlying data system and sliced to correspond to a desired specification).
Features and promises
A Batch provides a consistent interface for describing specific data from any Datasource (which provides a standard API for accessing and interacting with data from a wide variety of source systems), to support building Metrics (computed attributes of data, such as the mean of a column), Validation (the act of applying an Expectation Suite to a Batch), and Profiling (the act of generating Metrics and candidate Expectations from data).
Relationship to other objects
A Batch is generated by providing a Batch Request to a Datasource. It provides a reference to interact with the data through the Datasource and adds metadata to precisely identify the specific data included in the Batch.
Profilers (which generate Metrics and candidate Expectations from data) use Batches to generate Metrics and potential Expectations (verifiable assertions about data) based on the data. Batches make it possible for the Profiler to compare data over time and sample from large datasets to improve performance. Metrics are always associated with a Batch of data. The identifier for the Batch is the primary way that Great Expectations identifies what data to use when computing a Metric and how to store that Metric.
Batches are also used by Validators when they run an Expectation Suite against data.
Use Cases
Create Expectations
When creating Expectations interactively, a Validator (used to run an Expectation Suite against data) needs access to a specific Batch of data against which to check Expectations. The how-to guide on interactively creating Expectations covers using a Batch in this use case.
Our in-depth guide on how to create and edit Expectations with a Profiler covers how to specify which Batches of data should be used when Great Expectations generates statistics and candidate Expectations for your data.
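As a minimal sketch of the interactive workflow (the Datasource, Data Asset, suite, and column names below are hypothetical placeholders, and this assumes an Expectation Suite named "my_suite" already exists in your Data Context):

import great_expectations as gx
from great_expectations.core.batch import BatchRequest

context = gx.get_context()

# Describe the Batch of data to work with interactively.
batch_request = BatchRequest(
    datasource_name="my_datasource",
    data_connector_name="my_data_connector",
    data_asset_name="my_data_asset",
)

# The Validator pairs the Batch with an Expectation Suite.
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="my_suite",
)

# Each expect_* call is evaluated immediately against the Batch's data.
validator.expect_column_values_to_not_be_null("my_column")
validator.save_expectation_suite(discard_failed_expectations=False)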
Validate Data
During Validation, a Checkpoint (the primary means for validating data in a production deployment of Great Expectations) will check a Batch of data against Expectations from an Expectation Suite (a collection of verifiable assertions about data). You must specify a Batch Request or provide a Batch of data at runtime for the Checkpoint to run.
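A rough sketch of wiring a Batch Request into a Checkpoint (names are placeholders, reusing the context and batch_request from the sketch above; exact configuration keys can vary between Great Expectations versions):

# Register a Checkpoint that validates the requested Batch against a suite.
context.add_checkpoint(
    name="my_checkpoint",
    config_version=1.0,
    class_name="SimpleCheckpoint",
    validations=[
        {
            "batch_request": batch_request,
            "expectation_suite_name": "my_suite",
        }
    ],
)

# Running the Checkpoint fetches the Batch and validates it.
results = context.run_checkpoint(checkpoint_name="my_checkpoint")
print(results.success)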
Features
Consistent Interface for Describing Specific Data from any Datasource
A Batch is always part of a Data Asset. The Data Asset is sliced into Batches to correspond to the specification you define in a Data Connector, allowing you to define Batches of a Data Asset based on times from the data, pipeline runs, or the time of a Validation.
A Batch is always built using a Batch Request. The Batch Request includes a "query" for the Data Connector to describe the data that will be included in the Batch. The query makes it possible to create a Batch Request for the most recent Batch of data without defining the specific timeframe, for example.
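For example, under the assumption of a Data Connector that sorts a Data Asset's Batches chronologically, a data_connector_query can ask for the most recent Batch by index rather than by an explicit date (names are placeholders):

from great_expectations.core.batch import BatchRequest

# "index": -1 selects the last Batch after the Data Connector sorts them,
# i.e. "fetch the most recent Batch" without naming a specific timeframe.
batch_request = BatchRequest(
    datasource_name="my_datasource",
    data_connector_name="my_data_connector",
    data_asset_name="my_data_asset",
    data_connector_query={"index": -1},
)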
Once a Datasource identifies the specific data that will be included in a Batch based on the Batch Request, it creates a reference to the data, and adds metadata including a Batch Definition, Batch Spec, and Batch Markers. That additional metadata is how Great Expectations identifies the Batch when accessing or storing Metrics.
API Basics
How to access
You will typically not need to access a Batch directly. Instead, you will pass it to a Great Expectations object such as a Profiler, Validator, or Checkpoint, which will then use the Batch's data to compute Metrics, check Expectations, or run a Validation.
How to create
The BatchRequest object is the primary API used to construct Batches. It is provided to the get_validator method on DataContext.
- For more information, see our documentation on Batch Requests.
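If you want the Batches themselves rather than a Validator, a Data Context can also resolve a Batch Request directly into a list of Batches. A minimal sketch, assuming a context and a BatchRequest like the ones shown earlier:

# Resolve the Batch Request into concrete Batches without building a Validator.
batch_list = context.get_batch_list(batch_request=batch_request)
print(len(batch_list))  # one Batch per Batch Definition the request resolved to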
Instantiating a Batch does not necessarily “fetch” the data by immediately running a query or pulling data into memory. Instead, think of a Batch as a wrapper that includes the information that you will need to fetch the right data when it’s time to Validate.
More details
Batches: Design Motivation
Batches are designed to be "MECE" -- mutually exclusive and collectively exhaustive partitions of Data Assets. However, in many cases the same underlying data could be present in multiple batches, for example if an analyst runs an analysis against an entire table of data each day, with only a fraction of new records being added.
Consequently, what "makes a Batch a Batch" is ultimately the act of attending to it. Once you have defined how a Datasource's data should be sliced (even if that is a single slice containing all of the data in the Datasource), you have determined what makes those particular Batches "a Batch." The Batch is the fundamental unit that Great Expectations validates and about which it collects Metrics.
Batches and Batch Requests: Design Motivation
You do not generally need to access the metadata that Great Expectations uses to define a Batch; typically, you specify only the Batch Request. The Batch Request describes what data Great Expectations should fetch, including the name of the Data Asset and other identifiers (see more detail below).
A Batch Definition includes all the information required to precisely identify a set of data from the external data source that should be translated into a Batch. One or more BatchDefinitions are always returned from the Datasource, as a result of processing the Batch Request. A Batch Definition includes several key components:
- Batch Identifiers: contains information that uniquely identifies a specific batch from the Data Asset, such as the delivery date or query time.
- Engine Passthrough: contains information that will be passed directly to the Execution Engine as part of the Batch Spec.
- Sample Definition: contains information about sampling or limiting done on the Data Asset to create a Batch.
We recommend that you make every Data Asset Name unique in your Data Context configuration. Even though a Batch Definition includes the Data Connector Name and Datasource Name, choosing a unique Data Asset name makes it easier to navigate quickly through Data Docs and ensures your logical data assets are not confused with any particular view of them provided by an Execution Engine.
A Batch Spec is an Execution Engine-specific description of the Batch. The Data Connector is responsible for working with the Execution Engine to translate the Batch Definition into a spec that enables Great Expectations to access the data using that Execution Engine.
Finally, BatchMarkers are additional pieces of metadata that can be useful for understanding reproducibility, such as the time the Batch was constructed or a hash of an in-memory DataFrame.
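In code, this metadata is exposed on the Batch object itself. A quick way to inspect it (assuming a Batch list obtained as in the earlier sketches; these attribute names reflect the current Batch API and may differ in other versions):

# Each Batch carries the metadata Great Expectations used to identify and load it.
batch = batch_list[0]
print(batch.batch_definition)  # Datasource, Data Connector, Data Asset name, and batch identifiers
print(batch.batch_spec)        # Execution Engine-specific instructions used to load the data
print(batch.batch_markers)     # e.g. load time; useful for reproducibility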
Batches and Batch Requests: A full journey
Let's trace the journey from a BatchRequest to a list of Batches:
- A Datasource's get_batch_list_from_batch_request method is passed a BatchRequest.
- A BatchRequest can include data_connector_query params with values relative to the latest Batch (e.g. the "latest" slice). Conceptually, this enables "fetch the latest Batch" behavior. It is the key thing that differentiates a BatchRequest, which does NOT necessarily uniquely identify the Batch(es) to be fetched, from a BatchDefinition.
- The BatchRequest can also include a section called batch_spec_passthrough to make it easy to directly communicate parameters to a specific Execution Engine.
- When resolved, the BatchRequest may point to many BatchDefinitions and Batches.
- BatchRequests can be defined as dictionaries, or by instantiating a BatchRequest object.
- For example, a RuntimeBatchRequest can include runtime_parameters and batch_identifiers in addition to batch_spec_passthrough:
from great_expectations.core.batch import RuntimeBatchRequest

# path_to_file: the path to the file containing the data for this Batch
runtime_batch_request = RuntimeBatchRequest(
    datasource_name="my_pandas_datasource",
    data_connector_name="my_runtime_data_connector",
    data_asset_name="insert_your_data_asset_name_here",
    runtime_parameters={"path": path_to_file},
    batch_identifiers={
        "some_key_maybe_pipeline_stage": "ingestion step 1",
        "some_other_key_maybe_airflow_run_id": "run 18",
    },
    batch_spec_passthrough={
        "reader_method": "read_csv",
        "reader_options": {"sep": ",", "header": 0},
    },
)
- The Datasource finds the Data Connector indicated by the BatchRequest, and uses it to obtain a BatchDefinition list.
Datasource.get_batch_list_from_batch_request(batch_request=batch_request)
- A BatchDefinition resolves any ambiguity in the BatchRequest to uniquely identify a single Batch to be fetched. BatchDefinitions are Datasource- and Execution Engine-agnostic: their parameters may depend on the configuration of the Datasource, but they do not otherwise depend on the specific Data Connector type (e.g. filesystem, SQL, etc.) or Execution Engine being used to instantiate Batches.
BatchDefinition
    datasource: str
    data_connector: str
    data_asset_name: str
    batch_identifiers:
        ** contents depend on the configuration of the DataConnector **
        ** provides a persistent, unique identifier for the Batch within the context of the Data Asset **
- The Datasource then requests that the Data Connector transform the BatchDefinition list into BatchData, BatchSpec, and BatchMarkers.
- When the Data Connector receives this request, it first builds the BatchSpec, then calls its Execution Engine to create BatchData and BatchMarkers.
- A BatchSpec is a set of specific instructions for the Execution Engine to fetch specific data; it is the Execution Engine-specific version of the BatchDefinition. For example, a BatchSpec could include the path to files, information about headers, or other configuration required to ensure the data is loaded properly for validation (see the sketch after this list).
- Batch Markers are metadata that can be used to calculate performance characteristics, ensure reproducibility of Validation Results, and provide indicators of the state of the underlying data system.
- After the Data Connector returns the BatchSpec, BatchData, and BatchMarkers, the Datasource builds and returns a list of Batches.
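To make the BatchSpec step concrete: for a file-based Data Connector with a Pandas Execution Engine, the spec carries roughly the following kind of information (an illustrative sketch only, not the exact structure Great Expectations produces; the path is a placeholder):

# Illustrative only: the sort of details a file-based BatchSpec conveys so the
# Execution Engine can load the right data in the right way.
batch_spec_sketch = {
    "path": "/data/my_data_asset/2021-01.csv",    # which file backs this Batch
    "reader_method": "read_csv",                   # how the Pandas Execution Engine should read it
    "reader_options": {"sep": ",", "header": 0},   # options forwarded to the reader
}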