Batch
A Batch is a selection of records from a Data Asset (a collection of records within a Datasource, usually named after the underlying data system and sliced to correspond to a desired specification).
A Batch provides a consistent interface for describing specific data from any Datasource (which provides a standard API for accessing and interacting with data from a wide variety of source systems). It supports building Metrics (computed attributes of data, such as the mean of a column), Validation (the act of applying an Expectation Suite to a Batch), and Profiling (the act of generating Metrics and candidate Expectations from data).
Batches are designed to be "MECE" -- mutually exclusive and collectively exhaustive partitions of Data Assets. However, in many cases the same underlying data could be present in multiple batches, for example if an analyst runs an analysis against an entire table of data each day, with only a fraction of new records being added.
Consequently, the best way to understand what "makes a Batch a Batch" is the act of attending to it. Once you have defined how a Datasource's data should be sliced (even if that is to define a single slice containing all of the data in the Data Asset), you have determined what makes those particular Batches "a Batch." The Batch is the fundamental unit that Great Expectations will validate and about which it will collect metrics.
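To make the MECE idea concrete, here is a minimal sketch in plain Python. It does not use the Great Expectations API: the records, the slice_by_month helper, and the is_mece check are all hypothetical names invented for illustration. It slices a small set of records by month and then checks that every record lands in exactly one slice.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records standing in for a Data Asset (not the Great Expectations API).
records = [
    {"id": 1, "created": date(2023, 1, 5)},
    {"id": 2, "created": date(2023, 1, 20)},
    {"id": 3, "created": date(2023, 2, 2)},
    {"id": 4, "created": date(2023, 3, 14)},
]


def slice_by_month(rows):
    """Group rows into slices keyed by the (year, month) of their 'created' field."""
    slices = defaultdict(list)
    for row in rows:
        key = (row["created"].year, row["created"].month)
        slices[key].append(row)
    return dict(slices)


def is_mece(rows, slices):
    """Return True if every row id appears in exactly one slice (ids assumed unique)."""
    seen = [row["id"] for part in slices.values() for row in part]
    all_ids = {row["id"] for row in rows}
    return len(seen) == len(set(seen)) and set(seen) == all_ids


# Slicing by month gives non-overlapping slices that together cover all records.
batches = slice_by_month(records)

# Daily full-table snapshots overlap, so they are exhaustive but not mutually exclusive.
snapshots = {"day1": records[:3], "day2": records}
```

Here is_mece(records, batches) holds for the monthly slices, while the overlapping snapshots fail the same check, which mirrors the daily full-table analysis described above.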
Relationship to other objects
A Batch is generated by providing a Batch Request (which is provided to a Datasource in order to create a Batch) to a Data Asset. It provides a reference to interact with the data through the Data Asset and adds metadata to precisely identify the specific data included in the Batch.
Profilers (which generate Metrics and candidate Expectations from data) use Batches to generate Metrics and potential Expectations (verifiable assertions about data) based on the data. Batches make it possible for the Profiler to compare data over time and sample from large datasets to improve performance.
Metrics are always associated with a Batch of data. The identifier for the Batch is the primary way that Great Expectations identifies what data to use when computing a Metric and how to store that Metric.
Batches are also used by Validators (which are used to run an Expectation Suite against data) when they run an Expectation Suite against data.
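As a rough illustration of how a Batch identifier can tie a Metric to its data, the sketch below keeps computed values in a dictionary keyed by batch identifier and metric name. The metric_store dictionary, the compute_and_store_mean function, and the identifier strings are assumptions made for this example; they are not how Great Expectations stores Metrics internally.

```python
from statistics import mean

# Hypothetical in-memory metric store, keyed by (batch identifier, metric name).
metric_store = {}


def compute_and_store_mean(batch_id, column, rows):
    """Compute the mean of one column for one batch and store it under the batch's id."""
    value = mean(row[column] for row in rows)
    metric_store[(batch_id, f"{column}.mean")] = value
    return value


# Two batches of the same (hypothetical) asset produce two separately stored metrics.
compute_and_store_mean("2023-01", "amount", [{"amount": 10}, {"amount": 20}])
compute_and_store_mean("2023-02", "amount", [{"amount": 40}])
```

Because the batch identifier is part of the key, the same Metric computed on different Batches can be stored side by side and compared over time.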
Use Cases
When creating Expectations interactively, a Validator needs access to a specific Batch of data against which to check Expectations. The how-to guide on interactively creating Expectations covers using a Batch in this use case.
During Validation, a Checkpoint (the primary means for validating data in a production deployment of Great Expectations) checks a Batch of data against Expectations from an Expectation Suite (a collection of verifiable assertions about data). You must specify a Batch Request for the Checkpoint to run.
Consistency
A Batch is always part of a Data Asset. A Data Asset can be configured to slice its data into Batches in many ways. For example, the slicing can be based on an arbitrary field from the data, including datetime fields.
A Batch is always built using a Batch Request. For details, see the Batch Request documentation or the connecting to data guide for your specific source.
Once a Data Asset identifies the specific data that will be included in a Batch based on the Batch Request, it creates a reference to the data and adds metadata, including the parameters used in the Batch Request.
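The flow from Batch Request to Batch can be sketched as follows. This is a simplified model under stated assumptions, not the library's implementation: SketchBatchRequest, SketchBatch, the resolve function, and the taxi_trips asset name are hypothetical and exist only to show a data reference being paired with metadata copied from the request.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SketchBatchRequest:
    """Hypothetical request: which asset, and which slice of it, is wanted."""
    asset_name: str
    options: dict = field(default_factory=dict)


@dataclass(frozen=True)
class SketchBatch:
    """Hypothetical batch: a reference to the data plus identifying metadata."""
    data_reference: str
    metadata: dict


def resolve(request):
    """Build a batch whose metadata records the parameters used in the request."""
    suffix = "/".join(f"{k}={v}" for k, v in sorted(request.options.items()))
    reference = f"{request.asset_name}/{suffix}" if suffix else request.asset_name
    metadata = {"asset_name": request.asset_name, **request.options}
    return SketchBatch(data_reference=reference, metadata=metadata)


batch = resolve(SketchBatchRequest("taxi_trips", {"year": 2023, "month": 1}))
```

The resulting object carries no data itself, only a reference and the request parameters needed to identify exactly which slice it describes.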
Access
You can access a Batch through the Data Asset's get_batch_list_from_batch_request method.
You typically will not need to access the Batch directly. Instead, you will pass a Batch Request to a Great Expectations object such as a Profiler, Validator, or Checkpoint, which will then act on the Batch's data.
Create
The BatchRequest object is the primary API used to construct Batches. You construct a Batch Request that corresponds to a Batch via the Data Asset's build_batch_request method.
- For more information, see our documentation on Batch Requests.
Instantiating a Batch does not necessarily “fetch” the data by immediately running a query or pulling data into memory. Instead, think of a Batch as a wrapper that includes the information that you will need to fetch the right data when it’s time to Validate.
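The deferred-fetch behavior can be pictured with a small wrapper like the one below. This is a sketch of the general lazy-loading pattern, not the actual Batch class: LazyBatch, load_rows, and the calls list are hypothetical names used only to show that creating the wrapper does not run the query.

```python
class LazyBatch:
    """Hypothetical wrapper that defers loading until the data is requested."""

    def __init__(self, loader, metadata):
        self._loader = loader      # callable that fetches the data when invoked
        self.metadata = metadata   # identifies which data this batch refers to
        self._data = None
        self.loaded = False

    def data(self):
        """Fetch the data on first access and reuse it afterwards."""
        if not self.loaded:
            self._data = self._loader()
            self.loaded = True
        return self._data


calls = []  # records each time the (simulated) query actually runs


def load_rows():
    """Stand-in for an expensive query or file read."""
    calls.append("query")
    return [{"id": 1}, {"id": 2}]


# Creating the wrapper records the metadata but does not call load_rows.
batch = LazyBatch(load_rows, {"month": "2023-01"})
```

Nothing is fetched until batch.data() is called, and repeated calls reuse the rows fetched the first time.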