How to collect OpenLineage metadata using an Action
OpenLineage is an open framework for collection and analysis of data lineage. It tracks the movement of data over time, tracing relationships between datasets. Data engineers can use data lineage metadata to determine the root cause of failures, identify performance bottlenecks, and simulate the effects of planned changes.
Enhancing the metadata in OpenLineage with results from an Expectation Suite makes it possible to answer questions like:
- have there been failed assertions in any upstream datasets?
- what jobs are currently consuming data that is known to be of poor quality?
- is there something in common among failed assertions that seem otherwise unrelated?
This guide will explain how to use an Action to emit results to an OpenLineage backend, where their effect on related datasets can be studied.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- Created at least one Expectation Suite
- Created at least one Checkpoint, which you will need in order to test that the OpenLineage Validation is working.
Steps
1. Ensure that the openlineage-common package has been installed in your Python environment.
% pip3 install openlineage-common
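To confirm that the install landed in the same Python environment Great Expectations runs in, a small import check can help. The following is a minimal sketch using only the standard library; the helper name `is_importable` is hypothetical, and the module path in the usage comment is the one referenced in the configuration in step 2.

```python
import importlib.util


def is_importable(module_name):
    """Return True if module_name can be located in the current environment."""
    try:
        # find_spec imports parent packages, so a missing parent raises ImportError
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        return False


# Example check for the module used by the Action in step 2:
# is_importable("openlineage.common.provider.great_expectations")
```

If the check returns False, the package was most likely installed into a different interpreter than the one running this code.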
2. Update the action_list key in your Validation Operator config.
Add the OpenLineageValidationAction action to the action_list key of the ActionListValidationOperator config in your great_expectations.yml.
validation_operators:
  action_list_operator:
    class_name: ActionListValidationOperator
    action_list:
      - name: openlineage
        action:
          class_name: OpenLineageValidationAction
          module_name: openlineage.common.provider.great_expectations
          openlineage_host: ${OPENLINEAGE_URL}
          openlineage_apiKey: ${OPENLINEAGE_API_KEY}
          job_name: ge_validation # This is user-definable
          openlineage_namespace: ge_namespace # This is user-definable
The openlineage_host and openlineage_apiKey values can be set via the environment, as shown above, or defined as variables in uncommitted/config_variables.yml. The openlineage_apiKey value is optional, and is not required by all OpenLineage backends.
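When the values come from the environment, a quick pre-flight check can catch a missing variable before any Validation runs. This is a minimal sketch, not part of Great Expectations or OpenLineage: the helper name `missing_openlineage_vars` is hypothetical, and it assumes the variable names used in the config above, with only OPENLINEAGE_URL treated as required.

```python
import os

# Variable names taken from the config above; the API key is treated as optional
REQUIRED_VARS = ("OPENLINEAGE_URL",)


def missing_openlineage_vars(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example: missing_openlineage_vars() inspects the current process environment
```

This only inspects the environment, so it will report a variable as missing even if the value is supplied through uncommitted/config_variables.yml instead.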
A Great Expectations Checkpoint is recorded as a Job in OpenLineage, and will be named according to the job_name value. Similarly, the openlineage_namespace value can optionally be set. For more information on job naming, consult the Naming section of the OpenLineage spec.
3. Test your Action by Validating a Batch of data.
Trigger your action_list_operator to Validate a Batch of data and emit lineage events to the OpenLineage backend. This can be done in code:
context.run_validation_operator("action_list_operator", assets_to_validate=[batch], run_name="openlineage_test")
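The call above presumes that `context` and `batch` already exist. As a rough sketch of how the pieces might fit together, the helper below loads a Batch and hands it to the operator. It assumes the legacy (V2) Data Context API, where `get_batch` takes batch kwargs and an Expectation Suite name; the function name `validate_and_emit` and the parameters `batch_kwargs` and `suite_name` are hypothetical placeholders for your own values.

```python
def validate_and_emit(context, batch_kwargs, suite_name, run_name="openlineage_test"):
    """Load one Batch and run it through the operator that carries the OpenLineage action."""
    # Assumption: legacy (V2) API, get_batch(batch_kwargs, expectation_suite_name)
    batch = context.get_batch(batch_kwargs, suite_name)
    # The operator receives the assets to validate as a list
    return context.run_validation_operator(
        "action_list_operator",
        assets_to_validate=[batch],
        run_name=run_name,
    )


# Example usage (assumes a configured project in the working directory):
# import great_expectations as ge
# context = ge.data_context.DataContext()
# results = validate_and_emit(context, {"path": "data.csv", "datasource": "my_datasource"}, "my_suite")
```

Taking the context as a parameter keeps the helper easy to exercise with a stand-in object before pointing it at a real project.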
Alternatively, this can be done with a Checkpoint. First, make sure that the validation_operator_name is set in your Checkpoint's YAML file:
module_name: great_expectations.checkpoint
class_name: LegacyCheckpoint
+validation_operator_name: action_list_operator
batches:
- batch_kwargs:
Then, run the Checkpoint:
% great_expectations checkpoint run <checkpoint_name>