Data Assistant
Overview
Definition
A Data Assistant is a utility that asks questions about your data, gathering information to describe what is observed, and then presents Metrics (computed attributes of data, such as the mean of a column) and proposes Expectations (verifiable assertions about data) based on the answers.
Features and promises
Data Assistants allow you to introspect multiple Batches (selections of records from a Data Asset) and create an Expectation Suite (a collection of verifiable assertions about data) from the aggregated Metrics of those Batches. They provide convenient, visual representations of the generated Expectations to assist with identifying outliers in the corresponding parameters. They are convenient to access from your Data Context (the primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components), and provide an excellent starting point for building Expectations or performing initial data exploration.
Relationships to other objects
A Data Assistant implements a pre-configured Rule-Based Profiler (which generates Metrics and candidate Expectations from data) in order to gather Metrics and propose an Expectation Suite based on the introspection of the Batch or Batches contained in a provided Batch Request (which is provided to a Datasource in order to create a Batch).
Use cases
Create Expectations
Data Assistants are an ideal starting point for creating your Expectations. If you are working with data that you are not familiar with, a Data Assistant can give you an overview by introspecting it and generating a series of relevant Expectations using estimated parameters for you to review. If you use the "flag_outliers" value for the estimation parameter, your generated Expectations will have parameters that disregard values that the Data Assistant identifies as outliers. Using the Data Assistant's plot_metrics() method will then give you a graphical representation of the generated Expectations. This will further assist you in spotting outliers in your data when reviewing the Data Assistant's results.
Even when working with data that you are familiar with and know is good, a Data Assistant can use the "exact" value for the estimation parameter to provide comprehensive Expectations that exactly reflect the values found in the provided data.
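The difference between the two estimation values can be illustrated with a small, library-free sketch. This is not how Great Expectations computes its estimates: the helper name estimate_range, the sample row counts, and the use of Tukey fences (1.5 times the interquartile range) to decide what counts as an outlier are all assumptions made for illustration.

```python
from statistics import quantiles


def estimate_range(values, estimation):
    """Return a (low, high) range for an Expectation parameter.

    Illustration only. "exact" keeps every observed value, while
    "flag_outliers" first drops values outside the Tukey fences.
    The fence rule is an assumption, not the library's algorithm.
    """
    if estimation == "exact":
        kept = list(values)
    elif estimation == "flag_outliers":
        q1, _, q3 = quantiles(values, n=4)
        spread = q3 - q1
        low_fence = q1 - 1.5 * spread
        high_fence = q3 + 1.5 * spread
        kept = [v for v in values if low_fence <= v <= high_fence]
    else:
        raise ValueError(f"unknown estimation mode: {estimation!r}")
    return min(kept), max(kept)


# Hypothetical row counts observed across eight Batches; one is unusual.
row_counts = [1000, 1010, 990, 1005, 995, 1002, 998, 5000]

exact_range = estimate_range(row_counts, "exact")
outlier_aware_range = estimate_range(row_counts, "flag_outliers")
```

With "exact", the unusual Batch widens the proposed range to include it. With "flag_outliers", the range reflects only the typical Batches, so the unusual one would fail the resulting Expectation and stand out during review.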
Features
Easy profiling
Data Assistants implement pre-configured Rule-Based Profilers under the hood, but also provide extended functionality. They are easily accessible: you can call them directly from your Data Context. This ensures that they will always provide a quick, simple entry point to creating Expectations and Profiling (the act of generating Metrics and candidate Expectations from data) your data. However, the rules implemented by a Data Assistant are also fully exposed in the parameters for its run(...) method. This means that while you can use a Data Assistant easily out of the box, you can also customize its behavior to take advantage of the domain knowledge possessed by subject-matter experts.
Multi-Batch introspection
Data Assistants leverage the ability to process multiple Batches from a single Batch Request to provide a representative analysis of the provided data. With previous Profilers you would only be able to introspect a single Batch at a time. This meant that the Expectation Suite generated would only reflect a single Batch. If you had many Batches of data that you wanted to build inter-related Expectations for, you would have needed to run each Batch individually and then manually compare and update the Expectation parameters that were generated. With a Data Assistant, that process is automated. You can provide a Data Assistant multiple Batches and get back Expectations that have parameters based on, for instance, the mean or median value of a column on a per-Batch basis.
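As a rough illustration of the idea, the following library-free sketch computes one Metric per Batch and derives an Expectation's parameters from the spread of those per-Batch values. The Batch identifiers, the column name, and the helpers per_batch_metric and propose_between_expectation are hypothetical, and the plain dictionary only imitates the shape of an Expectation configuration. It is not the Data Assistant's actual implementation.

```python
from statistics import mean, median

# Hypothetical data: each Batch holds one month of a numeric column.
batches = {
    "2023-01": [10.0, 12.0, 14.0],
    "2023-02": [11.0, 13.0, 15.0],
    "2023-03": [9.0, 11.0, 13.0],
}


def per_batch_metric(batches, metric):
    """Compute one Metric value for every Batch."""
    return {batch_id: metric(values) for batch_id, values in batches.items()}


def propose_between_expectation(column, metric_name, metric_values):
    """Build a 'between' proposal that covers every per-Batch value."""
    observed = list(metric_values.values())
    return {
        "expectation_type": f"expect_column_{metric_name}_to_be_between",
        "kwargs": {
            "column": column,
            "min_value": min(observed),
            "max_value": max(observed),
        },
    }


batch_means = per_batch_metric(batches, mean)
batch_medians = per_batch_metric(batches, median)

mean_proposal = propose_between_expectation("fare", "mean", batch_means)
median_proposal = propose_between_expectation("fare", "median", batch_medians)
```

Doing this by hand for every column and every Metric is the comparison work that a single-Batch Profiler left to the user.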
Visual plots for Metrics
When working in a Jupyter Notebook you can use the plot_metrics() method of a Data Assistant's result object to generate a visual representation of your Expectations, the values that were assigned to their parameters, and the Metrics that informed those values. This assists in exploratory data analysis and fine-tuning your Expectations, while providing complete transparency into the information used by the Data Assistant to build your Expectations.
API basics
Data Assistants can be easily accessed from your Data Context. In a Jupyter Notebook, you can enter context.assistants. and use code completion to select the Data Assistant you wish to use. All Data Assistants have a run(...) method that takes in a Batch Request and numerous optional parameters, the results of which can be loaded into an Expectation Suite for future use.
The Onboarding Data Assistant is an ideal starting point for working with Data Assistants. It can be accessed from context.assistants.onboarding, or from the CLI (Command Line Interface) command great_expectations suite new --profile.
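A minimal sketch of this workflow might look like the following. The helper name propose_suite is hypothetical, and the get_expectation_suite(expectation_suite_name=...) call on the result object is an assumption about the installed Great Expectations version, so check it against your release. The function takes the Data Context and Batch Request as arguments rather than creating them, since their setup depends on your deployment.

```python
def propose_suite(context, batch_request, suite_name, estimation):
    """Run the Onboarding Data Assistant and return (result, suite).

    `context` is expected to be a Great Expectations Data Context and
    `batch_request` a Batch Request covering one or more Batches.
    `estimation` is passed through unchanged, for example "exact" or
    "flag_outliers".
    """
    result = context.assistants.onboarding.run(
        batch_request=batch_request,
        estimation=estimation,
    )
    # Assumption: the result object can build an Expectation Suite by name.
    suite = result.get_expectation_suite(expectation_suite_name=suite_name)
    return result, suite
```

In a Jupyter Notebook, calling plot_metrics() on the returned result object would then display the Metrics behind the proposed Expectations.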
Configuration
Data Assistants come pre-configured! All you need to provide is a Batch Request, and some optional parameters in the Data Assistant's run(...) method.
More details
Design motivation
Data Assistants were designed to make creating Expectations easier for users of Great Expectations. A Data Assistant will help solve the problem of "where to start" when working with a large, new, or complex dataset by greedily asking questions according to a set theme and then building a list of all the relevant Metrics that it can determine from the answers to those questions. Branching question paths ensure that additional relevant Metrics are gathered on the groundwork of the earlier questions asked. The result is a comprehensive gathering of Metrics that can then be saved, reviewed as graphical plots, or used by the Data Assistant to generate a set of proposed Expectations.
Additional documentation
Data Assistants are multi-batch aware out of the box. However, not every use case requires multiple Batches. For more information on when it is best to work with either a single Batch or multiple Batches of data in a Batch Request, please see the following guide:
To take advantage of the multi-batch awareness of Data Assistants, your Datasources (which provide a standard API for accessing and interacting with data from a wide variety of source systems) need to be configured so that you can acquire multiple Batches in a single Batch Request. For guidance on how to configure your Datasources to be capable of returning multiple Batches, please see the following documentation that matches the Datasource type you are working with:
- How to configure a Pandas Datasource
- How to configure a Spark Datasource
- How to configure a SQL Datasource
For guidance on how to request multiple Batches in a single Batch Request, please see the guide:
For an overview of working with the Onboarding Data Assistant, please see the guide: