Profiler
Overview
Definition
A Profiler generates Metrics (a computed attribute of data, such as the mean of a column) and candidate Expectations (verifiable assertions about data) from data.
Features and promises
A Profiler creates a starting point for quickly generating Expectations. For example, during the Getting Started Tutorial, Great Expectations uses the UserConfigurableProfiler to demonstrate important features of Expectations by creating and validating an Expectation Suite (a collection of verifiable assertions about data) that has several kinds of Expectations built from a small sample of data.
There are several Profilers included with Great Expectations; conceptually, each Profiler is a checklist of questions which will generate an Expectation Suite when asked of a Batch of data.
Relationship to other objects
A Profiler builds an Expectation Suite from one or more Data Assets. Many Profiler workflows will also include a step that Validates the data against the newly generated Expectation Suite to return a Validation Result (generated when data is Validated against an Expectation or Expectation Suite).
Use cases
Create Expectations
Profilers come into use when it is time to configure Expectations for your project. At this point in your workflow you can configure a new Profiler, or use an existing one to generate Expectations from a Batch (a selection of records from a Data Asset) of data.
For details on how to configure a customized Rule-Based Profiler, see our guide on how to create a new expectation suite using Rule-Based Profilers.
For instructions on how to use a UserConfigurableProfiler to generate Expectations from data, see our guide on how to create and edit Expectations with a Profiler.
Features
Multiple types of Profilers available
There are multiple types of Profilers built into Great Expectations. Below is a list with overviews of each one. For more information, you can view their docstrings and source code in the great_expectations/profile folder on our GitHub.
UserConfigurableProfiler
The UserConfigurableProfiler is used to build an Expectation Suite from a dataset. The Expectations built are strict: they can be used to determine whether two tables are the same. When these Profilers are instantiated they can be configured by providing one or more input configuration parameters, allowing you to rapidly create a Profiler without needing to edit configuration files. However, if you need to change these parameters you will also need to instantiate a new UserConfigurableProfiler using the updated parameters.
For instructions on how to use a UserConfigurableProfiler to generate Expectations from data, see our guide on how to create and edit Expectations with a Profiler.
JsonSchemaProfiler
The JsonSchemaProfiler creates Expectation Suites from JSONSchema artifacts. Basic suites can be created from these specifications.
- There is not yet a notion of nested data types in Great Expectations, so suites generated by a JsonSchemaProfiler use column map expectations.
- A JsonSchemaProfiler does not traverse nested schemas and requires a top level object of type object.
For an example of how to use the JsonSchemaProfiler, see our guide on how to create a new Expectation Suite by profiling from a JsonSchema file.
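As a rough sketch of that workflow (the schema file name and suite name below are placeholders, and an existing Data Context named context is assumed for saving the result), the Profiler is instantiated with no arguments and passed a parsed schema:
import json
from great_expectations.profile.json_schema_profiler import JsonSchemaProfiler

# Placeholder path to a JSONSchema file with a top level object of type "object"
with open("my_schema.json") as f:
    schema = json.load(f)

profiler = JsonSchemaProfiler()
suite = profiler.profile(schema, "my_jsonschema_suite")  # returns an Expectation Suite
context.save_expectation_suite(suite)  # assumes an existing Data Context named `context`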
Rule-Based Profiler
Rule-Based Profilers are a newer implementation of Profiler that allows you to directly configure the Profiler through a YAML configuration. Rule-Based Profilers allow you to integrate organizational knowledge about your data into the profiling process. For example, a team might have a convention that all columns named "id" are primary keys, whereas all columns ending with the suffix "_id" are foreign keys. In that case, when the team using Great Expectations first encounters a new dataset that follows the convention, a Profiler could use that knowledge to add an expect_column_values_to_be_unique Expectation to the "id" column (but not, for example, an "address_id" column).
For details on how to configure a customized Rule-Based Profiler, see our guide on how to create a new expectation suite using Rule-Based Profilers.
API basics
How to access
The recommended workflow for Profilers is to use the UserConfigurableProfiler. Doing so can be as simple as importing it and instantiating a copy by passing a Validator (used to run an Expectation Suite against data) to the class, like so:
from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler
profiler = UserConfigurableProfiler(profile_dataset=validator)
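From there, a minimal sketch of finishing the workflow (assuming an existing Data Context named context) is to build the suite and save it:
suite = profiler.build_suite()  # generates the Expectation Suite from the profiled data
context.save_expectation_suite(suite)  # assumes an existing Data Context named `context`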
There are additional parameters that can be passed to a UserConfigurableProfiler, all of which are either optional or have a default value. These consist of the following (a usage sketch follows the list):
- excluded_expectations: A list of Expectations to not include in the suite.
- ignored_columns: A list of columns for which you would like to NOT create Expectations.
- not_null_only: Boolean, default False. By default, each column is evaluated for nullity. If the column values contain fewer than 50% null values, then the Profiler will add expect_column_values_to_not_be_null; if greater than 50% it will add expect_column_values_to_be_null. If not_null_only is set to True, the Profiler will add a not_null Expectation irrespective of the percent nullity (and therefore will not add an expect_column_values_to_be_null).
- primary_or_compound_key: A list containing one or more columns which are a dataset's primary or compound key. This will create an expect_column_values_to_be_unique or expect_compound_columns_to_be_unique Expectation. This will occur even if one or more of the primary_or_compound_key columns are specified in ignored_columns.
- semantic_types_dict: A dictionary where the keys are available semantic_types (see profiler.base.ProfilerSemanticTypes) and the values are lists of columns for which you would like to create semantic_type specific Expectations, e.g.: "semantic_types": { "value_set": ["state", "country"], "numeric": ["age", "amount_due"] }.
- table_expectations_only: Boolean, default False. If True, this will only create the two table-level Expectations available to this Profiler (expect_table_columns_to_match_ordered_list and expect_table_row_count_to_be_between). If a primary_or_compound_key is specified, it will create a uniqueness Expectation for that column as well.
- value_set_threshold: Takes a string from the following ordered list: "none", "one", "two", "very_few", "few", "many", "very_many", "unique". When the Profiler runs without a semantic_types dict, each column is profiled for cardinality. This threshold determines the greatest cardinality for which to add expect_column_values_to_be_in_set. For example, if value_set_threshold is set to "unique", it will add a value_set Expectation for every included column. If set to "few", it will add a value_set Expectation for columns whose cardinality is one of "one", "two", "very_few" or "few". The default value is "many". For the purposes of comparing whether two tables are identical, it might make the most sense to set this to "unique".
How to create
It is unlikely that you will need to create a custom Profiler by extending an existing Profiler with a subclass. Instead, you should work with a Rule-Based Profiler which can be fully configured in a YAML configuration file.
Configuring a custom Rule-Based Profiler is covered in more detail in the Configuration section below. You can also read our guide on how to create a new expectation suite using Rule-Based Profilers to be walked through the process, or view the full source code for that guide on our GitHub as an example.
Configuration
Rule-Based Profilers
Rule-Based Profilers allow users to provide a highly configurable specification which is composed of Rules to use in order to build an Expectation Suite by profiling existing data.
Imagine you have a table of Sales that comes in every month. You could profile last month's data, inspecting it in order to automatically create a number of expectations that you can use to validate next month's data.
A Rule in a Rule-Based Profiler could
say something like "Look at every column in my
Sales table, and if that column is numeric, add an
expect_column_values_to_be_between
Expectation to my Expectation Suite, where the
min_value
for the Expectation is the
minimum value for the column, and the
max_value
for the Expectation is the
maximum value for the column."
Each rule in a Rule-Based Profiler has three types of components:
- DomainBuilders: A DomainBuilder will inspect some data that you provide to the Profiler, and compile a list of Domains for which you would like to build expectations
- ParameterBuilders: A ParameterBuilder will inspect some data that you provide to the Profiler, and compile a dictionary of Parameters that you can use when constructing your ExpectationConfigurations
- ExpectationConfigurationBuilders: An ExpectationConfigurationBuilder will take the Domains compiled by the DomainBuilder, and assemble ExpectationConfigurations using Parameters built by the ParameterBuilder
In the above example, imagine your table of Sales has twenty columns, of which five are numeric:
- Your DomainBuilder would inspect all twenty columns, and then yield a list of the five numeric columns
- You would specify two ParameterBuilders: one which gets the min of a column, and one which gets the max. Your Profiler would loop over the Domain (or column) list built by the DomainBuilder and use the two ParameterBuilders to get the min and max for each column.
- Then the Profiler loops over Domains built by the DomainBuilder and uses the ExpectationConfigurationBuilders to add an expect_column_values_to_be_between Expectation for each of these Domains, where the min_value and max_value are the values that we got in the ParameterBuilders.
In addition to Rules, a Rule-Based Profiler enables you to specify Variables, which are global and can be used in any of the Rules. For instance, you may want to reference the same BatchRequest or the same tolerance in multiple Rules, and declaring these as Variables will enable you to do so.
Below is an example configuration based on this discussion:
variables:
my_last_month_sales_batch_request: # We will use this BatchRequest in our DomainBuilder and both of our ParameterBuilders so we can pinpoint the data to Profile
datasource_name: my_sales_datasource
data_connector_name: monthly_sales
data_asset_name: sales_data
data_connector_query:
index: -1
mostly_default: 0.95 # We can set a variable here that we can reference as the `mostly` value for our expectations below
rules:
my_rule_for_numeric_columns: # This is the name of our Rule
domain_builder:
batch_request: $variables.my_last_month_sales_batch_request # We use the BatchRequest that we specified in Variables above using this $ syntax
class_name: SemanticTypeColumnDomainBuilder # We use this class of DomainBuilder so we can specify the numeric type below
semantic_types:
- numeric
parameter_builders:
- parameter_name: my_column_min
class_name: MetricParameterBuilder
batch_request: $variables.my_last_month_sales_batch_request
metric_name: column.min # This is the metric we want to get with this ParameterBuilder
metric_domain_kwargs: $domain.domain_kwargs # This tells us to use the same Domain that is gotten by the DomainBuilder. We could also put a different column name in here to get a metric for that column instead.
- parameter_name: my_column_max
class_name: MetricParameterBuilder
batch_request: $variables.my_last_month_sales_batch_request
metric_name: column.max
metric_domain_kwargs: $domain.domain_kwargs
expectation_configuration_builders:
- expectation_type: expect_column_values_to_be_between # This is the name of the expectation that we would like to add to our suite
class_name: DefaultExpectationConfigurationBuilder
column: $domain.domain_kwargs.column
min_value: $parameter.my_column_min.value # We can reference the Parameters created by our ParameterBuilders using the same $ notation that we use to get Variables
max_value: $parameter.my_column_max.value
mostly: $variables.mostly_default