Profiler
A Profiler generates Metrics (computed attributes of data, such as the mean of a column) and candidate Expectations (verifiable assertions about data) from data.
A Profiler creates a starting point for quickly generating Expectations.
There are several Profilers included with Great Expectations; conceptually, each Profiler is a checklist of questions which will generate an Expectation Suite when asked of a Batch of data.
Relationship to other objects
A Profiler builds an Expectation Suite from one or more Data Assets. Many Profiler workflows also include a step that Validates the data against the newly generated Expectation Suite (that is, applies the Expectation Suite to a Batch) and returns a Validation Result, which is generated when data is Validated against an Expectation or Expectation Suite.
Use cases
Profilers come into use when it is time to configure Expectations for your project. At this point in your workflow you can configure a new Profiler, or use an existing one, to generate Expectations from a Batch of data (a selection of records from a Data Asset).
For details on how to configure a customized Profiler, see our guide on how to create a new expectation suite using a Profiler.
Profiler types
There are multiple types of Profilers built into Great Expectations. Below is a list with overviews of each one. For more information, you can view their docstrings and source code in the great_expectations\profile folder on our GitHub.
Profiler
The Custom Profiler allows you to directly configure a customized Profiler through a YAML configuration. Profilers allow you to integrate organizational knowledge about your data into the profiling process. For example, a team might have a convention that all columns named "id" are primary keys, whereas all columns ending with the suffix "_id" are foreign keys. In that case, when the team using Great Expectations first encounters a new dataset that follows the convention, a Profiler could use that knowledge to add an expect_column_values_to_be_unique Expectation to the "id" column (but not to, for example, an "address_id" column).
For details on how to configure a customized Profiler, see our guide on how to create a new expectation suite using a Profiler.
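The naming convention described above can be sketched in plain Python. This is only an illustration of the decision a Profiler would encode, not Great Expectations code: the function names `columns_expected_unique` and `candidate_uniqueness_expectations` are hypothetical, and the candidate Expectations are represented as plain dictionaries rather than library objects.

```python
def columns_expected_unique(column_names):
    """Return the columns that the (hypothetical) team convention treats as primary keys."""
    # Only a column named exactly "id" is a primary key under this convention.
    # Columns such as "address_id" are foreign keys, so their values may repeat.
    return [name for name in column_names if name == "id"]


def candidate_uniqueness_expectations(column_names):
    """Build plain-dict candidate Expectations for the primary-key columns."""
    return [
        {
            "expectation_type": "expect_column_values_to_be_unique",
            "kwargs": {"column": name},
        }
        for name in columns_expected_unique(column_names)
    ]


candidates = candidate_uniqueness_expectations(["id", "address_id", "amount"])
print(candidates)
```

Only the "id" column receives a candidate Expectation here; "address_id" is skipped because it merely ends with the "_id" suffix.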
Create
It is unlikely that you will need to create a customized Profiler by extending an existing Profiler with a subclass. Instead, you should work with a Profiler which can be fully configured in a YAML configuration file.
Configuring a custom Profiler is covered in the following section. See also How to create a new expectation suite using a Profiler, or the full source code for that guide on our GitHub as an example.
Configure Profilers
Profilers allow users to provide a highly configurable specification which is composed of Rules to use in order to build an Expectation Suite by profiling existing data.
Imagine you have a table of Sales that comes in every month. You could profile last month's data, inspecting it in order to automatically create a number of expectations that you can use to validate next month's data.
A Rule in a Profiler could say something like "Look at every column in my Sales table, and if that column is numeric, add an expect_column_values_to_be_between Expectation to my Expectation Suite, where the min_value for the Expectation is the minimum value for the column, and the max_value for the Expectation is the maximum value for the column."
Each rule in a Profiler has three types of components:
- DomainBuilders: A DomainBuilder will inspect some data that you provide to the Profiler, and compile a list of Domains for which you would like to build Expectations.
- ParameterBuilders: A ParameterBuilder will inspect some data that you provide to the Profiler, and compile a dictionary of Parameters that you can use when constructing your ExpectationConfigurations.
- ExpectationConfigurationBuilders: An ExpectationConfigurationBuilder will take the Domains compiled by the DomainBuilder, and assemble ExpectationConfigurations using Parameters built by the ParameterBuilder.
In the above example, imagine your table of Sales has twenty columns, of which five are numeric:
- Your DomainBuilder would inspect all twenty columns, and then yield a list of the five numeric columns
- You would specify two ParameterBuilders: one which gets the min of a column, and one which gets the max. Your Profiler would loop over the Domain (or column) list built by the DomainBuilder and use the two ParameterBuilders to get the min and max for each column.
- Then the Profiler loops over the Domains built by the DomainBuilder and uses the ExpectationConfigurationBuilder to add an expect_column_values_to_be_between Expectation for each of these Domains, where the min_value and max_value are the values obtained by the ParameterBuilders.
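The three steps above can be sketched as a small, self-contained Python loop. This is a simplified illustration under stated assumptions, not the library's implementation: the Batch is modeled as a dictionary of column name to values, the functions `build_domains`, `build_parameters`, `build_expectation_configuration`, and `profile` are hypothetical stand-ins for the DomainBuilder, ParameterBuilders, and ExpectationConfigurationBuilder, and the resulting ExpectationConfigurations are plain dictionaries.

```python
from numbers import Number

# Hypothetical stand-in for a Batch of Sales data: column name -> list of values.
sales_batch = {
    "order_id": ["a1", "a2", "a3"],
    "quantity": [2, 5, 3],
    "unit_price": [9.99, 4.50, 12.00],
}


def build_domains(batch):
    """DomainBuilder step: keep only the columns whose values are all numeric."""
    return [
        column
        for column, values in batch.items()
        if values
        and all(isinstance(v, Number) and not isinstance(v, bool) for v in values)
    ]


def build_parameters(batch, column):
    """ParameterBuilder step: compute the min and max Parameters for one Domain."""
    return {
        "my_column_min": min(batch[column]),
        "my_column_max": max(batch[column]),
    }


def build_expectation_configuration(column, parameters):
    """ExpectationConfigurationBuilder step: assemble one configuration per Domain."""
    return {
        "expectation_type": "expect_column_values_to_be_between",
        "kwargs": {
            "column": column,
            "min_value": parameters["my_column_min"],
            "max_value": parameters["my_column_max"],
        },
    }


def profile(batch):
    """Run the Rule: loop over Domains, build Parameters, then build configurations."""
    suite = []
    for column in build_domains(batch):
        parameters = build_parameters(batch, column)
        suite.append(build_expectation_configuration(column, parameters))
    return suite


suite = profile(sales_batch)
for config in suite:
    print(config)
```

In this toy Batch only "quantity" and "unit_price" are numeric, so the sketch yields two configurations and skips "order_id".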
In addition to Rules, a Profiler enables you to specify Variables, which are global and can be used in any of the Rules. For instance, you may want to reference the same BatchRequest or the same tolerance in multiple Rules, and declaring these as Variables will enable you to do so.
Below is an example configuration based on this discussion:
variables:
  my_last_month_sales_batch_request: # We will use this BatchRequest in our DomainBuilder and both of our ParameterBuilders so we can pinpoint the data to Profile
    datasource_name: my_sales_datasource
    data_connector_name: monthly_sales
    data_asset_name: sales_data
    data_connector_query:
      index: -1
  mostly_default: 0.95 # We can set a variable here that we can reference as the `mostly` value for our expectations below
rules:
  my_rule_for_numeric_columns: # This is the name of our Rule
    domain_builder:
      batch_request: $variables.my_last_month_sales_batch_request # We use the BatchRequest that we specified in Variables above using this $ syntax
      class_name: SemanticTypeColumnDomainBuilder # We use this class of DomainBuilder so we can specify the numeric type below
      semantic_types:
        - numeric
    parameter_builders:
      - parameter_name: my_column_min
        class_name: MetricParameterBuilder
        batch_request: $variables.my_last_month_sales_batch_request
        metric_name: column.min # This is the metric we want to get with this ParameterBuilder
        metric_domain_kwargs: $domain.domain_kwargs # This tells us to use the same Domain that is gotten by the DomainBuilder. We could also put a different column name in here to get a metric for that column instead.
      - parameter_name: my_column_max
        class_name: MetricParameterBuilder
        batch_request: $variables.my_last_month_sales_batch_request
        metric_name: column.max
        metric_domain_kwargs: $domain.domain_kwargs
    expectation_configuration_builders:
      - expectation_type: expect_column_values_to_be_between # This is the name of the expectation that we would like to add to our suite
        class_name: DefaultExpectationConfigurationBuilder
        column: $domain.domain_kwargs.column
        min_value: $parameter.my_column_min.value # We can reference the Parameters created by our ParameterBuilders using the same $ notation that we use to get Variables
        max_value: $parameter.my_column_max.value
        mostly: $variables.mostly_default
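To make the $ notation in the configuration above more concrete, here is a minimal Python sketch of how a reference such as $parameter.my_column_min.value could be looked up in nested dictionaries. It is an illustrative assumption about the idea, not the resolution logic Great Expectations actually uses: the function `resolve_reference`, the `namespaces` dictionary, and the sample values (column "quantity", min 2, max 5) are all hypothetical.

```python
def resolve_reference(value, namespaces):
    """Resolve a "$namespace.dotted.path" string against nested dictionaries.

    Values that are not strings starting with "$" are returned unchanged.
    """
    if not (isinstance(value, str) and value.startswith("$")):
        return value
    current = namespaces
    # Drop the leading "$" and walk one dictionary level per dotted segment.
    for key in value[1:].split("."):
        current = current[key]
    return current


# Hypothetical state for one Domain while a Rule is running.
namespaces = {
    "variables": {"mostly_default": 0.95},
    "domain": {"domain_kwargs": {"column": "quantity"}},
    "parameter": {
        "my_column_min": {"value": 2},
        "my_column_max": {"value": 5},
    },
}

# The expectation_configuration_builders entry from the example, as a dictionary.
builder_config = {
    "expectation_type": "expect_column_values_to_be_between",
    "column": "$domain.domain_kwargs.column",
    "min_value": "$parameter.my_column_min.value",
    "max_value": "$parameter.my_column_max.value",
    "mostly": "$variables.mostly_default",
}

resolved = {key: resolve_reference(val, namespaces) for key, val in builder_config.items()}
print(resolved)
```

The same lookup serves all three prefixes ($variables, $domain, and $parameter) because each is simply a top-level key in the hypothetical `namespaces` dictionary.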