Version: 0.17.23

Identify failed rows in an Expectation

Quickly identifying problematic rows can reduce troubleshooting effort when an Expectation fails. After identifying the failed row, you can modify or remove the value in the table and run the Expectation again.

The following table shows the sample data that is used to demonstrate failed row identification. The sample data is webpage visitor events.

event_id	visit_id	date	event_type
0	1470408760	20220104	page_load
1	1470429409	20220104	page_load
2	1470441005	20220104	page_view
3	1470387700	20220104	user_signup
4	1470438716	20220104	purchase
5	1470420524	20220104	download

In the example code you'll run, you'll use a preconfigured Checkpoint named my_checkpoint that runs an Expectation Suite named visitors_exp that contains the Expectation type ExpectColumnValuesToBeInSet on the event_type column. To demonstrate an Expectation failure, you'll set the set parameter to ["page_load", "page_view]so the rows with user_signup, purchase, and download fail.

The example code is located in the primary_keys_in_validation_results GitHub repository.

Prerequisites

A working installation of Great Expectations.

pandas
Spark
SQLAlchemy

pandas

Identify failed rows using pandas.

Import the Great Expectations module and instantiate a Data Context

Use the get_context() method to create a new Data Context:

import great_expectations as gx

context = gx.get_context(project_root_dir=".")

Import the Checkpoint

Run the following code to import the example Checkpoint:

my_checkpoint = context.get_checkpoint("my_checkpoint")

Set the `unexpected_index_column_names` parameter

The failed rows are defined as values in the unexpected_index_column_names parameter. In the following example, you are setting the parameter to event_id to return the event_ids of the rows that fail the Expectation. However, unexpected_index_column_names can also return a list of columns that are representative of the failed rows. The Checkpoint also returns the unexpected_index_query, which you can use to retrieve the full set of failed results.

In the following example, you're setting the result_format to COMPLETE to return the full set of results. For more information about result_format, see Result format.

result_format: dict = {
    "result_format": "COMPLETE",
    "unexpected_index_column_names": ["event_id"],
}

Run Checkpoint using `result_format`

Run the following code to apply the updated result_format to every Expectation in your Suite and pass it directly to the run() method:

results = my_checkpoint.run(result_format=result_format)

Review Validation Results

After you run the Checkpoint, a dictionary for each failed row is returned. Each dictionary contains the row’s identifier and the failed value.

The following is an example of the information returned after running the Checkpoint. For pandas, a filter condition on the DataFrame is included.

{
    "element_count": 6,
    "unexpected_count": 3,
    "unexpected_percent": 50.0,
    "partial_unexpected_list": ["user_signup", "purchase", "download"],
    "unexpected_index_column_names": ["event_id"],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 50.0,
    "unexpected_percent_nonmissing": 50.0,
    "partial_unexpected_index_list": [
        {"event_type": "user_signup", "event_id": 3},
        {"event_type": "purchase", "event_id": 4},
        {"event_type": "download", "event_id": 5},
    ],
    "partial_unexpected_counts": [
        {"value": "download", "count": 1},
        {"value": "purchase", "count": 1},
        {"value": "user_signup", "count": 1},
    ],
    "unexpected_list": ["user_signup", "purchase", "download"],
    "unexpected_index_list": [
        {"event_type": "user_signup", "event_id": 3},
        {"event_type": "purchase", "event_id": 4},
        {"event_type": "download", "event_id": 5},
    ],
    "unexpected_index_query": "df.filter(items=[3, 4, 5], axis=0)",
}

                                    
                                  

Open the Data Docs

Run the following code to open the Data Docs:

context.open_data_docs()

The following image shows the unexpected_index_list values are displayed as part of the validation output. Click To retrieve all unexpected values to open the filter condition to retrieve all unexpected values.

Screenshot of Pandas output

Spark

Identify failed rows using Spark.

Import the Great Expectations module and instantiate a Data Context

Use the get_context() method to create a new Data Context:

import great_expectations as gx

context = gx.get_context(project_root_dir=".")

Import the Checkpoint

Run the following code to import the example Checkpoint:

my_checkpoint = context.get_checkpoint("my_checkpoint")

Set the `unexpected_index_column_names` parameter

In the following example, you're setting the result_format to COMPLETE to return the full set of results. For more information about result_format, see Result format.

result_format: dict = {
    "result_format": "COMPLETE",
    "unexpected_index_column_names": ["event_id"],
}

Run Checkpoint using `result_format`

Run the following code to apply the updated result_format to every Expectation in your Suite and pass it directly to the run() method:

results = my_checkpoint.run(result_format=result_format)

Review Validation Results

After you run the Checkpoint, a dictionary for each failed row is returned. Each dictionary contains the row’s identifier and the failed value.

The following is an example of the information returned after running the Checkpoint. For Spark, a filter condition on the DataFrame is included.

{
    "element_count": 6,
    "unexpected_count": 3,
    "unexpected_percent": 50.0,
    "partial_unexpected_list": ["user_signup", "purchase", "download"],
    "unexpected_index_column_names": ["event_id"],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 50.0,
    "unexpected_percent_nonmissing": 50.0,
    "partial_unexpected_index_list": [
        {"event_id": 3, "event_type": "user_signup"},
        {"event_id": 4, "event_type": "purchase"},
        {"event_id": 5, "event_type": "download"},
    ],
    "partial_unexpected_counts": [
        {"value": "download", "count": 1},
        {"value": "purchase", "count": 1},
        {"value": "user_signup", "count": 1},
    ],
    "unexpected_list": ["user_signup", "purchase", "download"],
    "unexpected_index_list": [
        {"event_id": 3, "event_type": "user_signup"},
        {"event_id": 4, "event_type": "purchase"},
        {"event_id": 5, "event_type": "download"},
    ],
    "unexpected_index_query": "df.filter(F.expr((event_type IS NOT NULL) AND (NOT (event_type IN (page_load, page_view)))))",
}


                                    
                                  

Open the Data Docs

Run the following code to open the Data Docs:

context.open_data_docs()

Screenshot of Spark output

SQLAlchemy

Identify failed rows using SQLAlchemy.

Import the Great Expectations module and instantiate a Data Context

Use the get_context() method to create a new Data Context:

import great_expectations as gx

context = gx.get_context(project_root_dir=".")

Import the Checkpoint

Run the following code to import the example Checkpoint:

my_checkpoint = context.get_checkpoint("my_checkpoint")

Set the `unexpected_index_column_names` parameter

In the following example, you're setting the result_format to COMPLETE to return the full set of results. For more information about result_format, see Result format.

result_format: dict = {
    "result_format": "COMPLETE",
    "unexpected_index_column_names": ["event_id"],
}

Run Checkpoint using `result_format`

Run the following code to apply the updated result_format to every Expectation in your Suite and pass it directly to the run() method:

results = my_checkpoint.run(result_format=result_format)

Review Validation Results

After you run the Checkpoint, a dictionary for each failed row is returned. Each dictionary contains the row’s identifier and the failed value.

The following is an example of the information returned after running the Checkpoint. For SQLAlchemy, the output includes a query that can be used directly against the database backend.

 {
    "element_count": 6,
    "unexpected_count": 3,
    "unexpected_percent": 50.0,
    "partial_unexpected_list": ["user_signup", "purchase", "download"],
    "unexpected_index_column_names": ["event_id"],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 50.0,
    "unexpected_percent_nonmissing": 50.0,
    "partial_unexpected_index_list": [
        {"event_id": 3, "event_type": "user_signup"},
        {"event_id": 4, "event_type": "purchase"},
        {"event_id": 5, "event_type": "download"},
    ],
    "partial_unexpected_counts": [
        {"value": "download", "count": 1},
        {"value": "purchase", "count": 1},
        {"value": "user_signup", "count": 1},
    ],
    "unexpected_list": ["user_signup", "purchase", "download"],
    "unexpected_index_list": [
        {"event_id": 3, "event_type": "user_signup"},
        {"event_id": 4, "event_type": "purchase"},
        {"event_id": 5, "event_type": "download"},
    ],
    "unexpected_index_query": "SELECT event_id, event_type \nFROM event_names \nWHERE event_type IS NOT NULL AND (event_type NOT IN ('page_load', 'page_view'));",
}


                                    
                                  

Open the Data Docs

Run the following code to open the Data Docs:

context.open_data_docs()

The following image shows the unexpected_index_list values are displayed as part of the validation output. Click To retrieve all unexpected values to view the SQL query that you can use to retrieve all unexpected values.

Screenshot of SQL output

Prerequisites​

pandas​

Import the Great Expectations module and instantiate a Data Context​

Import the Checkpoint​

Set the unexpected_index_column_names parameter​

Run Checkpoint using result_format​

Review Validation Results​

Open the Data Docs​

Spark​

Import the Great Expectations module and instantiate a Data Context​

Import the Checkpoint​

Set the unexpected_index_column_names parameter​

Run Checkpoint using result_format​

Review Validation Results​

Open the Data Docs​

SQLAlchemy​

Import the Great Expectations module and instantiate a Data Context​

Import the Checkpoint​

Set the unexpected_index_column_names parameter​

Run Checkpoint using result_format​

Review Validation Results​

Open the Data Docs​

Prerequisites

pandas

Import the Great Expectations module and instantiate a Data Context

Import the Checkpoint

Set the `unexpected_index_column_names` parameter

Run Checkpoint using `result_format`

Review Validation Results

Open the Data Docs

Spark

Import the Great Expectations module and instantiate a Data Context

Import the Checkpoint

Set the `unexpected_index_column_names` parameter

Run Checkpoint using `result_format`

Review Validation Results

Open the Data Docs

SQLAlchemy

Import the Great Expectations module and instantiate a Data Context

Import the Checkpoint

Set the `unexpected_index_column_names` parameter

Run Checkpoint using `result_format`

Review Validation Results

Open the Data Docs