Version: 0.17.23

Connect to in-memory source data

Use the information provided here to connect to an in-memory pandas or Spark DataFrame. Great Expectations (GX) uses the term source data when referring to data in its original format, and the term source data system when referring to the storage location for source data.

pandas

pandas can read many types of data into its DataFrame class, but the following examples use data originating in a parquet file.

Prerequisites

Import the Great Expectations module and instantiate a Data Context

Run the following Python code to import GX and instantiate a Data Context:

import great_expectations as gx

context = gx.get_context()

Create a Data Source

Run the following Python code to create a Pandas Data Source:

datasource = context.sources.add_pandas(name="my_pandas_datasource")

Read your source data into a Pandas DataFrame

In the following example, a parquet file is read into a Pandas DataFrame that will be used in subsequent code examples.

Run the following Python code to create the Pandas DataFrame:

import pandas as pd

dataframe = pd.read_parquet(
"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-11.parquet"
)
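Before handing a DataFrame to GX, it can help to confirm that the read succeeded and that the data has the shape you expect. The following is a minimal sketch using only standard pandas attributes. The small stand-in DataFrame and its column names are assumptions, made so the example runs without a network connection; in practice you would inspect the dataframe variable created above.

```python
import pandas as pd

# Stand-in for the parquet data (assumption: illustrative columns and values only).
sample = pd.DataFrame(
    {
        "passenger_count": [1, 2, 1],
        "trip_distance": [0.9, 3.4, 12.1],
    }
)

# Basic checks: row and column counts, column types, and a preview of the rows.
n_rows, n_columns = sample.shape
print(f"{n_rows} rows, {n_columns} columns")
print(sample.dtypes)
print(sample.head())
```

The same three checks apply unchanged to the DataFrame read from the parquet file.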

Add a Data Asset to the Data Source

The following information is required when you create a Pandas DataFrame Data Asset:

  • name: The Data Asset name.

  • dataframe: The Pandas DataFrame containing the source data.

The DataFrame you created previously is the value you'll enter for the dataframe parameter.

  1. Run the following Python code to define the name parameter and store it as a Python variable:

    name = "taxi_dataframe"
  2. Run the following Python code to create the Data Asset:

    data_asset = datasource.add_dataframe_asset(name=name)

    For DataFrame Data Assets, the DataFrame is always supplied as an argument to a single API method: build_batch_request. For example:

    my_batch_request = data_asset.build_batch_request(dataframe=dataframe)

Next steps

For more information on Pandas read methods, see the Pandas Input/Output documentation.
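As a brief illustration that the workflow does not depend on parquet, the following sketch reads CSV text with pandas.read_csv. The CSV content and column names are made up for the example. A DataFrame produced by any pandas read method can be passed as the dataframe argument in the same way.

```python
import io

import pandas as pd

# Illustrative CSV content (assumption: not real trip data).
csv_text = "passenger_count,trip_distance\n1,0.9\n2,3.4\n"

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
csv_dataframe = pd.read_csv(io.StringIO(csv_text))
print(csv_dataframe)
```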