Connect to in-memory source data
Use the information provided here to connect to an in-memory pandas or Spark DataFrame. Great Expectations (GX) uses the term source data when referring to data in its original format, and the term source data system when referring to the storage location for source data.
- pandas
- Spark
pandas
pandas can read many types of data into its DataFrame class, but the following examples use data originating in a parquet file.
Prerequisites
- A Great Expectations instance. See Install Great Expectations with source data system dependencies.
- A Data Context.
- Access to data that can be read into a Pandas DataFrame.
Import the Great Expectations module and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
Run the following Python code to create a Pandas Data Source:
datasource = context.sources.add_pandas(name="my_pandas_datasource")
Read your source data into a Pandas DataFrame
In the following example, a parquet file is read into a Pandas DataFrame that will be used in subsequent code examples.
Run the following Python code to create the Pandas DataFrame:
import pandas as pd
dataframe = pd.read_parquet(
"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-11.parquet"
)
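If the taxi file isn't reachable from your environment, any Pandas DataFrame works just as well. As a minimal stand-in, you can build one in memory; the column names below are made up for illustration and are not part of the taxi schema:

```python
import pandas as pd

# A small stand-in DataFrame (hypothetical columns) that can be used in
# place of the downloaded taxi data when experimenting offline.
dataframe = pd.DataFrame(
    {
        "vendor_id": [1, 2, 1],
        "passenger_count": [1, 3, 2],
        "fare_amount": [7.5, 12.0, 9.25],
    }
)
print(dataframe.shape)  # (3, 3)
```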
Add a Data Asset to the Data Source
The following information is required when you create a Pandas DataFrame Data Asset:

- `name`: The Data Asset name.
- `dataframe`: The Pandas DataFrame containing the source data. The DataFrame you created previously is the value you'll enter for the `dataframe` parameter.
- Run the following Python code to define the `name` parameter and store it as a Python variable:

  name = "taxi_dataframe"

- Run the following Python code to create the Data Asset:

  data_asset = datasource.add_dataframe_asset(name=name)

For `dataframe` Data Assets, the `dataframe` is always specified as the argument of one API method. For example:

my_batch_request = data_asset.build_batch_request(dataframe=dataframe)
Next steps
- How to request Data from a Data Asset
- How to create Expectations while interactively evaluating a set of data
- How to use the Onboarding Data Assistant to evaluate data and create Expectations
Related documentation
For more information on Pandas read methods, see the Pandas Input/Output documentation.
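As a quick sketch of another read method, `read_csv` accepts any file-like object, so an in-memory text buffer can stand in for a file on disk; the sample data here is invented for the example:

```python
import io

import pandas as pd

# read_csv parses CSV text from an in-memory buffer (example data,
# not from the taxi dataset). Any pandas read method produces a
# DataFrame that can back a DataFrame Data Asset.
csv_text = "trip_id,distance\n1,2.5\n2,4.1\n"
dataframe = pd.read_csv(io.StringIO(csv_text))
print(list(dataframe.columns))  # ['trip_id', 'distance']
```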
Spark
Connect to in-memory source data using Spark.
Prerequisites
- A Great Expectations instance. See Install Great Expectations with source data system dependencies.
- A Data Context.
- Access to data that can be read into a Spark DataFrame.
- An active Spark Context
Import the Great Expectations module and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
Run the following Python code to create a Spark Data Source:
datasource = context.sources.add_spark("my_spark_datasource")
Read your source data into a Spark DataFrame
In the following example, you'll create a simple Spark DataFrame that will be used in subsequent code examples.
Run the following Python code to create the Spark DataFrame:
import pandas as pd
from pyspark.sql import SparkSession

# Assumes an active Spark session (see the prerequisites); get or create one.
spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": [100, 200, 300, 400, 500, 600],
        "c": ["one", "two", "three", "four", "five", "six"],
    },
    index=[10, 20, 30, 40, 50, 60],
)
dataframe = spark.createDataFrame(data=df)
Add a Data Asset to the Data Source
The following information is required when you create a Spark DataFrame Data Asset:

- `name`: The Data Asset name.
- `dataframe`: The Spark DataFrame containing the source data. The DataFrame you created previously is the value you'll enter for the `dataframe` parameter.
- Run the following Python code to define the `name` parameter and store it as a Python variable:

  name = "my_df_asset"

- Run the following Python code to create the Data Asset:

  data_asset = datasource.add_dataframe_asset(name=name)

For `dataframe` Data Assets, the `dataframe` is always specified as the argument of one API method. For example:

my_batch_request = data_asset.build_batch_request(dataframe=dataframe)
Next steps
- How to request Data from a Data Asset
- How to create Expectations while interactively evaluating a set of data
- How to use the Onboarding Data Assistant to evaluate data and create Expectations
Related documentation
For more information on Spark read methods, see the Spark Input/Output documentation.