Version: 0.17.23

Data Source

A Data Source provides a standard API for accessing and interacting with data from a wide variety of source systems.

Datasources provide a standard API across multiple backends: the Data Source API remains the same for PostgreSQL, CSV Filesystems, and all other supported data backends.

Important:

Datasources do not modify your data.

Relationship to other objects

Datasources work by combining a way of interacting with data (an Execution Engine, a system capable of processing data to compute Metrics) with a definition of the data to access (a Data Asset). Batch Requests, which are provided to a Datasource in order to create a Batch, use a Datasource's Data Assets to return a Batch of data, that is, a selection of records from a Data Asset.
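The relationships described above can be sketched with a toy model. This is not Great Expectations code: every class and method name below (ToyDataAsset, ToyBatchRequest, ToyExecutionEngine, ToyDatasource, get_batch, compute_metric) is hypothetical and exists only to show how an Execution Engine, a Data Asset, and a Batch Request fit together to produce a Batch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyDataAsset:
    """Hypothetical stand-in for a Data Asset: a named collection of records."""
    name: str
    records: list


@dataclass
class ToyBatchRequest:
    """Hypothetical stand-in for a Batch Request: names an asset and an optional slice."""
    asset_name: str
    limit: Optional[int] = None


class ToyExecutionEngine:
    """Hypothetical stand-in for an Execution Engine: computes Metrics over records."""

    def compute_metric(self, records, metric_name):
        if metric_name == "row_count":
            return len(records)
        raise ValueError(f"unknown metric: {metric_name}")


@dataclass
class ToyDatasource:
    """Hypothetical stand-in for a Datasource: an engine plus the assets it can access."""
    name: str
    execution_engine: ToyExecutionEngine
    assets: dict = field(default_factory=dict)

    def add_asset(self, asset):
        self.assets[asset.name] = asset
        return asset

    def get_batch(self, request):
        records = self.assets[request.asset_name].records
        selected = records if request.limit is None else records[: request.limit]
        # Return a new list so that callers cannot alter the asset's own record list.
        return list(selected)


datasource = ToyDatasource(name="toy", execution_engine=ToyExecutionEngine())
datasource.add_asset(
    ToyDataAsset(name="taxi_trips", records=[{"fare": 7.5}, {"fare": 12.0}, {"fare": 3.25}])
)

# A Batch Request goes in, a Batch (a selection of records) comes out.
batch = datasource.get_batch(ToyBatchRequest(asset_name="taxi_trips", limit=2))
row_count = datasource.execution_engine.compute_metric(batch, "row_count")
```

In this sketch the Batch Request only names an asset and a row limit; the real library supports richer ways of slicing a Data Asset.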

Use Cases

When connecting to data, the Data Source is your primary tool. At this stage, you will create Datasources to define how Great Expectations can find and access your Data Assets. A Data Asset is a collection of records within a Datasource, usually named after the underlying data system and sliced to correspond to a desired specification. Under the hood, each Data Source uses an Execution Engine (for example, SQLAlchemy, Pandas, or Spark) to connect to and query data. Once a Data Source is configured, you can work with the Data Source's API rather than needing a different API for each data backend you use.

When creating Expectations (verifiable assertions about data), you'll use your Data Sources to obtain Batches (selections of records from a Data Asset) for analysis and for your Expectation Suites (collections of verifiable assertions about data). This is the case, for example, when you use the interactive workflow to create new Expectations.

When you are validating data, Datasources are also used to obtain the Batches that Validators run against. A Validator is used to run an Expectation Suite against data.

Standard API

Datasources support connecting to a variety of different data backends. No matter which source data system you employ, the Data Source's API will remain the same.
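To illustrate the idea of one API over several backends, here is a minimal sketch using only the Python standard library. It is not how Great Expectations is implemented: ToyCsvBackend, ToySqliteBackend, read_rows, and column_values are hypothetical names, and the sample data is made up. The point is that the calling code (column_values) does not change when the backend does.

```python
import csv
import io
import sqlite3


class ToyCsvBackend:
    """Hypothetical backend that reads rows from CSV text."""

    def __init__(self, csv_text):
        self._csv_text = csv_text

    def read_rows(self):
        return [dict(row) for row in csv.DictReader(io.StringIO(self._csv_text))]


class ToySqliteBackend:
    """Hypothetical backend that reads rows from a SQLite table."""

    def __init__(self, connection, table_name):
        self._connection = connection
        self._table_name = table_name

    def read_rows(self):
        # The table name is interpolated directly; acceptable only in a toy example.
        cursor = self._connection.execute(f"SELECT * FROM {self._table_name}")
        columns = [description[0] for description in cursor.description]
        return [dict(zip(columns, values)) for values in cursor.fetchall()]


def column_values(backend, column):
    """Works with any backend exposing read_rows(); values are normalized to str."""
    return [str(row[column]) for row in backend.read_rows()]


# Set up the same made-up data in two different backends.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE users (id INTEGER, name TEXT)")
connection.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "grace")])

csv_backend = ToyCsvBackend("id,name\n1,ada\n2,grace\n")
sql_backend = ToySqliteBackend(connection, "users")

# The same call is used regardless of where the data lives.
csv_names = column_values(csv_backend, "name")
sql_names = column_values(sql_backend, "name")
```

Both backends only read from their source, which mirrors the guarantee that Datasources do not modify your data.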

No unexpected modifications

Datasources do not modify your data during profiling or validation, but they may create temporary artifacts to optimize the computation of Metrics and Validation results (this behavior can be configured).

Create and access

Datasources can be created and accessed using Python code, which can be executed from a script, a Python console, or a Jupyter Notebook. To access a Data Source, all you need is the name of the Data Source and a Data Context, which is the primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components. The following snippet shows how to create a Pandas Data Source for local files:

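import great_expectations as gx

# Obtain a Data Context, the entry point for the deployment.
context = gx.get_context()

# Register a Pandas Data Source that reads files under ./data.
context.sources.add_pandas_filesystem(
    name="my_pandas_datasource", base_directory="./data"
)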

This next snippet shows how to retrieve the Data Source from the Data Context.

datasource = context.datasources["my_pandas_datasource"]
print(datasource)

For detailed instructions on how to create Datasources that are configured for various backends, see our documentation on Connecting to Data.