Connect to data: Overview
Datasources and Data Assets provide an API for accessing and validating data on source data systems such as SQL-type data sources, local and remote file stores, and in-memory data frames.
Prerequisites
- Completion of the Quickstart guide
Workflow
A Datasource provides a standard API for accessing and interacting with data from a wide variety of source systems.
A Datasource provides an interface for an Execution Engine (a system capable of processing data to compute Metrics) and possible external storage, and it allows Great Expectations to communicate with your source data systems.
To connect to data, you add a new Datasource to your Data Context (the primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components) according to the requirements of your underlying data system. After you've configured your Datasource, you'll use the Datasource API to access and interact with your data, regardless of the original source systems that you use to store data.
Configure your Datasource
Your existing data systems determine how you connect to each Datasource type. To help you with your Datasource implementation, use one of the GX how-to guides for your specific use case and source data systems.
You configure a Datasource with Python and the GX Fluent Datasource API. A typical Datasource configuration appears similar to the following example:
import great_expectations as gx
context = gx.get_context()
context.sources.add_pandas_filesystem(
    name="my_pandas_datasource", base_directory="./data"
)
The name argument is a descriptive name for your Datasource. The add_<datasource> method takes the Datasource-specific arguments that are used to configure it. For example, the add_pandas_filesystem method takes a base_directory argument in the previous example, while the context.sources.add_postgres(name, ...) method takes a connection_string that is used to connect to the database.
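As a rough sketch of the SQL case described above, the following snippet assembles a PostgreSQL connection string and wraps the add_postgres call in a small function. The helper names build_postgres_url and add_postgres_datasource, the placeholder credentials, and the postgresql+psycopg2 driver prefix are assumptions made for illustration; adjust them to your own database and installed driver.

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port, database):
    """Assemble a SQLAlchemy-style PostgreSQL URL (hypothetical helper)."""
    # quote_plus escapes characters such as "@" or "/" in the credentials
    # so that they cannot be mistaken for URL delimiters.
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


def add_postgres_datasource(context, name, connection_string):
    """Register a PostgreSQL Datasource on an existing Data Context."""
    # add_postgres is the Fluent Datasource method named in the text above.
    return context.sources.add_postgres(
        name=name, connection_string=connection_string
    )


# Placeholder credentials, for illustration only.
connection_string = build_postgres_url(
    "gx_user", "p@ss/word", "localhost", 5432, "my_database"
)
print(connection_string)
```

With a Data Context already created, calling add_postgres_datasource(context, "my_postgres_datasource", connection_string) would pass these values to add_postgres. The call is expected to succeed only if the database is reachable with those credentials.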
Call the add_<datasource> method in your context to run configuration checks. For example, it makes sure the base_directory exists for the pandas_filesystem Datasource and the connection_string is valid for a SQL database.
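GX runs these checks itself, but it can be convenient to validate a directory before calling add_pandas_filesystem so that a bad path fails early with a clear message. The following is a minimal sketch of such a pre-check using only the Python standard library. The resolve_base_directory helper is hypothetical, is not part of GX, and does not reproduce GX's own validation.

```python
from pathlib import Path


def resolve_base_directory(base_directory):
    """Return the absolute path of base_directory, or raise if it is unusable.

    Hypothetical helper for illustration; not part of the GX API.
    """
    path = Path(base_directory).expanduser().resolve()
    if not path.is_dir():
        raise FileNotFoundError(f"base_directory is not a directory: {path}")
    return path


# Example: resolve the current working directory to an absolute path.
print(resolve_base_directory("."))
```

The returned path could then be passed as the base_directory argument when you add the Datasource.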
These methods also persist your Datasource to your Data Context. The Data Context type determines where a Datasource is stored and whether it can be reused. For a File Data Context, the changes are persisted to disk; for a Cloud Data Context, the changes are persisted to the cloud; and for an Ephemeral Data Context, the Datasource remains in memory and doesn't persist beyond the current Python session.
View your Datasource configuration
The context.datasources attribute in your Data Context allows you to access your Datasource configuration. For example, the following command returns the Datasource configuration:
datasource = context.datasources["my_pandas_datasource"]
print(datasource)
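To see every configured Datasource rather than a single one, a short loop over the same attribute may be enough. The sketch below assumes that context.datasources behaves like a dictionary keyed by Datasource name, as the lookup above suggests. The summarize_datasources helper is hypothetical and takes the Data Context as an argument.

```python
def summarize_datasources(context):
    """Return (name, class name) pairs for each Datasource in the context.

    Hypothetical helper; assumes context.datasources is dict-like.
    """
    return [
        (name, type(context.datasources[name]).__name__)
        for name in sorted(context.datasources)
    ]
```

Calling summarize_datasources(context) and printing each pair gives a quick inventory of the Datasources your Data Context knows about, sorted by name.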