Tutorial, Step 2: Connect to data
Prerequisites:
- Completed Step 1: Setup of this tutorial.
In Step 1: Setup, we created a Data Context, the primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components. Now that you have that Data Context, you'll want to connect to your actual data. In Great Expectations, Datasources provide a standard API for accessing and interacting with data from a wide variety of source systems. They simplify these connections by managing and providing a consistent, cross-platform API for referencing data.
Create a Datasource with the CLI
Let's create and configure your first Datasource: a connection to the data directory we've provided in the repo. This could also be a database connection, but because our tutorial data consists of .CSV files, we're just using a simple file store.
Start by using the CLI (Command Line Interface) to run the following command from your ge_tutorials directory:
great_expectations datasource new
You will then be presented with a choice that looks like this:
What data would you like Great Expectations to connect to?
    1. Files on a filesystem (for processing with Pandas or Spark)
    2. Relational database (SQL)
:1
The only difference is that our example shows a "1" after the colon, whereas you haven't typed anything in answer to the prompt yet.
As we've noted before, we're working with .CSV files, so you'll want to answer with 1 and hit enter.
The next prompt you see will look like this:
What are you processing your files with?
    1. Pandas
    2. PySpark
:1
For this tutorial we will use Pandas to process our files, so again answer with 1 and press enter to continue.
When you select 1. Pandas from the above list, you are specifying your Datasource's Execution Engine (a system capable of processing data to compute Metrics). Although the tutorial uses Pandas, Spark and SqlAlchemy are also supported as Execution Engines.
We're almost done with the CLI! You'll be prompted once more, this time for the path of the directory where the data files are located. The prompt will look like:
Enter the path of the root directory where the data files are stored. If files are on local disk
enter a path relative to your current working directory or an absolute path.
:data
The data that this tutorial uses is stored in ge_tutorials/data. Since we are working from the ge_tutorials directory, you only need to enter data and hit return to continue.
This will now open up a new Jupyter Notebook to complete the Datasource configuration. Your console will display a series of messages as the Jupyter Notebook is loaded, but you can disregard them. The rest of the Datasource setup takes place in the Jupyter Notebook and we won't return to the terminal until that is done.
The datasource new notebook
The Jupyter Notebook contains some boilerplate code to configure your new Datasource. You can run the entire notebook as-is, but we recommend changing at least the Datasource name to something more specific.
Edit the second code cell as follows:
datasource_name = "getting_started_datasource"
Then execute all cells in the notebook in order to save the new Datasource. If successful, the last cell will print a list of all Datasources, including the one you just created.
Before continuing, let’s stop and unpack what just happened.
Configuring Datasources
When you completed those last few steps, you told Great Expectations that:
- You want to create a new Datasource called getting_started_datasource (or whatever custom name you chose above).
- You want to use Pandas to read the data from CSV.
Based on that information, the CLI added the following entry into your great_expectations.yml file, under the datasources header:
name: getting_started_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: PandasExecutionEngine
data_connectors:
    default_runtime_data_connector_name:
        class_name: RuntimeDataConnector
        batch_identifiers:
            - default_identifier_name
    default_inferred_data_connector_name:
        class_name: InferredAssetFilesystemDataConnector
        base_directory: ../data/
        default_regex:
          group_names:
            - data_asset_name
          pattern: (.*)
Please note that due to how data is serialized, the entry in your great_expectations.yml file may not have these key/value pairs in the same order as the above example. However, they will all have been added.
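You may have noticed that you typed data at the prompt, while the configuration shows base_directory: ../data/. The sketch below illustrates one way to read that difference. It assumes that base_directory is interpreted relative to the great_expectations directory that holds great_expectations.yml, rather than relative to the directory you ran the CLI from. The function name and the /home/user location are hypothetical and are not part of Great Expectations.

```python
import posixpath


def relative_base_directory(entered_path: str, working_dir: str, context_root: str) -> str:
    """Re-express a path typed relative to working_dir as a path relative to context_root.

    Hypothetical helper for illustration only; it is not a Great Expectations API.
    """
    # Resolve what the user typed against the directory the CLI was run from.
    absolute = posixpath.normpath(posixpath.join(working_dir, entered_path))
    # Express the same location relative to the directory holding the config file.
    return posixpath.relpath(absolute, start=context_root)


# Hypothetical location of the tutorial repo on disk.
working_dir = "/home/user/ge_tutorials"
# Assumption: the config file lives in ge_tutorials/great_expectations.
context_root = posixpath.join(working_dir, "great_expectations")

base_directory = relative_base_directory("data", working_dir, context_root)
print(base_directory)
```

Under these assumptions the result points one level up from the configuration directory and then into data, which is the same location as the ../data/ entry above (apart from the trailing slash).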
                        
What does the configuration contain?
- ExecutionEngine: The Execution Engine (a system capable of processing data to compute Metrics) provides backend-specific computing resources that are used to read in and perform validation on data. For more information on ExecutionEngines, please refer to the Core Concepts document on ExecutionEngines.
- DataConnectors: Data Connectors (which provide the configuration details, based on the source data system, that a Datasource needs to define Data Assets) facilitate access to external data stores, such as filesystems, databases, and cloud storage. The current configuration contains both an InferredAssetFilesystemDataConnector, which allows you to retrieve a batch of data by naming a data asset (which is the filename in our case), and a RuntimeDataConnector, which allows you to retrieve a batch of data by defining a filepath. In this tutorial we will only be using the InferredAssetFilesystemDataConnector. For more information on DataConnectors, please refer to the Data Connectors documentation.
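To see why the data asset name is the filename in our case, it can help to look at the default_regex entry from the configuration: a pattern of (.*) with a single group name, data_asset_name. The following is a minimal sketch of that idea using only Python's re module. It is not how Great Expectations implements the connector; the identify function and the file names are hypothetical and chosen only for illustration.

```python
import re
from typing import Dict, Optional

# Values copied from the default_regex section of the configuration above.
PATTERN = re.compile(r"(.*)")
GROUP_NAMES = ["data_asset_name"]


def identify(relative_path: str) -> Optional[Dict[str, str]]:
    """Pair each capture group of PATTERN with the group name at the same position.

    Hypothetical helper for illustration only; it is not a Great Expectations API.
    """
    match = PATTERN.match(relative_path)
    if match is None:
        return None
    return dict(zip(GROUP_NAMES, match.groups()))


# Hypothetical file names standing in for the contents of the data directory.
files = ["trips_2019-01.csv", "trips_2019-02.csv"]
assets = {name: identify(name) for name in files}

for name, groups in assets.items():
    print(name, groups)
```

Because (.*) captures the whole string, each file in this sketch ends up as its own asset named after the full filename, extension included. A narrower pattern with more groups would split the filename into several named parts instead.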
                                
This Datasource does not require any credentials. However, if you were to connect to a database that requires connection credentials, those would be stored in great_expectations/uncommitted/config_variables.yml.
In the future, you can modify or delete your configuration by editing your great_expectations.yml and config_variables.yml files directly.
For now, let’s move on to Step 3: Create Expectations.