How to connect to a SQL database
In this guide we will demonstrate how to connect Great Expectations to SQL databases. GX uses SQLAlchemy to connect to SQL data, and therefore supports most SQL dialects that SQLAlchemy does. For more information on the SQL dialects supported by SQLAlchemy, please see SQLAlchemy's official documentation on dialects.
Prerequisites
- An installation of GX set up to work with SQL
- Source data stored in a SQL database
Steps
1. Import GX and instantiate a Data Context
The code to import Great Expectations and instantiate a Data Context is:
import great_expectations as gx
context = gx.get_context()
2. Determine your connection string
GX supports a variety of different SQL source data systems. However, most SQL dialects have their own specifications for how to define a connection string. You should reference the corresponding dialect's official documentation to determine the connection string for your SQL Database.
The following are examples of connection strings for different SQL dialects:
-
AWS Athena:
awsathena+rest://@athena.<REGION>.amazonaws.com/<DATABASE>?s3_staging_dir=<S3_PATH>
-
BigQuery:
bigquery://<GCP_PROJECT>/<BIGQUERY_DATASET>?credentials_path=/path/to/your/credentials.json
-
MSSQL:
mssql+pyodbc://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>?driver=<DRIVER>&charset=utf&autocommit=true
-
MySQL:
mysql+pymysql://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>
-
PostGreSQL:
postgresql+psycopg2://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>
-
Redshift:
postgresql+psycopg2://<USER_NAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>?sslmode=<SSLMODE>
-
Snowflake]:
snowflake://<USER_NAME>:<PASSWORD>@<ACCOUNT_NAME>/<DATABASE_NAME>/<SCHEMA_NAME>?warehouse=<WAREHOUSE_NAME>&role=<ROLE_NAME>&application=great_expectations_oss
-
SQLite:
sqlite:///<PATH_TO_DB_FILE>
-
Trino:
trino://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<CATALOG>/<SCHEMA>
For purposes of this guide's examples, we will
connect to a PostGreSQL database. Here is an example
of our connection string, stored in the variable
connection_string
with plain text
credentials:
connection_string = "postgresql+psycopg2://username:my_password@localhost/test"
You can use either environment variables or a key
in config_variables.yml
to safely
store any passwords needed by your connection
string. After defining your password in one of
those ways, you can reference it in your
connection string like this:
connection_string = (
"postgresql+psycopg2://<username>:${MY_PASSWORD}@<host>:<port>/<database>"
)
In the above example
MY_PASSWORD
would be the name of the
environment variable or the key to the value in
config_variables.yml
that corresponds
to your password.
If you include a password as plain text in your
connection string when you define your Datasource,
GX will automatically strip it out, add it to
config_variables.yml
and substitute
it in the Datasource's saved configuration
with a variable as was shown above.
3. Create a SQL Datasource
Creating a SQL Datasource is as simple as providing
the add_sql(...)
method a
name
by which to reference it in the
future and the connection_string
with
which to access it.
datasource = context.sources.add_sql(
name="version-0.16.16 my_datasource", connection_string=connection_string
)
Next steps
Now that you have connected to a SQL database, next you will want to: