Connect to filesystem source data
Use the information provided here to connect to source data stored on Amazon S3, Google Cloud Storage (GCS), Microsoft Azure Blob Storage, or local filesystems. Great Expectations (GX) uses the term source data when referring to data in its original format, and the term source data system when referring to the storage location for source data.
- Amazon S3
- Microsoft Azure Blob Storage
- Google Cloud Storage
- Filesystem
Amazon S3 source data
Connect to source data on Amazon S3.
- pandas
- Spark
The following examples connect to .csv data. However, GX supports most of the Pandas read methods.
Prerequisites
- An installation of GX set up to work with S3
- Access to data in an S3 bucket
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create an Amazon S3 Data Source:
- name: The Data Source name. In the following examples, this is "my_s3_datasource".
- bucket_name: The Amazon S3 bucket name.
- boto3_options: Optional. Additional options for the Data Source. In the following examples, the default values are used.
Run the following Python code to define name, bucket_name, and boto3_options:

datasource_name = "my_s3_datasource"
bucket_name = "my_bucket"
boto3_options = {}

Additional options for boto3_options
The boto3_options parameter allows you to pass the following information:
- endpoint_url: Specifies an S3 endpoint. You can use an environment variable such as "${S3_ENDPOINT}" to securely include this in your code. The string "${S3_ENDPOINT}" will be replaced with the value of the corresponding environment variable.
- region_name: Your AWS region name.
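For example, a non-default configuration might look like the following sketch. The endpoint and region values are placeholders, and "${S3_ENDPOINT}" assumes you have configured an S3_ENDPOINT environment variable.

# Illustrative non-default boto3_options; replace the values with your own
boto3_options = {
    "endpoint_url": "${S3_ENDPOINT}",  # resolved from the S3_ENDPOINT environment variable
    "region_name": "us-east-1",  # example region name
}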
Run the following Python code to pass name, bucket_name, and boto3_options as parameters when you create your Data Source:

datasource = context.sources.add_pandas_s3(
    name=datasource_name, bucket=bucket_name, boto3_options=boto3_options
)
Add data to the Data Source as a Data Asset
Run the following Python code:
asset_name = "my_taxi_data_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
data_asset = datasource.add_csv_asset(
name=asset_name, batching_regex=batching_regex, s3_prefix=s3_prefix
)
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your S3 bucket has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file. However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
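As an illustration of how those keys are used, the following sketch builds a Batch Request for a single month and loads it into a Validator for a quick preview. It assumes the data_asset created above and the sample 2021-11 file; adjust the option values to match your own files.

# Request only the Batch for November 2021, using the regex group names as options
batch_request = data_asset.build_batch_request(options={"year": "2021", "month": "11"})

# Load the Batch into a Validator and preview the data
validator = context.get_validator(batch_request=batch_request)
print(validator.head())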
Next steps
The following examples connect to .csv data.
Prerequisites
- An installation of GX set up to work with S3
- Access to data in an S3 bucket
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create an Amazon S3 Data Source:
- name: The Data Source name. In the following examples, this is "my_s3_datasource".
- bucket_name: The Amazon S3 bucket name.
- boto3_options: Optional. Additional options for the Data Source. In the following examples, the default values are used.
Run the following Python code to define name, bucket_name, and boto3_options:

datasource_name = "my_s3_datasource"
bucket_name = "my_bucket"
boto3_options = {}

Additional options for boto3_options
The boto3_options parameter allows you to pass the following information:
- endpoint_url: Specifies an S3 endpoint. You can use an environment variable such as "${S3_ENDPOINT}" to securely include this in your code. The string "${S3_ENDPOINT}" will be replaced with the value of the corresponding environment variable.
- region_name: Your AWS region name.
Run the following Python code to pass name, bucket_name, and boto3_options as parameters when you create your Data Source:

datasource = context.sources.add_spark_s3(
    name=datasource_name, bucket=bucket_name, boto3_options=boto3_options
)
Add data to the Data Source as a Data Asset
Run the following Python code:
asset_name = "my_taxi_data_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
data_asset = datasource.add_csv_asset(
name=asset_name,
batching_regex=batching_regex,
s3_prefix=s3_prefix,
header=True,
infer_schema=True,
)
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your S3 bucket has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file. However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
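To confirm which files were matched, you can list the Batches in the Data Asset. This is a minimal sketch assuming the data_asset created above; an empty Batch Request returns every matching Batch.

# Build a Batch Request with no options to retrieve every matching Batch
batch_request = data_asset.build_batch_request()
batches = data_asset.get_batch_list_from_batch_request(batch_request)
print(f"Number of Batches in the Data Asset: {len(batches)}")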
Next steps
Microsoft Azure Blob Storage source data
Connect to source data on Microsoft Azure Blob Storage.
- pandas
- Spark
Use Pandas to connect to data stored in files on Microsoft Azure Blob Storage. The following examples connect to .csv data. However, GX supports most of the Pandas read methods.
Prerequisites
- A working installation of Great Expectations with dependencies for Azure Blob Storage
- Access to data in Azure Blob Storage
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a Microsoft Azure Blob Storage Data Source:
- name: The Data Source name. In the following examples, this is "my_datasource".
- azure_options: Authentication settings.
Run the following Python code to define name and azure_options:

datasource_name = "my_datasource"
azure_options = {
    "account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
    "credential": "${AZURE_CREDENTIAL}",
}
Run the following Python code to pass name and azure_options as parameters when you create your Data Source:

datasource = context.sources.add_pandas_abs(
    name=datasource_name, azure_options=azure_options
)

Where did those values come from?
In the previous example, the values for account_url and credential are substituted with the contents of the AZURE_STORAGE_ACCOUNT_URL and AZURE_CREDENTIAL keys you configured when you installed GX and set up your Azure Blob Storage dependencies.
Add data to the Data Source as a Data Asset
To define the data to connect to, specify the following elements:
- name: A name by which you can reference the Data Asset in the future.
- batching_regex: A regular expression that indicates which files to treat as Batches in your Data Asset and how to identify them.
- abs_container: The name of your Azure Blob Storage container.
- abs_name_starts_with: A string indicating what part of the batching_regex to truncate from the final batch names.
- abs_recursive_file_discovery: Optional. A boolean (True/False) indicating whether files should be searched for recursively in subfolders.
Run the following Python code to define the connection values:

asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
abs_container = "my_container"
abs_name_starts_with = "data/taxi_yellow_tripdata_samples/"
Run the following Python code to create the Data Asset:
data_asset = datasource.add_csv_asset(
name=asset_name,
batching_regex=batching_regex,
abs_container=abs_container,
abs_name_starts_with=abs_name_starts_with,
)
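If your files are spread across subfolders of the container, you can also set the optional abs_recursive_file_discovery parameter described above. The following is a sketch of that variant, reusing the variables from the previous steps; the Data Asset name is hypothetical.

# Variant: discover files recursively in subfolders (optional parameter)
data_asset = datasource.add_csv_asset(
    name="my_recursive_taxi_data_asset",  # hypothetical name for this variant
    batching_regex=batching_regex,
    abs_container=abs_container,
    abs_name_starts_with=abs_name_starts_with,
    abs_recursive_file_discovery=True,
)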
Next steps
Use Spark to connect to data stored in files on Microsoft Azure Blob Storage. The following examples connect to .csv data.
Prerequisites
- A working installation of Great Expectations with dependencies for Azure Blob Storage
- Access to data in Azure Blob Storage
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a Microsoft Azure Blob Storage Data Source:
- name: The Data Source name. In the following examples, this is "my_datasource".
- azure_options: Authentication settings.
Run the following Python code to define name and azure_options:

datasource_name = "my_datasource"
azure_options = {
    "account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
}
Run the following Python code to pass name and azure_options as parameters when you create your Data Source:

datasource = context.sources.add_spark_abs(
    name=datasource_name, azure_options=azure_options
)

Where did that value come from?
In the previous example, the value for account_url is substituted with the contents of the AZURE_STORAGE_ACCOUNT_URL key you configured when you installed GX and set up your Azure Blob Storage dependencies.
Add data to the Data Source as a Data Asset
To define the data to connect to, specify the following elements:
- name: A name by which you can reference the Data Asset in the future.
- batching_regex: A regular expression that indicates which files to treat as Batches in your Data Asset and how to identify them.
- abs_container: The name of your Azure Blob Storage container.
- abs_name_starts_with: A string indicating what part of the batching_regex to truncate from the final batch names.
- abs_recursive_file_discovery: Optional. A boolean (True/False) indicating whether files should be searched for recursively in subfolders.
Run the following Python code to define the connection values:

asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
abs_container = "my_container"
abs_name_starts_with = "data/taxi_yellow_tripdata_samples/"
Run the following Python code to create the Data Asset:
data_asset = datasource.add_csv_asset(
name=asset_name,
batching_regex=batching_regex,
abs_container=abs_container,
header=True,
infer_schema=True,
abs_name_starts_with=abs_name_starts_with,
)
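Because the batching_regex defines year and month groups, you can also order the resulting Batches, as described in How to organize Batches in a file-based Data Asset. The following sketch assumes the add_sorters method available on file-based Data Assets in recent GX versions:

# Sort Batches by the regex group names: ascending year, then ascending month
data_asset.add_sorters(["+year", "+month"])

# An empty Batch Request now returns the Batches in sorted order
batches = data_asset.get_batch_list_from_batch_request(data_asset.build_batch_request())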
Next steps
GCS source data
Connect to source data on GCS.
- pandas
- Spark
The following examples connect to .csv data. However, GX supports most of the Pandas read methods.
Prerequisites
- An installation of GX set up to work with GCS
- Access to data in a GCS bucket
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a GCS Data Source:
- name: The Data Source name. In the following examples, this is "my_gcs_datasource".
- bucket_or_name: The GCS bucket or instance name.
- gcs_options: Optional. Additional options for the Data Source. In the following examples, the default values are used.
Run the following Python code to define name, bucket_or_name, and gcs_options:

datasource_name = "my_gcs_datasource"
bucket_or_name = "my_bucket"
gcs_options = {}
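If you do not rely on Application Default Credentials (for example, the GOOGLE_APPLICATION_CREDENTIALS environment variable), gcs_options can carry credential information instead. The sketch below assumes your GX version accepts a "filename" key pointing at a service-account JSON file; treat the key name and the path as assumptions and check the Related documentation for your installed version.

# Assumed alternative: pass a service-account key file through gcs_options
# (the "filename" key and the path are illustrative and may vary by GX version)
gcs_options = {"filename": "/path/to/service_account_key.json"}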
Run the following Python code to pass name, bucket_or_name, and gcs_options as parameters when you create your Data Source:

datasource = context.sources.add_pandas_gcs(
    name=datasource_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)
Add GCS data to the Data Source as a Data Asset
Run the following Python code:
asset_name = "my_taxi_data_asset"
gcs_prefix = "data/taxi_yellow_tripdata_samples/"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
data_asset = datasource.add_csv_asset(
name=asset_name, batching_regex=batching_regex, gcs_prefix=gcs_prefix
)
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your GCS bucket has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file. However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
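The option keys can also be used partially. For example, the following sketch (using the data_asset created above) requests every Batch from 2021 by supplying only the year key:

# Request all Batches whose file names matched year 2021, regardless of month
batch_request = data_asset.build_batch_request(options={"year": "2021"})
batches = data_asset.get_batch_list_from_batch_request(batch_request)
print(f"Batches found for 2021: {len(batches)}")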
Next steps
- How to organize Batches in a file-based Data Asset
- How to request Data from a Data Asset
- How to create Expectations while interactively evaluating a set of data
- How to use the Onboarding Data Assistant to evaluate data and create Expectations
Related documentation
For more information on Google Cloud and authentication, see the Google Cloud documentation.
Use Spark to connect to source data stored on GCS. The following examples connect to .csv data.
Prerequisites
- An installation of GX set up to work with GCS
- Access to data in a GCS bucket
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a GCS Data Source:
- name: The Data Source name. In the following examples, this is "my_gcs_datasource".
- bucket_or_name: The GCS bucket or instance name.
- gcs_options: Optional. Additional options for the Data Source. In the following examples, the default values are used.
Run the following Python code to define name, bucket_or_name, and gcs_options:

datasource_name = "my_gcs_datasource"
bucket_or_name = "my_bucket"
gcs_options = {}
Run the following Python code to pass name, bucket_or_name, and gcs_options as parameters when you create your Data Source:

datasource = context.sources.add_spark_gcs(
    name=datasource_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)
Add GCS data to the Data Source as a Data Asset
Run the following Python code:
asset_name = "my_taxi_data_asset"
gcs_prefix = "data/taxi_yellow_tripdata_samples/"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
data_asset = datasource.add_csv_asset(
name=asset_name,
batching_regex=batching_regex,
gcs_prefix=gcs_prefix,
header=True,
infer_schema=True,
)
header and infer_schema
In the previous example there are two optional parameters. If the file does not have a header line, the header parameter can be left out, as it defaults to false. If you do not want GX to infer the schema of your file, you can exclude the infer_schema parameter, as it also defaults to false.
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your GCS bucket has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file. However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
Next steps
- How to organize Batches in a file-based Data Asset
- How to request Data from a Data Asset
- How to create Expectations while interactively evaluating a set of data
- How to use the Onboarding Data Assistant to evaluate data and create Expectations
Related documentation
For more information on Google Cloud and authentication, see the Google Cloud documentation.
Filesystem source data
Connect to source data on a filesystem.
- Single file with pandas
- Multiple files with pandas
- Multiple files with Spark
Use Pandas to connect to data stored in files on a filesystem. The following examples connect to .csv data. However, GX supports most of the Pandas read methods.
Prerequisites
- A Great Expectations instance. See Install Great Expectations with source data system dependencies.
- A Data Context.
- Access to source data stored in a filesystem
Import the GX module and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Specify a file to read into a Data Asset
Run the following Python code to read the data in individual files directly into a Validator with Pandas:
validator = context.sources.pandas_default.read_csv(
"https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
)
Using Pandas to connect to different file types
In this example, we are connecting to a .csv file. However, Great Expectations supports connecting to most types of files that Pandas has read_* methods for.

Because you will be using Pandas to connect to these files, the specific add_*_asset methods that will be available to you will be determined by your currently installed version of Pandas.

For more information on which Pandas read_* methods are available to you as add_*_asset methods, please reference the official Pandas Input/Output documentation for the version of Pandas that you have installed.

In the GX Python API, add_*_asset methods will require the same parameters as the corresponding Pandas read_* method, with one caveat: in Great Expectations, you will also be required to provide a value for an asset_name parameter.
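As a hedged illustration of that caveat, the following sketch adds a CSV Data Asset to a pandas Data Source and forwards a pandas read_csv keyword (sep) alongside the GX-specific asset name; the Data Source name, asset name, path, and separator are all placeholders.

# Illustrative only: add_csv_asset mirrors pandas.read_csv parameters, plus an asset name
datasource = context.sources.add_pandas("my_pandas_datasource")  # placeholder name
asset = datasource.add_csv_asset(
    "my_semicolon_delimited_asset",  # required asset name (the GX-specific parameter)
    filepath_or_buffer="<path_or_url_to_csv>",  # placeholder path
    sep=";",  # any pandas read_csv keyword can be forwarded
)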
Create Data Source (Optional)
Modify the following code to connect to your Data Source (a Data Source provides a standard API for accessing and interacting with data from a wide variety of source systems). If you don't have data available for testing, you can use the NYC taxi data. The NYC taxi data is open source, and it is updated every month. An individual record in the data corresponds to one taxi trip.
Do not include sensitive information such as credentials in the configuration when you connect to your Data Source. This information appears as plain text in the database. If you must include credentials or a full connection string, GX recommends using a config variables file.
# Give your Datasource a name
datasource_name = None
datasource = context.sources.add_pandas(datasource_name)
# Give your first Asset a name
asset_name = None
path_to_data = None
# to use sample data uncomment next line
# path_to_data = "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)
# Build batch request
batch_request = asset.build_batch_request()
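From here you can turn the Batch Request into a Validator to confirm the connection works. This is a minimal sketch that assumes you filled in the placeholders above (or uncommented the sample data line):

# Retrieve the Batch described by the Batch Request and preview the first rows
validator = context.get_validator(batch_request=batch_request)
print(validator.head())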
Next steps
Use Pandas to connect to data stored in files on a filesystem. The following examples connect to .csv data. However, GX supports most of the Pandas read methods.
Prerequisites
- A Great Expectations instance. See Install Great Expectations with source data system dependencies.
- A Data Context.
- Access to source data stored in a filesystem
Import the GX module and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a Filesystem Data Source:
- name: The Data Source name.
- base_directory: The path to the folder containing the files the Data Source connects to.
Run the following Python code to define name and base_directory and store the information in the Python variables datasource_name and path_to_folder_containing_csv_files:

datasource_name = "my_new_datasource"
path_to_folder_containing_csv_files = "<insert_path_to_files_here>"

Using relative paths as the base_directory of a Filesystem Data Source
If you are using a Filesystem Data Context you can provide a path for base_directory that is relative to the folder containing your Data Context.

However, an in-memory Ephemeral Data Context doesn't exist in a folder. Therefore, when using an Ephemeral Data Context, relative paths will be determined based on the folder your Python code is being executed in, instead.
Run the following Python code to pass name and base_directory as parameters when you create your Data Source:

datasource = context.sources.add_pandas_filesystem(
    name=datasource_name, base_directory=path_to_folder_containing_csv_files
)

What if my source data files are split into different folders?
You can access files that are nested in folders under your Data Source's base_directory! If your source data files are split into multiple folders, you can use the folder that contains those folders as your base_directory. When you define a Data Asset for your Data Source, you can then include the folder path (relative to your base_directory) in the regular expression that indicates which files to connect to.
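For instance, if your base_directory contained a subfolder (the folder name below is hypothetical), the folder path can be written directly into the batching_regex that you define in the next section:

# Hypothetical layout: <base_directory>/2021_taxi_data/yellow_tripdata_sample_2021-01.csv, ...
batching_regex = r"2021_taxi_data/yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"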
Add a Data Asset to the Data Source
A Data Asset requires the following information to be defined:
- name: The Data Asset name. Helpful when you define multiple Data Assets in the same Data Source.
- batching_regex: A regular expression that matches the files to be included in the Data Asset.

What if the batching_regex matches multiple files?
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become an individual Batch inside your Data Asset.
For example:
Let's say that you have a filesystem Data Source pointing to a base folder that contains the following files:
- "yellow_tripdata_sample_2019-03.csv"
- "yellow_tripdata_sample_2020-07.csv"
- "yellow_tripdata_sample_2021-02.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2019-03.csv" your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset can be organized ("partitioned") into Batches according to the two dimensions defined by the group names, "year" and "month". When you send a Batch Request query featuring this Data Asset in the future, you can use these group names with their respective values as options to control which Batches will be returned. For example, you could return all Batches in the year 2021, or the one Batch for July of 2020.
Run the following Python code to define name and batching_regex and store the information in the Python variables asset_name and batching_regex:

asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"

Run the following Python code to pass name and batching_regex as parameters when you create your Data Asset:

datasource.add_csv_asset(name=asset_name, batching_regex=batching_regex)
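As noted in the callout above, the regex group names become options on a Batch Request. The following is a minimal sketch assuming the Data Asset you just created and source files covering those dates:

# Retrieve the Data Asset that was just created
asset = datasource.get_asset(asset_name)

# The single Batch for July of 2020
batch_request = asset.build_batch_request(options={"year": "2020", "month": "07"})

# Every Batch from 2021 (omit the month option)
batch_request_2021 = asset.build_batch_request(options={"year": "2021"})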
Using Pandas to connect to different file types
In this example, we are connecting to a .csv file. However, Great Expectations supports connecting to most types of files that Pandas has read_* methods for.

Because you will be using Pandas to connect to these files, the specific add_*_asset methods that will be available to you will be determined by your currently installed version of Pandas.

For more information on which Pandas read_* methods are available to you as add_*_asset methods, please reference the official Pandas Input/Output documentation for the version of Pandas that you have installed.

In the GX Python API, add_*_asset methods will require the same parameters as the corresponding Pandas read_* method, with one caveat: in Great Expectations, you will also be required to provide a value for an asset_name parameter.
Add additional files as Data Assets (Optional)
Your Data Source can contain multiple Data Assets. If you have additional files to connect to, you can provide different name and batching_regex parameters to create additional Data Assets for those files in your Data Source. You can even include the same files in multiple Data Assets, if a given file matches the batching_regex of more than one Data Asset.
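For example, if the same base folder also held green taxi files (a hypothetical file set), a second Data Asset could be added alongside the first:

# Hypothetical second Data Asset for files such as green_tripdata_sample_2021-01.csv
datasource.add_csv_asset(
    name="my_green_taxi_data_asset",
    batching_regex=r"green_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv",
)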
Next steps
- How to organize Batches in a file-based Data Asset
- How to request Data from a Data Asset
- How to create Expectations while interactively evaluating a set of data
- How to use the Onboarding Data Assistant to evaluate data and create Expectations
Related documentation
For more information on Pandas read_* methods, see the Pandas Input/Output documentation.
Use Spark to connect to data stored in files on a filesystem. The following examples connect to .csv data.
Prerequisites
- A Great Expectations instance. See Install Great Expectations with source data system dependencies.
- A Data Context.
- Access to source data stored in a filesystem
Import the GX module and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a Filesystem Data Source:
- name: The Data Source name.
- base_directory: The path to the folder containing the files the Data Source connects to.
Run the following Python code to define name and base_directory and store the information in the Python variables datasource_name and path_to_folder_containing_csv_files:

datasource_name = "my_new_datasource"
path_to_folder_containing_csv_files = "<insert_path_to_files_here>"

Using relative paths as the base_directory of a Filesystem Data Source
If you are using a Filesystem Data Context you can provide a path for base_directory that is relative to the folder containing your Data Context.

However, an in-memory Ephemeral Data Context doesn't exist in a folder. Therefore, when using an Ephemeral Data Context, relative paths will be determined based on the folder your Python code is being executed in, instead.
Run the following Python code to pass name and base_directory as parameters when you create your Data Source:

datasource = context.sources.add_spark_filesystem(
    name=datasource_name, base_directory=path_to_folder_containing_csv_files
)

What if my source data files are split into different folders?
You can access files that are nested in folders under your Data Source's base_directory! If your source data files are split into multiple folders, you can use the folder that contains those folders as your base_directory. When you define a Data Asset for your Data Source, you can then include the folder path (relative to your base_directory) in the regular expression that indicates which files to connect to.
Add a Data Asset to the Data Source
A Data Asset requires the following information to be defined:
- name: The Data Asset name. Helpful when you define multiple Data Assets in the same Data Source.
- batching_regex: A regular expression that matches the files to be included in the Data Asset.

What if the batching_regex matches multiple files?
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become an individual Batch inside your Data Asset.
For example:
Let's say that you have a filesystem Data Source pointing to a base folder that contains the following files:
- "yellow_tripdata_sample_2019-03.csv"
- "yellow_tripdata_sample_2020-07.csv"
- "yellow_tripdata_sample_2021-02.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2019-03.csv" your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset can be organized ("partitioned") into Batches according to the two dimensions defined by the group names, "year" and "month". When you send a Batch Request query featuring this Data Asset in the future, you can use these group names with their respective values as options to control which Batches will be returned. For example, you could return all Batches in the year 2021, or the one Batch for July of 2020.
Run the following Python code to define name and batching_regex and store the information in the Python variables asset_name and batching_regex:

asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"

In addition, the argument header informs the Spark DataFrame reader that the files contain a header row, while the argument infer_schema instructs the Spark DataFrame reader to make a best effort to determine the schema of the columns automatically.

Run the following Python code to pass name and batching_regex and the optional header and infer_schema arguments as parameters when you create your Data Asset:

datasource.add_csv_asset(
    name=asset_name, batching_regex=batching_regex, header=True, infer_schema=True
)
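To confirm that the Data Asset reads your files with the header and inferred schema you requested, you can load a Batch into a Validator. A minimal sketch, assuming the asset name defined above:

# Retrieve the Data Asset and preview a Batch of data
asset = datasource.get_asset(asset_name)
validator = context.get_validator(batch_request=asset.build_batch_request())
print(validator.head())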
Add additional files as Data Assets (Optional)
Your Data Source can contain multiple Data Assets. If you have additional files to connect to, you can provide different name and batching_regex parameters to create additional Data Assets for those files in your Data Source. You can even include the same files in multiple Data Assets, if a given file matches the batching_regex of more than one Data Asset.
Next steps
Related documentation
For more information about storing credentials for use with GX, see How to configure credentials.