Technical Requirements
Overview
This article outlines the current technical requirements and limitations for Matillion Data Loader.
Below are the technical requirements for the Batch destinations (Snowflake, Amazon Redshift, Google BigQuery, and Delta Lake on Databricks) and the CDC destinations (Amazon S3, Azure Blob Storage, and Google Cloud Storage).
Batch
Snowflake (AWS or Azure)
- Matillion Data Loader doesn't support SSH tunneling or PrivateLink, so you must either have a publicly accessible Snowflake account or set up a publicly accessible SSH host that can forward traffic to the private cloud that has the PrivateLink connection to Snowflake. We recommend using a publicly accessible Snowflake account where possible.
- A Snowflake username and password for the Snowflake instance used during testing (a minimal connectivity check is sketched after this list).
- Authentication for any third-party data sources. On configuring the data source, you will be prompted to grant Matillion Data Loader access to your data source and you're free to choose which account you use during that authorization. This could be:
- Usernames/passwords for JDBC-accessible databases.
- OAuth for most others.
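The following is a minimal sketch of how you might confirm that the Snowflake account is publicly reachable and that the username and password work, using the snowflake-connector-python package. The account identifier, user, password, and warehouse values are placeholders for your own.

```python
# Minimal sketch: verify the Snowflake account is reachable and the
# username/password pair works. All values below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="MY_ACCOUNT",      # placeholder, e.g. xy12345.eu-west-1
    user="MY_USER",            # placeholder
    password="MY_PASSWORD",    # placeholder
    warehouse="MY_WAREHOUSE",  # placeholder
)
try:
    cur = conn.cursor()
    cur.execute("SELECT CURRENT_ACCOUNT(), CURRENT_USER(), CURRENT_VERSION()")
    print(cur.fetchone())
finally:
    conn.close()
```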
Amazon Redshift
- A shared job won't extract columns that contain uppercase letters. You may see an error message regarding NULL values, or the job may complete but with NULL values in place of the data that wasn't extracted. To resolve this, set the Redshift parameter enable_case_sensitive_identifier to True, either by altering the user or by updating the Redshift parameter group (a sketch of the user-level change follows this list).
- Matillion Data Loader doesn't support SSH tunneling or PrivateLink, so you must either have a publicly accessible Redshift cluster or set up a publicly accessible SSH host that can forward traffic to the private cloud that Redshift is running inside.
- We recommend using a separate Amazon Redshift cluster (a single dc2.large node should be sufficient) for testing purposes; Matillion will reimburse reasonable charges incurred on submission of an AWS bill detailing the cluster used. Please don't test on 50 8XL nodes.
- An AWS access key and secret key relating to an existing IAM role that can read/write to S3. S3 is used as a staging area and, although no objects will be left behind permanently, we need to read/write S3 objects temporarily during processing.
- An Amazon Redshift username and password for the Redshift instance used during testing.
- Authentication for any third-party data sources. On configuring the data source, you will be prompted to grant Matillion access to your data source and you are free to choose which account you use during that authorization. This could be:
- Usernames/passwords for JDBC-accessible databases.
- OAuth for most others.
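As an illustration, the user-level change is a single ALTER USER statement; the sketch below issues it through the open-source redshift_connector driver, though any SQL client works, and updating the parameter group is an equally valid approach. The host, database, credentials, and the matillion_user name are placeholders.

```python
# Minimal sketch: set enable_case_sensitive_identifier for the Redshift
# user that Matillion Data Loader connects as. All connection details and
# user names below are placeholders.
import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.abc123xyz.eu-west-1.redshift.amazonaws.com",  # placeholder
    database="dev",          # placeholder
    user="admin_user",       # placeholder: a user allowed to run ALTER USER
    password="MY_PASSWORD",  # placeholder
)
conn.autocommit = True
cur = conn.cursor()
# Takes effect for new sessions opened by the altered user.
cur.execute("ALTER USER matillion_user SET enable_case_sensitive_identifier TO true")
conn.close()
```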
Google BigQuery
- Matillion Data Loader requires a Google Service Account that is configured with access to Google BigQuery and Google Cloud Storage. Google Cloud Storage (GCS) is used as a staging area and, although no objects will be left behind permanently, we need to read/write GCS objects temporarily during processing (a minimal check of both permissions is sketched after this list).
- Authentication for any third-party data sources. On configuring the data source, you will be prompted to grant Matillion access to your data source and you are free to choose which account you use during that authorization. This could be:
- Usernames/passwords for JDBC-accessible databases.
- OAuth for most others.
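The sketch below is one way to confirm a service account key can both run a BigQuery query and write to a GCS staging bucket, using the google-cloud-bigquery and google-cloud-storage client libraries. The key file path, project ID, and bucket name are placeholders.

```python
# Minimal sketch: confirm a service account can use BigQuery and GCS.
# The key file, project ID, and bucket name are placeholders.
from google.cloud import bigquery, storage
from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file(
    "/path/to/service-account-key.json"  # placeholder
)

# BigQuery: run a trivial query.
bq = bigquery.Client(project="my-project", credentials=creds)  # placeholder project
print(list(bq.query("SELECT 1 AS ok").result()))

# GCS: write and remove a temporary staging object.
gcs = storage.Client(project="my-project", credentials=creds)
blob = gcs.bucket("my-staging-bucket").blob("mdl_staging_check.txt")  # placeholder bucket
blob.upload_from_string("staging check")
blob.delete()
```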
Delta Lake on Databricks
- Matillion Data Loader doesn't support SSH tunneling or PrivateLink, so you must either have a publicly accessible Databricks account or set up a publicly accessible SSH host that can forward traffic to the VPC that Databricks is running inside. We recommend using a publicly accessible Databricks account where possible (a simple reachability check is sketched after this list).
- Databricks username and password for the instance used during testing.
- Authentication for any third-party data sources. On configuring the data source, you will be prompted to grant Matillion Data Loader access to your data source and you're free to choose which account you use during that authorization. This could be:
- Usernames/passwords for JDBC-accessible databases.
- OAuth for most others.
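Because the workspace must be reachable directly, one quick way to check public reachability is a TLS connection test to the workspace hostname on port 443, as in the sketch below. The hostname is a placeholder for your own workspace URL.

```python
# Minimal sketch: confirm the Databricks workspace hostname is publicly
# reachable on port 443. The hostname below is a placeholder.
import socket
import ssl

host = "adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace host

ctx = ssl.create_default_context()
with socket.create_connection((host, 443), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=host) as tls:
        print(f"Connected to {host}, TLS version: {tls.version()}")
```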
CDC
Amazon S3
- An Amazon Web Services (AWS) account. Signing up is free; go to https://aws.amazon.com to create an account if you don't have one already.
- An AWS username and password for the AWS account used during testing.
- Permissions to create and manage S3 buckets in AWS. Your AWS user must be able to create a bucket (if one doesn't already exist), add/modify bucket policies, and upload files to the bucket.
- The IAM role used by the agent container must have putObject permissions for the S3 bucket and prefix that will be used as the destination by the pipeline (a minimal permission check is sketched after this list).
- An up-and-running Amazon S3 bucket.
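As an illustration, the sketch below uses boto3 with the credentials available to the agent's IAM role to write a small object under the destination prefix, confirming the putObject permission. The bucket and prefix names are placeholders.

```python
# Minimal sketch: confirm the agent's IAM role can put objects under the
# destination bucket/prefix. Bucket and prefix names are placeholders.
import boto3

bucket = "my-cdc-destination-bucket"  # placeholder
prefix = "cdc/"                       # placeholder
key = f"{prefix}mdl_permission_check.txt"

s3 = boto3.client("s3")  # picks up the agent container's IAM role credentials
s3.put_object(Bucket=bucket, Key=key, Body=b"permission check")
print(f"putObject succeeded for s3://{bucket}/{key}")

# Optional cleanup; requires s3:DeleteObject, which the role may not need.
s3.delete_object(Bucket=bucket, Key=key)
```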
Azure Blob Storage
- Your destination should be an Azure Storage account that supports containers, such as BlobStorage, Storage, or StorageV2.
- At a minimum, the Reader & Data Access role is required for sufficient permissions. The role should be assigned on the Azure Storage account in which your destination container is located.
- The destination container needs to use an access key for authentication.
- The agent container needs to use a shared key, injected as an environment variable, for authentication to the storage container (a minimal sketch follows this list).
- If your storage account only allows access from selected networks, IP "allowlisting" is needed.
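For example, the sketch below reads a shared key from an environment variable (here called AZURE_STORAGE_KEY, a name chosen purely for illustration) and uses the azure-storage-blob library to authenticate to the destination container. The account and container names are placeholders.

```python
# Minimal sketch: authenticate to the destination container with a shared
# key injected as an environment variable. The variable name, account,
# and container names are placeholders.
import os
from azure.storage.blob import BlobServiceClient

account = "mystorageaccount"                  # placeholder
container = "cdc-destination"                 # placeholder
shared_key = os.environ["AZURE_STORAGE_KEY"]  # injected into the agent container

service = BlobServiceClient(
    account_url=f"https://{account}.blob.core.windows.net",
    credential=shared_key,
)
container_client = service.get_container_client(container)
blob = container_client.upload_blob(
    "mdl_permission_check.txt", b"permission check", overwrite=True
)
print("Upload succeeded:", blob.url)
```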
Google Cloud Storage
- A Google Cloud Storage account. You can sign up for a free trial account at https://cloud.google.com/storage.
- Google username and password for the Google Cloud Storage account used during testing.
- An up-and-running Google Cloud Storage bucket.
- Permissions to create and manage storage in your Google Cloud Storage account. Your Google Cloud Storage user must be able to create a bucket (if one doesn't already exist), add/modify bucket policies, and upload files to the bucket, as shown in the sketch below.
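The sketch below is one way to confirm those permissions using the google-cloud-storage client library: it creates the bucket if it doesn't exist and uploads a small test file. The project ID and bucket name are placeholders.

```python
# Minimal sketch: confirm the user can create the bucket (if needed) and
# upload files to it. Project ID and bucket name are placeholders.
from google.cloud import storage
from google.cloud.exceptions import NotFound

client = storage.Client(project="my-project")  # placeholder project
bucket_name = "my-cdc-destination-bucket"      # placeholder

try:
    bucket = client.get_bucket(bucket_name)
except NotFound:
    bucket = client.create_bucket(bucket_name)  # requires bucket-creation permission

bucket.blob("mdl_permission_check.txt").upload_from_string(b"permission check")
print(f"Upload succeeded to gs://{bucket_name}")
```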