CDC Shared Jobs Overview
Overview
These shared jobs do not work with the Manage Change Data Capture feature in Matillion ETL. Instead, they are for the CDC loading feature in Matillion Data Loader.
Change Data Capture (CDC) is supported within Matillion ETL by three shared jobs, which process data from your chosen sources after a CDC agent has loaded it to your chosen cloud storage area. These shared jobs facilitate two data-loading use cases.
Data is loaded to your chosen cloud storage area by the CDC agent installed on your source database, into your choice of Amazon S3, Google Cloud Storage, or Azure Blob Storage.
When using Redshift, a shared job won't extract columns that contain uppercase letters. You may see an error message regarding NULL values, or the job may complete but with NULL values in place of the data that wasn't extracted. To resolve this issue, you must set the Redshift parameter enable_case_sensitive_identifier to true. You can do this by altering the user or by updating the Redshift parameter group. See the Amazon Redshift documentation for examples.
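As a minimal sketch of the user-level approach, assuming a psycopg2 connection and a hypothetical user name matillion_user (all connection details below are placeholders):

```python
# Sketch: enable case-sensitive identifiers for the user Matillion connects as.
# The host, credentials, and user name "matillion_user" are all placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="admin",
    password="********",
)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(
        "ALTER USER matillion_user SET enable_case_sensitive_identifier TO true"
    )
conn.close()
```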
The shared jobs below can then read that data from cloud storage and load it into target tables on your chosen data platform. The process can load a single table or all tables, depending on the chosen configuration and shared job.
Schema drift support is also available and can be toggled using the Schema Drift Action job parameter, which is fully described in the configuration guide for your use case. Additionally, see the Schema Drift support section further down, which explains how this functionality is implemented.
The shared jobs gather information by reading the files from cloud storage, including metadata about the source, any change events, and whether any source table changes have occurred. This information is then used to drive a workflow that loads the agent-provided data into the target, based on the files loaded into cloud storage.
There may be nuances in how data types are converted and supported between each source and each cloud data platform. Consider these when creating your workflow, with regard to what each data type will be presented as in your target table or tables. This is covered below.
Below you will find two use cases. Configure the shared jobs for your cloud data platform according to your use case, as described in the relevant section. The suggested default is Use Case 1 (Sync All Tables), as this is a self-contained process, bearing in mind the caveat about the key requirements for the Copy Table transformation type, which is explained in the configuration help below.
Use Case 1 - Sync All Tables
The purpose of this use case is to process CDC data from a Snowflake external table for all source tables, and then keep matching target tables in sync with the source tables.
First, place the Sync All Tables shared job onto the job canvas of an orchestration job from the Shared Jobs palette.
If the CDC files are stored in Azure Blob Storage, the following Python3 packages must be installed on the Matillion ETL instance:
azure-identity
azure-keyvault-secrets
azure-storage-blob
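To check whether these packages are already present on the instance, one option is a quick import test (a sketch, assuming Python 3):

```python
# Sketch: verify the Azure packages needed for Blob Storage CDC files are
# importable; print the pip3 command for any that are missing.
import importlib

for module in ("azure.identity", "azure.keyvault.secrets", "azure.storage.blob"):
    try:
        importlib.import_module(module)
        print(f"{module}: OK")
    except ImportError:
        print(f"{module}: missing - run: pip3 install {module.replace('.', '-')}")
```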
On Snowflake, the Sync All Tables shared job contains a call to the Create or Refresh External Table shared job by default, so a standalone process to create or refresh the external table does not need to be added to your workflow.
Use Case 2 - Sync Single Table
The purpose of this use case is to process CDC data from a Snowflake external table for a single source table, and then keep a target table in sync with the source table.
First, place the Sync Single Table shared job onto the job canvas of an orchestration job from the Shared Jobs palette.
If the CDC files are stored in Azure Blob Storage, the following Python3 packages must be installed on the Matillion ETL instance:
azure-identity
azure-keyvault-secrets
azure-storage-blob
(Snowflake only) To ensure that the external table exposes the data from Amazon S3, Google Cloud Storage, or Azure Blob Storage, the shared job Create or Refresh External Table must be run to provide an updated view of the data in your chosen cloud storage area.
Therefore, the Create or Refresh External Table shared job is added to the orchestration job canvas. These shared jobs are linked, as in the following workflow.
Shared job links
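For context, Snowflake refreshes the file metadata of an external table with an ALTER EXTERNAL TABLE ... REFRESH statement, which is broadly the kind of operation this step wraps. Below is a minimal sketch using the snowflake-connector-python package; the identifiers and credentials are placeholders, not part of the shared job:

```python
# Sketch: refresh a Snowflake external table so it reflects new CDC files
# in cloud storage. All identifiers and credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="********",
    warehouse="MY_WH",
    database="MY_DB",
    schema="MY_SCHEMA",
)
try:
    conn.cursor().execute("ALTER EXTERNAL TABLE MY_CDC_EXTERNAL_TABLE REFRESH")
finally:
    conn.close()
```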
Schema Drift support
Schema drift support accommodates changes made to the source data, such as:
- Missing columns as a result of changes in the source schema. Missing columns are loaded as NULL.
- Data type changes for specified columns in the shared job configuration.
- Tables no longer present in the source schema. Missing tables are no longer loaded and the shared job will fail, but all other tables specified as part of the shared job configuration will still be loaded. If this scenario occurs, edit the shared job's Table and Columns grid variable to remove the missing table.
- Manual addition of new tables or columns via the Table and Columns grid variable within the existing shared job configuration. If a new table or column is added to your source, it's not added to the shared job configuration by default; however, any new tables or columns can be added manually.
To learn more, read Schema Drift.
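As an illustration of the missing-column rule above, a column absent from an incoming change event still appears in the target row, carrying NULL. A minimal Python sketch with hypothetical column names:

```python
# Sketch: columns missing from an incoming change event load as NULL (None).
# "target_columns" and the event contents are hypothetical examples.
target_columns = ["id", "name", "legacy_col"]

def to_target_row(event: dict) -> dict:
    # dict.get returns None for any target column absent from the event.
    return {col: event.get(col) for col in target_columns}

print(to_target_row({"id": 1, "name": "widget"}))
# {'id': 1, 'name': 'widget', 'legacy_col': None}
```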
Data Type changes
Data type changes will also be accommodated, but if a change is not compatible with the target cloud data platform, the current column will be renamed as <column_name>_datetime and the original column name repurposed with the new data type. The format of the datetime extension is _yyyymmddhhmmss, for example _20210113110334, and it will be the same for all columns in the same table in the same shared job configuration. The new column will be NULL up to the date of the change. This should be considered for downstream dependencies such as views and reports.
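Purely to illustrate the naming convention (this is not the shared job's own code), the suffix is the change timestamp formatted as yyyymmddhhmmss:

```python
# Illustration of the rename convention applied on an incompatible type change.
from datetime import datetime

def renamed_column(column_name: str, changed_at: datetime) -> str:
    # Suffix format is _yyyymmddhhmmss, e.g. _20210113110334.
    return f"{column_name}_{changed_at:%Y%m%d%H%M%S}"

print(renamed_column("order_total", datetime(2021, 1, 13, 11, 3, 34)))
# order_total_20210113110334
```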
Data type tables can be found at Data typing with CDC shared jobs.
Databricks runtime releases
The CDC shared jobs in Matillion ETL currently support Databricks Runtime LTS 9.1 or later.