Matillion Change Data Capture
Note

This page relates to Matillion Data Loader Change Data Capture (CDC), a new software-as-a-service (SaaS) product. It is completely separate from, and does not relate to, the existing Matillion ETL Change Data Capture (CDC) product.

Introducing Matillion CDC

Matillion Change Data Capture (or CDC) is a near real-time data ingestion architecture that, together with Matillion ETL, offers a complete end-to-end solution for capturing, transforming, manipulating, and synchronizing data. Matillion CDC pairs its data ingestion capability with Matillion ETL's best-in-class data transformation and load capabilities.

Matillion CDC provides an intuitive user experience to quickly get you building and running pipelines against popular source databases and data lakes.


Matillion CDC overview

Matillion CDC is an agent-based change data capture solution that monitors and consumes all change events from the connected source database.

What is Change Data Capture?

Change data capture is a proven data integration pattern that identifies and captures changes from a source database in near real-time and delivers them to various destinations for further processing. Most database technologies write low-level transaction logs as changes occur within the database; in many of these technologies, the logs are crucial for disaster recovery. CDC-based technologies are able to hook into these transaction logs and ingest all changes as they happen.

As databases grow over time, large batch-based operations become inefficient, slow, and resource intensive. Monitoring and consuming changes as they occur simplifies the replication process, is far more efficient, and consumes fewer compute resources. Change data capture helps maintain consistency and functionality across all systems that rely on this data.

How does Matillion CDC use these logs?

Matillion CDC leverages these low-level logs to provide a consistent, up-to-date picture of the connected source database and any tables it is monitoring.

Because this approach reads the transaction logs rather than querying the tables themselves, resource usage on the source database is kept to a minimum, and day-to-day processing and transacting on the source database is unaffected.

Once configured and started, the Matillion CDC agent continues to operate without the need for further user intervention: it monitors all changes, consumes them, and delivers them to the appropriate target data lake. Users can then monitor the state of the agent and the CDC pipeline through Matillion Data Loader, whose UI provides convenient command and control features as well as monitoring capabilities to keep users informed of the current state of their pipelines.

Matillion CDC works hand-in-hand with Matillion ETL, using the CDC shared jobs supplied with Matillion ETL, to give you powerful transformation and load capabilities. All change events captured by Matillion CDC can be seamlessly picked up by Matillion ETL and used to support many different use cases such as:

  • Near real-time data synchronization and replication
  • Artificial intelligence and machine learning
  • Fraud detection and audit
  • Real-time marketing campaigns
  • Cloud migration

Architecture

Hybrid SaaS approach

To allow you to keep control of your data, Matillion CDC takes a hybrid approach with self-hosted agents that capture and handle the data. Once these agents are deployed, they can be controlled and monitored through the cloud user interface.

Matillion CDC Architecture

The above diagram lays out the different components in the system. The Matillion CDC Agent connects to the source database, processes the change events, and writes them out in near real-time as partitioned batches to the cloud storage destination. If the Agent is deployed alongside the source database in the same customer private cloud, the change data remains within that private cloud throughout. The SaaS interface provides the capabilities to configure the CDC pipeline for an Agent and to monitor the pipeline status. To support this, the Matillion CDC Agent establishes a secure connection back to the Matillion SaaS platform. Finally, the Agent requires access to a cloud secrets service where Agent authentication and connection secrets are stored.

The details of how to configure the source database and the specific capture process vary by connector. For more detail on specific source databases, refer to the corresponding connector page.

The Agent deployment instructions can be found here.

Note

This architecture means that the change data that's being streamed from the source and written to the destination by the Matillion CDC Agent remains within your infrastructure. Your change data is never sent to the Matillion SaaS Platform.

Batching

A running CDC pipeline writes out files containing all the changes to the specified storage destination. As changes are emitted from the source database, they're buffered within the Matillion CDC Agent, and the files within a partition (see the following section) are written when either the oldest change in the partition reaches a time threshold or a file size limit is reached. This means that no change should be older than the configured buffer time (60 seconds by default) from the time it was read by the agent.

Batching Process
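This time-or-size flush policy can be sketched in a few lines of Python. This is a minimal illustration of the rule described above, not the agent's actual implementation; the 16 MB size limit is an assumed value, and only the 60-second default buffer time comes from this documentation.

import time

class PartitionBuffer:
    """Buffers change records for one partition and decides when to flush."""

    def __init__(self, max_age_seconds=60, max_bytes=16 * 1024 * 1024):
        self.max_age_seconds = max_age_seconds  # default 60-second buffer time
        self.max_bytes = max_bytes              # assumed, illustrative size limit
        self.records = []
        self.size_bytes = 0
        self.oldest_read_at = None  # when the oldest buffered change was read

    def add(self, record: bytes):
        if self.oldest_read_at is None:
            self.oldest_read_at = time.monotonic()
        self.records.append(record)
        self.size_bytes += len(record)

    def should_flush(self) -> bool:
        # Flush when the oldest change reaches the time threshold,
        # or when the buffered data reaches the file size limit.
        if self.oldest_read_at is None:
            return False
        too_old = time.monotonic() - self.oldest_read_at >= self.max_age_seconds
        too_big = self.size_bytes >= self.max_bytes
        return too_old or too_big

    def flush(self) -> list:
        # In the real agent, a flushed batch becomes one file in cloud storage.
        batch, self.records = self.records, []
        self.size_bytes = 0
        self.oldest_read_at = None
        return batch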

Files

The generated change files are partitioned by the database table and UTC time of the change, plus each file will have a unique name. The full path description is:

<configured_prefix>/<database>/<schema>/<table>/<year>/<month>/<day>/<hour>/<filename>
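As an illustration of this layout, the path could be assembled as follows. The change_file_path helper and the example file name are hypothetical; only the path structure itself comes from the description above.

from datetime import datetime, timezone

def change_file_path(prefix, database, schema, table,
                     change_time: datetime, filename: str) -> str:
    """Builds the partitioned path for a change file from the UTC change time."""
    t = change_time.astimezone(timezone.utc)
    return (f"{prefix}/{database}/{schema}/{table}/"
            f"{t:%Y}/{t:%m}/{t:%d}/{t:%H}/{filename}")

# A change captured at 2021-10-31 00:00 UTC in table public.actor:
print(change_file_path("cdc", "postgres_db", "public", "actor",
                       datetime(2021, 10, 31, tzinfo=timezone.utc),
                       "part-0001.avro"))
# -> cdc/postgres_db/public/actor/2021/10/31/00/part-0001.avro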

Each file contains a series of change records in the Avro file format, a format supported for optimized ingestion into a cloud data warehouse. The structure is common to every source, although variations may occur for individual keys. Each record in the file is a change record with the following structure (the example below uses PostgreSQL):

{
  "before": null,
  "after": {
    "actor_id": "70ac0033-c25b-7687-5a86-6861c08cabdd",
    "first_name": "john",
    "last_name": "smith",
    "last_update": 1635638400123456,
    "version": 0
  },
  "metadata": {
    "connector": "postgresql",
    "db": "postgres_db",
    "key": ["actor_id"],
    "lsn": 37094192,
    "name": "matillion",
    "op": "r",
    "schema": "public",
    "table": "actor",
    "ts_ms": 1635638400234,
    "txId": 543,
    "version": "1.7.1.Final"
  }
}

The op field contains the type of change for this record: r = read (during snapshot), c = create, d = delete, u = update.

The before and after fields contain the values in that row as they were before and after the change was applied, so the fields will differ by table. Where a record was created, the before field will be empty; where a record was deleted, the after field will be empty.
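As a sketch of how a downstream consumer could apply these semantics, the following Python reads a change file with the open-source fastavro library and replays each record into an in-memory copy of the table. The file name is hypothetical and the field names follow the PostgreSQL example record above; Matillion ETL's CDC shared jobs, not this code, are the supported way to load these files.

from fastavro import reader

def apply_change(state: dict, record: dict) -> None:
    """Replays one change record into a dict keyed by the primary key columns."""
    meta = record["metadata"]
    op = meta["op"]  # r = read, c = create, u = update, d = delete
    # For deletes the row values live in 'before'; otherwise in 'after'.
    row = record["before"] if op == "d" else record["after"]
    key = tuple(row[column] for column in meta["key"])
    if op == "d":
        state.pop(key, None)   # 'after' is empty, so drop the row
    else:
        state[key] = row       # read/create/update all upsert the row

table_state = {}
with open("part-0001.avro", "rb") as fh:
    for change in reader(fh):
        apply_change(table_state, change)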


Core concepts

There are some important concepts to understand within the Matillion CDC product. These are detailed below:

Matillion CDC Agent
    The agent is a service that is deployed and configured within the user's technology stack. It is responsible for securely connecting, via a secrets management application, to both the source database and the target cloud data lake destination. Information security is of paramount importance to Matillion, which is why the agent reports only operational state back to the Matillion CDC SaaS platform; any sensitive data never leaves the user's technology stack. For further information, including installation instructions, review the Setting Up CDC Agents documentation here.

Pipeline
    A pipeline is a collection of configuration details (the source configuration, the target configuration, and any advanced properties) that allows the CDC agent to begin monitoring and consuming database changes and delivering them to the appropriate data lake. There is a 1-to-1 relationship between a CDC agent and a pipeline, so multiple agents are required to configure and run multiple pipelines. An illustrative configuration sketch follows this list.

Source
    The source (or source configuration) is a collection of connection properties for the source database you wish to configure a CDC pipeline for.

Target
    The target (or destination/target configuration) is a collection of properties describing where the agent will deliver the change event data captured from the source database.
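To make the pipeline concept concrete, the sketch below shows the kind of information a pipeline brings together. Every field name here is hypothetical, chosen only to illustrate the source/target/advanced grouping; real pipelines are configured through the Matillion Data Loader UI, not through code.

# Hypothetical shape of a pipeline's configuration details; the actual
# schema is defined by the Matillion Data Loader UI, not by this sketch.
pipeline = {
    "name": "actor-cdc",
    "source": {                 # connection properties for the source database
        "connector": "postgresql",
        "host": "db.internal.example.com",
        "port": 5432,
        "database": "postgres_db",
        "tables": ["public.actor"],
    },
    "target": {                 # where the agent delivers change event data
        "type": "s3",
        "bucket": "example-cdc-landing",
        "prefix": "cdc",
    },
    "advanced": {
        "buffer_seconds": 60,   # default time threshold from the Batching section
    },
}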

Next steps

The CDC agent writes the generated change files, as described above, into cloud storage. Once the change files are in cloud storage, additional processing is then required to read the data from those files and load it into a target table in your chosen cloud data warehouse. This processing is performed by Matillion ETL, using the CDC shared jobs that are provided within Matillion ETL.

CDC shared jobs are available in Matillion ETL on the following platforms:

  • Snowflake
  • Redshift
  • Delta Lake on Databricks
  • BigQuery

Note

Shared jobs for Azure Synapse Analytics are not currently available in Matillion ETL for Synapse on Azure.

