Create a CDC pipeline
A pipeline in Matillion CDC is a collection of configuration details, including the source configuration, target configuration, and any advanced properties that allow the CDC agent to begin monitoring and consuming database changes and delivering them to the appropriate data lake.
The specific details of configuring a CDC pipeline depend on the data source being used, but the general process of creating and configuring the pipeline is the same in each case and is described in this article.
Prior to creating a pipeline, you must create an agent. For more information, read Adding Agents.
Before you can configure a pipeline, you will typically require:
- An active account on the data source, with sufficient permissions to access it and configure it for CDC.
- A running agent.
- A running instance of the data source.
- Connection details such as host name, database name, and login details to connect CDC to the data source.
Other prerequisites may be necessary, depending on the source platform.
- In Matillion Data Loader, click Add pipeline.
- Under Load data type in the sidebar, click CDC.
- Choose your data source from the grid of supported sources.
- Click Change Data Capture.
Select an agent
Before you connect to the source, you will need to select an agent.
Connect to the source
After selecting an agent, configure the connection to the data source. The exact details required for the connection will depend on the data source, but typically you will need:
- Server address (the URL, domain name, or host name of the server hosting your source database).
- Port (the port used to communicate with the source database).
- Username for an account on the source database, and the name of the secret that you have defined for that account.
- Some sources may require additional details such as a database name or REST endpoint.
Optionally, if supported by the data source, you may specify additional JDBC connection parameters. Click Advanced settings and then choose a parameter from the dropdown list and enter a value for the parameter. Click Add parameter for each extra parameter you want to add. The documentation for your source data model will provide information about supported JDBC connection settings.
Once you have entered all connection settings, click Test and continue.
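If Test and continue fails, it can help to rule out basic network reachability first. The following is a minimal sketch, not part of Matillion CDC: it only checks whether a TCP connection to the server address and port can be opened. The function name `can_reach` and the example host are hypothetical, and the check is only meaningful when run from a machine with the same network path to the database as your agent.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    This does not log in to the database; it only confirms that the
    server address resolves and the port accepts connections.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example usage (hypothetical host and port; substitute your own):
# print(can_reach("db.example.internal", 5432))
```

A result of False points to a server address, port, or firewall problem rather than a username or secret problem.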
Select the schemas from which you would like to load tables. Use the arrow buttons to move schemas to the Selected schemas listbox, and then reorder any schema with click-and-drag. You can select multiple schemas using the SHIFT key. Click Continue with X schema to move forward.
On the following screen, you can choose tables from the selected schemas.
Choose any tables you wish to include in the pipeline. Use the arrow buttons to move tables and schemas to the Tables to extract and load listbox, and then reorder any tables with click-and-drag. You can select multiple tables using the SHIFT key.
All columns in the tables you select will be included in the pipeline.
Click Continue with X tables to move forward.
Table column names will be sanitized to adhere to the following rules:
- Start with [
- Subsequently contains only [
_] characters (no spaces).
I have a space→
This sanitization can result in duplicate field names, which prevents the pipeline from running. For example, the following two column names sanitize to the same name:
- duplicate$. -> duplicate__
- duplicate.$ -> duplicate__
To overcome this, the column.exclude.list advanced setting can be used to exclude any columns which may clash with other sanitized fields.
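To find clashing columns before you build the pipeline, you could run a check like the sketch below against your table's column names. It is an illustration only, not Matillion's implementation. It assumes the sanitization replaces every disallowed character with an underscore and that the allowed characters are letters and underscore for the first position, then letters, digits, and underscore afterwards. Those character classes are an assumption, chosen because they reproduce the two duplicate examples above. The names `sanitize` and `find_clashes` are hypothetical.

```python
import re
from collections import defaultdict

# ASSUMPTION: allowed characters are [A-Za-z_] for the first character and
# [A-Za-z0-9_] for the rest; anything else becomes an underscore.
_INVALID_FIRST = re.compile(r"[^A-Za-z_]")
_INVALID_REST = re.compile(r"[^A-Za-z0-9_]")


def sanitize(name: str) -> str:
    """Return the assumed sanitized form of a column name."""
    if not name:
        return name
    first = _INVALID_FIRST.sub("_", name[0])
    rest = _INVALID_REST.sub("_", name[1:])
    return first + rest


def find_clashes(columns):
    """Map each sanitized name to the original columns that collide on it."""
    groups = defaultdict(list)
    for col in columns:
        groups[sanitize(col)].append(col)
    # Keep only sanitized names produced by more than one original column.
    return {name: cols for name, cols in groups.items() if len(cols) > 1}


# Example usage with the two column names from the text above:
# find_clashes(["id", "duplicate$.", "duplicate.$"])
```

Any column reported by `find_clashes` is a candidate for column.exclude.list. Check the documentation for your source for the exact value format that setting expects.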
You must ensure that the selected tables are enabled for CDC in the source. The requirement for this will depend on the data source being used, and is described in the documentation for each source.
Once you configure the source, you must connect to a destination to load your data into.
Choose an existing destination or click Add a new destination.
Select a destination from the available options and specify the connection settings, as described in the documentation for that destination. Then complete the pipeline settings:
- Give your pipeline a unique name, so you can identify it later.
- You can disable/enable the snapshotting phase for the chosen source.
To learn more about snapshotting options, read the "Advanced Settings" section of the documentation for your CDC source.
Finally, click Create Pipeline.
Customers are charged for the first full load snapshot for the dataset you select. For example, if you build a pipeline and enable a snapshot on a table with 1000 rows, you will be charged for 1000 rows. However, if the pipeline stops for any reason and the agent drifts out of sync, or you need to reset and recreate the pipeline, necessitating another snapshot, you will not be charged again if the original PipelineID and AgentID have not changed.
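The charging rule above can be restated as a small sketch. This models only the rule as described in this article and is not Matillion's billing logic. The function name `billable_snapshot_rows` and the set used to remember charged pipelines are hypothetical.

```python
def billable_snapshot_rows(pipeline_id, agent_id, row_count, already_charged):
    """Return the rows charged for a snapshot under the rule described above.

    already_charged is a set of (pipeline_id, agent_id) pairs whose first
    full load snapshot has already been charged.
    """
    key = (pipeline_id, agent_id)
    if key in already_charged:
        # Repeat snapshot with unchanged PipelineID and AgentID: no charge.
        return 0
    already_charged.add(key)
    return row_count


# Example usage (hypothetical IDs):
# charged = set()
# billable_snapshot_rows("pipeline-1", "agent-1", 1000, charged)  # first snapshot
# billable_snapshot_rows("pipeline-1", "agent-1", 1000, charged)  # repeat snapshot
```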
Managing the pipeline
Once your pipeline has been created, you can manage it through the Pipelines dashboard. The dashboard lists a summary of each existing pipeline, including its Name, Source, Destination, and Status.
The pipeline's Status will be one of the following:
- Unavailable: The pipeline's agent is not currently connected. This may indicate a fault in the agent's installation.
- Not running: The pipeline's agent is connected but is stopped and not running currently.
- Snapshotting: The pipeline is performing a snapshot to get the initial database state required for CDC to start.
- Streaming: The pipeline is streaming change records to cloud storage.
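If you record pipeline statuses in your own monitoring notes or scripts, the four values above can be modeled as an enumeration. This is a minimal sketch that does not call any Matillion API. The names `PipelineStatus`, `agent_connected`, and `is_moving_data` are hypothetical, and the status labels are assumed to be copied from the dashboard exactly as listed above.

```python
from enum import Enum


class PipelineStatus(Enum):
    UNAVAILABLE = "Unavailable"    # agent is not currently connected
    NOT_RUNNING = "Not running"    # agent connected, pipeline stopped
    SNAPSHOTTING = "Snapshotting"  # taking the initial database snapshot
    STREAMING = "Streaming"        # streaming change records to cloud storage


def agent_connected(status: PipelineStatus) -> bool:
    """Only the Unavailable status indicates a disconnected agent."""
    return status is not PipelineStatus.UNAVAILABLE


def is_moving_data(status: PipelineStatus) -> bool:
    """True while the pipeline is snapshotting or streaming."""
    return status in (PipelineStatus.SNAPSHOTTING, PipelineStatus.STREAMING)


# Example usage: parse a label copied from the dashboard.
# status = PipelineStatus("Not running")
# agent_connected(status), is_moving_data(status)
```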
Click ... next to any pipeline to open a dialog showing the Agent name the pipeline is associated with, and the pipeline's Throughput. From here you can Stop or Delete the pipeline. If the pipeline is in a Not running state, you can use Start to restart it.
To edit a CDC pipeline (for example, to add or remove tables or change schema), you must delete the existing pipeline and rebuild it with the required new details. Read Edit a CDC pipeline for details of this process.
Read Pipeline Dashboard Overview for more details of using this screen.