MDL Pipeline UI
A pipeline moves your data from a source system to a destination database. The pipeline is a collection of configuration details including the source configuration, destination configuration, batch run schedule, and any advanced properties that are needed to form the connection. The Matillion Data Loader UI provides an easy process to set up a data pipeline.
This article walks you through the steps required to configure and manage a pipeline. There will be slight variations in this process depending on whether you are using a Batch or CDC pipeline.
- You require active accounts with required permissions on the source and the destination systems. Read Sources and Destinations.
- We recommend that you use the Matillion Data Loader dashboard's region selector to choose your region for your UI before you start building your pipeline.
When you log into Matillion Data Loader via Matillion Hub for the first time, a welcome page is displayed, inviting you to create your first pipeline by clicking Add pipeline on the dashboard page.
Before you build your first pipeline, select your region using the region selector at the bottom right of the page.
- Right-clicking a region in the Matillion Data Loader will take you to that region's specific URL, and any data that is saved (such as pipeline definitions and agent definitions) will be stored against that region.
- The Matillion Data Loader dashboard region is set to US by default.
The source database is the one containing the data you want the pipeline to extract. Matillion Data Loader supports many diverse sources, displayed as tiles on the screen. Click the source you want to connect to.
Each pipeline can only support one source database at a time.
Some sources will support either CDC or Batch processing but not both. To filter the list of sources to show only those available to the processing you intend to use, click CDC or Batch Data under the Load data type heading at the left of the screen.
Not all sources are compatible with all destinations. To filter the list to show only the sources you may use with a specific destination, click the destination name under the Supported destinations heading at the left of the screen.
Choose data loading process
Sources may allow Batch Load Replication, Change Data Capture (CDC), or both, as options to ingest your data. If your chosen source allows both, you will be presented with a screen to select which method you want to use.
Read sources for a list of the sources that support the CDC and Batch data loading processes.
If you are creating a Batch pipeline, you will be taken to the Connect to page. If you are creating a CDC pipeline, you will first be taken to the Choose an agent to manage your pipeline page, where you will have to choose or create an agent.
Choose or create CDC Agent (CDC pipelines only)
The Choose an agent to manage your pipeline page shows all agents that you have configured. If you have an existing agent that has a Connected status but hasn't yet had a pipeline assigned, you can click Add pipeline next to that agent. Otherwise, you will have to create a new agent.
To create a CDC agent, read Agent Setup UI.
When the agent is created and has a Connected status, you will see the Add pipeline button next to it in the list of agents. Click this to add a pipeline to the agent.
After you click Add pipeline, you will be taken to the Connect to page.
Connect to source
On the Connect to page, you must provide the information needed to connect to that source. The configuration details vary between sources, so refer to the appropriate documentation for the chosen source, listed under sources.
Choose which tables from the data source to use.
Data sources such as spreadsheets don't use tables, and will therefore have different configuration requirements which will be described in the documentation for the source, listed under sources.
Use the arrow buttons to move tables to the Tables to extract and load listbox and then reorder any tables with click-and-drag. Select multiple tables using the
Click Continue with X tables to move forward.
You can then choose individual columns from each table to include in the pipeline. By default, Matillion Data Loader selects all columns from a table. Click Add and remove columns to change the list of columns. Use the arrow buttons to move columns out of the Columns to extract and load listbox and then reorder columns with click-and-drag. Select multiple columns using the
Additionally, you can set a primary key and assign an incremental column state to a column.
Click Done adding and removing to continue and then click Done.
Click Continue once you have configured each table.
Select the cloud data warehouse which your extracted data will be sent to, and enter the connection details required by that destination.
For details of the supported destinations and how to connect to them, read Destinations Overview.
You can configure multiple pipelines with the same destination. You can also replicate data from multiple sources into the same destination.
Give your pipeline a unique name, so you can identify it later.
Set pipeline frequency (Batch pipelines only)
You must set a Frequency, which defines how often the pipeline will run (pull data from the source). Specify the frequency in days, hours, or minutes, as follows:
- Day: Input a value from 1 to 7.
- Hour: Input a value from 1 to 23.
- Minute: Input a value from 5 to 59.
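The ranges above can be expressed as a simple lookup, which may be useful if you want to check a planned frequency before entering it in the UI. This is a minimal sketch only: the names `FREQUENCY_RANGES` and `is_valid_frequency` are hypothetical and are not part of Matillion Data Loader.

```python
# Hypothetical helper for illustration; not a Matillion Data Loader API.
# The ranges mirror the Day, Hour, and Minute limits listed above.
FREQUENCY_RANGES = {
    "day": (1, 7),
    "hour": (1, 23),
    "minute": (5, 59),
}


def is_valid_frequency(unit: str, value: int) -> bool:
    """Return True if `value` is inside the documented range for `unit`."""
    bounds = FREQUENCY_RANGES.get(unit.lower())
    if bounds is None:
        # Unknown unit: only day, hour, and minute are documented.
        return False
    low, high = bounds
    return low <= value <= high


print(is_valid_frequency("Minute", 5))   # lowest accepted minute value
print(is_valid_frequency("Minute", 4))   # below the documented minimum
print(is_valid_frequency("Hour", 24))    # use a 1-day frequency instead
```

Note that a frequency of 24 hours falls outside the Hour range, so a daily schedule would be expressed as 1 day.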
The pipeline will run as soon as it is first created, then run again on the specified schedule. For example, a pipeline created at 9:00am with a frequency of two hours will run immediately at 9:00am, then again at 11:00am, 1:00pm, and so on indefinitely.
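To see how the schedule follows from the creation time, the sketch below lists the first few start times for the example above. The function name `scheduled_starts` and the calendar date are assumptions made for illustration; the product calculates its schedule internally.

```python
from datetime import datetime, timedelta


def scheduled_starts(created_at, frequency, count):
    """Return the first `count` start times, beginning at pipeline creation."""
    return [created_at + i * frequency for i in range(count)]


# Arbitrary example date; only the time of day matters here.
created = datetime(2024, 1, 1, 9, 0)

for start in scheduled_starts(created, timedelta(hours=2), 4):
    print(start.strftime("%H:%M"))
# prints 09:00, 11:00, 13:00, 15:00 (one per line)
```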
You can't set a specific start time for the first pipeline run; it will always run as soon as it's created. For example, if you want the pipeline to run daily at a specific time of day, you must create the pipeline at that time of the day with a frequency of one day.
The scheduled delay is measured from start time to start time. For example, with a 10-minute frequency the second run begins 10 minutes after the first run begins, not 10 minutes after it completes. However, a new pipeline run cannot begin until the previous run has completed, so any scheduled run is skipped if an earlier run is still in progress. For example, suppose a pipeline is scheduled to run every 10 minutes but takes 12 minutes to complete. If the first run occurs at 09:00, the 09:10 run will be skipped, and the next run will occur at 09:20 as normal. Be aware, therefore, that a pipeline moving large amounts of data at a short frequency may not adhere to the schedule you have set.
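The skipping behavior can be illustrated with a small simulation. This is a sketch under stated assumptions, not the product's scheduler: the function `simulate_runs` is hypothetical, every run is assumed to take the same fixed duration, and a run that finishes exactly at a scheduled start time is assumed not to block that slot.

```python
from datetime import datetime, timedelta


def simulate_runs(first_start, frequency, duration, slots):
    """Label each scheduled slot as 'run' or 'skipped'.

    Slots are spaced start-to-start. A slot is skipped when the previous
    run is still in progress at the slot's start time.
    """
    results = []
    busy_until = first_start  # when the most recent run finishes
    for i in range(slots):
        slot = first_start + i * frequency
        if slot < busy_until:
            results.append((slot, "skipped"))
        else:
            results.append((slot, "run"))
            busy_until = slot + duration
    return results


# The example from the text: 10-minute frequency, 12-minute runs.
first = datetime(2024, 1, 1, 9, 0)  # arbitrary example date
for slot, outcome in simulate_runs(first, timedelta(minutes=10),
                                   timedelta(minutes=12), 4):
    print(slot.strftime("%H:%M"), outcome)
# prints:
# 09:00 run
# 09:10 skipped
# 09:20 run
# 09:30 skipped
```

Under these assumptions, a pipeline whose runs consistently exceed the frequency ends up running on every other slot, which is why short frequencies with large data volumes may not follow the schedule you set.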
If a pipeline is unable to run when scheduled due to some external factor, for example an outage in the source system, the pipeline won't retry until the next scheduled time. For example, an hourly schedule prevented from running at 10:00am because there is a source outage won't run until 11:00am.
Pipeline Summary (CDC pipelines only)
Review the selections you have made in each of the previous stages. The summary is divided into the following sections:
- Agent Details
- Source Details
- Selected Tables
- Destination Details
- Pipeline Settings
You can return to any earlier stage to make adjustments if required. If you are satisfied with your selections, click Create Pipeline to complete the process.
All pipelines (Batch and CDC) are displayed on the Pipelines dashboard.
The following table describes the elements in the above illustration:
| No. | Element | Description |
|---|---|---|
| 1 | Title header | Displays the title with the list of pipelines created. |
| 2 | Add pipeline | Click Add pipeline to create a new pipeline. |
| 3 | Search | Search for pipelines by typing a partial or complete pipeline name. |
| 4 | Filter | Filter the list of pipelines by Source (in the first dropdown field), Destination (in the second dropdown field), and pipeline Status (in the third dropdown field). The default is to show all results. |
| 5 | Pipeline list | Displays the list of pipelines created by the current user. Each row in the list provides a brief summary of an existing pipeline. The summary includes the Name of the pipeline, the Source from which the data is being fetched (and whether the pipeline is Batch or CDC), the Destination where data is being loaded, and the current status of the pipeline. |
| 6 | Sort | Click the arrows in the column headers to sort the list. |
| 7 | Pipeline detail (CDC pipeline) | Click ... to open a dialog with more pipeline details. This includes the name of the Agent the pipeline is associated with, and the pipeline's Throughput. You can also Stop/Delete pipelines if necessary. If the pipeline is in a streaming state, you can also use Start to restart it. |
| 8 | Pipeline detail (Batch pipeline) | Click ... to open a dialog with more pipeline details. This includes the Frequency you have set up, Last Sync information about the pipeline, and Rows moved in the pipeline run. You can also Edit/Delete pipelines if necessary. |
When deleting and recreating a CDC pipeline, you must clear out the files that the pipeline places in your cloud storage. If you don't, the new pipeline will recognize the existing offset.dat file and will therefore skip the snapshot phase.
Batch pipeline status
A batch pipeline will have one of the following status codes:
- Active: The pipeline is active, and scheduled.
- Paused: This pipeline is not scheduled to run.
- Running: This pipeline is actively running a schedule right now.
- Setting Up: This pipeline is in a set-up phase and will move into Running state.
CDC pipeline status
A CDC pipeline will have one of the following status codes:
- Unavailable: The Pipeline's agent is not currently connected. This may indicate a fault in the agent's installation.
- Not running: The Pipeline's agent is connected but is stopped and not running currently.
- Snapshotting: The pipeline is performing a snapshot to get the initial database state required for CDC to start.
- Streaming: The pipeline is streaming change records to cloud storage.
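One way to read the CDC status list is as a series of checks applied in order: agent connection first, then whether the pipeline is running, then whether the snapshot has finished. The sketch below models that reading. The input flags (`agent_connected`, `running`, `snapshot_complete`) and the function `cdc_status` are hypothetical and chosen for illustration only; Matillion Data Loader determines the status internally.

```python
from enum import Enum


class CdcStatus(Enum):
    """The four CDC pipeline statuses described above."""
    UNAVAILABLE = "Unavailable"
    NOT_RUNNING = "Not running"
    SNAPSHOTTING = "Snapshotting"
    STREAMING = "Streaming"


def cdc_status(agent_connected: bool, running: bool,
               snapshot_complete: bool) -> CdcStatus:
    """Derive a status from three hypothetical flags (an assumed model)."""
    if not agent_connected:
        # The agent is not connected, so nothing else can be reported.
        return CdcStatus.UNAVAILABLE
    if not running:
        # The agent is connected but the pipeline is stopped.
        return CdcStatus.NOT_RUNNING
    if not snapshot_complete:
        # The initial database state is still being captured.
        return CdcStatus.SNAPSHOTTING
    # The snapshot is done and change records are flowing to cloud storage.
    return CdcStatus.STREAMING


print(cdc_status(True, True, False).value)  # prints "Snapshotting"
```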