Schema Drift

Schema Drift occurs when your data source changes its metadata: fields, columns, and types can be added, removed, or changed on the fly. If you don't account for schema drift, your data flow becomes vulnerable to changes in the upstream data source.

Matillion Data Loader pipelines can easily adapt to these schema changes and propagate them to the destination, without assistance from IT or the need to code.

These "self-healing" pipelines not only save time but also significantly reduce the number of job failures and delays caused by schema drift, ensuring that data continues to flow, uninterrupted, from source to target.

Schema Drift Support

Most pipelines include Schema Drift support. Schema drift occurs when changes are made to the source data or API settings, creating a difference between the source and the pipeline's configuration. If these changes are not handled appropriately, the pipeline can fail or process incorrect information. Pipelines with Schema Drift support handle changes to the source data and continue operating, although you may still need to intervene if you notice schema drift occurring. The table at the bottom of the page lists the supported data sources.

Schema Drift support handles the following changes to source data:

For Batch Pipelines

  1. Missing columns: Missing columns are loaded as NULL.
  2. Data type changes: See Data Type changes section below for more detail.
  3. Missing tables: Missing tables are not loaded. All other tables will continue to be loaded. You will need to update your pipeline settings to remove the missing tables.
  4. Newly added tables and fields: The pipeline does not automatically add new tables or fields from the data source to your pipeline. The pipeline is not otherwise affected, and you can manually add new tables and fields to your pipeline as required.
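The batch-pipeline rules above can be sketched as a small reconciliation step. This is a minimal, hypothetical illustration (the function and column names are not Matillion's actual implementation): missing columns load as NULL, and fields newly added at the source are ignored until they are added to the pipeline configuration.

```python
# Hypothetical sketch: map a source row onto the columns the pipeline
# is configured to load, tolerating schema drift in the source.

def reconcile_row(row, configured_columns):
    """Map a source row onto the configured target columns."""
    return {
        # Missing columns are loaded as NULL (None in Python);
        # dict.get returns None when the key is absent.
        col: row.get(col)
        for col in configured_columns
    }

configured = ["id", "name", "email"]

# The source dropped "email" and added "phone" (schema drift).
source_row = {"id": 1, "name": "Ada", "phone": "555-0100"}

print(reconcile_row(source_row, configured))
# {'id': 1, 'name': 'Ada', 'email': None}  -- "phone" is ignored
```

The new "phone" field simply does not appear in the output until it is manually added to the configured column list, matching rule 4 above.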

For CDC Pipelines

Matillion ETL shared jobs are available that handle schema drift, accommodating the conditions below. The agent loads data to your chosen cloud storage area, and the data is then transformed to accommodate these conditions. If you intend to transform the data from your chosen cloud storage area yourself, the points below are useful to consider.

  1. Adding a Table: You must create a new pipeline to add a new table. The ability to edit a pipeline will be considered in a future release.
  2. Delete a Table: If a table is deleted/dropped in your source, it is not removed from the destination platform and will be retained for historical purposes. You can manually drop this table if needed.
  3. Adding a Column: The pipeline will pick up any new columns added to selected tables if the new column contains data.
  4. Deleting a Column: If a column is dropped from the source table, it will remain in your target table. However, it will not be loaded - historical data will be retained and from the point of deletion, the values for this column will be inserted as NULL.
  5. Data Type change: See the Data Type Changes section below for more detail. Behavior is the same as for batch pipelines if you use the Matillion ETL shared jobs.
  • In a batch pipeline, you select tables and columns, but in a CDC pipeline, you select only tables. As a result, any columns that are added or removed will become visible once the data changes.
  • Changes are driven by data rather than by DDL. As a result, any changes you make to the schema of the source tables will not be reflected until the data changes. This encompasses new (empty) tables and new (empty) columns.
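The data-driven behavior above can be sketched as follows. This is a hypothetical illustration, not Matillion's implementation: a new column surfaces only once it appears (with data) in a change event, and a column dropped at the source simply stops arriving, so it is loaded as NULL from that point on.

```python
# Hypothetical sketch of data-driven schema drift handling for CDC events.

def apply_cdc_event(event, target_columns):
    """Process one change event against the known target schema.

    Returns (new_columns, row):
      new_columns -- columns seen in the event but not yet in the target
                     (to be added to the target table before loading).
      row         -- the event mapped onto the combined column set; columns
                     dropped at the source are loaded as NULL from now on.
    """
    new_columns = [c for c in event if c not in target_columns]
    all_columns = list(target_columns) + new_columns
    row = {c: event.get(c) for c in all_columns}
    return new_columns, row

target = ["id", "name", "email"]

# The source added "phone" and dropped "email".
event = {"id": 7, "name": "Lin", "phone": "555-0199"}
new_cols, row = apply_cdc_event(event, target)
print(new_cols)  # ['phone']
print(row)       # {'id': 7, 'name': 'Lin', 'email': None, 'phone': '555-0199'}
```

Note that an empty new column would never appear in an event payload, which is why DDL-only changes are not reflected until the data itself changes.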

Data Type Changes

Data type changes are accommodated, but if a change is not compatible with the target cloud platform, the current column is renamed to column_name_datetime and the original column name is re-purposed with the new data type. The format of the datetime suffix is yyyymmddhhmmss, for example 20210113110334, and it is the same for all columns in the same table in the same pipeline execution. The new column will be NULL up to the date of change; this should be considered for downstream dependencies such as views, reports, etc.
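The rename convention can be sketched as follows. This is a hypothetical illustration of the naming scheme described above (the function name is invented): one timestamp is generated per pipeline run and shared by every rename in that run.

```python
from datetime import datetime, timezone

# Hypothetical sketch of the column-rename convention: on an incompatible
# data type change, the existing column is archived under a name with a
# yyyymmddhhmmss suffix, and the original name is re-used for the new type.

def rename_for_type_change(column_name, run_timestamp):
    """Build the archival name for a re-typed column."""
    suffix = run_timestamp.strftime("%Y%m%d%H%M%S")  # yyyymmddhhmmss
    return f"{column_name}_{suffix}"

# All columns renamed in the same pipeline execution share one timestamp.
run_ts = datetime(2021, 1, 13, 11, 3, 34, tzinfo=timezone.utc)
print(rename_for_type_change("order_total", run_ts))
# order_total_20210113110334
```

Any view or report that references the original column name will silently switch to the new, initially NULL column after the change, which is why downstream dependencies need review.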

Supported Pipelines

Pipeline                     Batch schema drift   CDC schema drift
AWS Redshift                 Yes                  N/A
Bing Search                  Yes                  N/A
Cassandra                    No                   N/A
Couchbase                    No                   N/A
Dynamics365 Sales            Yes                  N/A
Elasticsearch                No                   N/A
Eloqua (Oracle)              Yes                  N/A
Excel                        Yes                  N/A
Facebook AdAccounts          Yes                  N/A
Facebook AdInsights          Yes                  N/A
Facebook Content Insights    Yes                  N/A
Gmail                        Yes                  N/A
Google AdWords               Yes                  N/A
Google Analytics             Yes                  N/A
Google BigQuery              Yes                  N/A
Google Custom Search         Yes                  N/A
Google Sheets                Yes                  N/A
Greenplum                    No                   N/A
HubSpot                      Yes                  N/A
IBM DB2                      Yes                  Yes
Jira                         Yes                  N/A
Magento                      No                   N/A
Mailchimp                    No                   N/A
MariaDB                      Yes                  N/A
Marketo                      Yes                  N/A
Microsoft SQL Server         Yes                  No
MongoDB                      Yes                  N/A
MySQL                        Yes                  Yes
NetSuite                     Yes                  N/A
OData                        Yes                  N/A
Open Exchange Rates          No                   N/A
Oracle                       Yes                  Yes
PostgreSQL                   Yes                  Yes
QuickBooks Online            No                   N/A
Sage Intacct                 Yes                  N/A
Salesforce                   Yes                  N/A
Shopify                      Yes                  N/A
Snowflake                    Yes                  N/A
Stripe                       Yes                  N/A
SugarCRM                     Yes                  N/A
Twitter Tweets               Yes                  N/A
Xero                         Yes                  N/A
Zendesk (Query)              Yes                  N/A
Zendesk Talk                 No                   N/A

What's Next