Schema Drift
  • Dark
    Light

Schema Drift

  • Dark
    Light

Overview

Schema Drift occurs when your data source changes its metadata. On the fly, fields, columns, and types can be added, removed, or changed. If you don't account for schema drift, your data flow becomes vulnerable to changes in the upstream data source.

Matillion Data Loader pipelines can easily adapt to these schema changes and propagate them to the destination, without assistance from IT or the need to code.

These "self-healing" pipelines not only save time, they significantly reduce the number of job failures and delays due to schema drift. Users can ensure that data continues to flow, uninterrupted, from source to target.


Schema Drift Support

Most pipelines include Schema Drift support. Schema Drift is what occurs when changes are made to the source data or API settings, thereby creating a difference between the source and your configuration in the pipeline. If these changes are not handled appropriately, the pipeline can fail or process incorrect information. As such, pipelines that have Schema Drift support will handle changes to the source data and continue operating. You may still need to make changes if you notice schema drift occurring. The table at the bottom of the page shows the full list of supported data sources.

Schema Drift support handles the following changes to source data:

For Batch Pipelines

  1. Missing columns: Missing columns are loaded as NULL.
  2. Data type changes: See Data Type changes section below for more detail.
  3. Missing tables: Missing tables are not loaded. All other tables will continue to be loaded. You will need to update your pipeline settings to remove the missing tables.
  4. Newly added tables and fields: The pipeline does not automatically add new tables/fields to the data source to your pipeline. The pipeline is not otherwise affected, and you can manually add new tables and fields to your pipeline as required.

For CDC Pipelines

Matillion shared jobs are available that handle schema drift (for Snowflake and Redshift currently), which will accommodate the conditions below. The agent loads data to your chosen cloud storage area which is then transformed to accommodate these conditions. If you intend to transform the data from your chosen cloud storage area yourself, the below are useful points to consider.

  1. Adding a Table: You must create a new pipeline to add a new table. The ability to edit a pipeline will be considered in a future release.
  2. Delete a Table: If a table is deleted/dropped in your source, it is not removed from the destination platform and will be retained for historical purposes. You can manually drop this table if needed.
  3. Adding a Column: The pipeline will pick up any new columns added to selected tables if the new column contains data.
  4. Deleting a Column: If a column is dropped from the source table, it will remain in your target table. However, it will not be loaded - historical data will be retained and from the point of deletion, the values for this column will be inserted as NULL.
  5. Data Type change: See below Data Type changes section for more detail. Same as for Batch pipelines, if you use the Matillion Shared jobs.

Please Note

  • In a batch pipeline, you select tables and columns, but in a CDC pipeline, you exclusively select tables. As a result, any columns that are added or removed when the data changes will be visible.
  • Changes are driven by data rather than by DDL. As a result, any changes you make to the schema of the source tables will not be reflected until the data changes. This encompasses new (empty) tables and new (empty) columns.

Data Type Changes

Data Type changes are accommodated, but if these are not compatible changes for the target cloud platform, the current column will be renamed as column_name, _datetime , and the column re-purposed as the new datatype. The format of the datetime extension is mmddhhmmss , for example, 20210113110334 and will be the same for all columns in the same table in the same Pipeline execution. The new column will be NULL up to the date of change - this should be considered for downstream dependencies such as views, reports, etc.


Supported Pipelines

Pipeline Schema Drift
AWS Redshift Yes
Bing Search Yes
Cassandra No
Couchbase No
Dynamics365 Sales Yes
Elasticsearch No
Eloqua (Oracle) Yes
Excel Yes
Facebook AdAccounts Yes
Facebook AdInsights Yes
Facebook Content Insights Yes
Gmail Yes
Google AdWords Yes
Google Analytics Yes
Google BigQuery Yes
Google Custom Search Yes
Google Sheets Yes
Greenplum No
HubSpot Yes
IBM DB2 Yes
Jira Yes
Magento No
Mailchimp No
MariaDB Yes
Marketo Yes
Microsoft SQL Server Yes
MongoDB Yes
MySQL Yes
NetSuite Yes
OData Yes
Open Exchange Rates No
Oracle Yes
PostgreSQL Yes
QuickBooks Online No
Sage Intacct Yes
Salesforce Yes
Shopify Yes
Snowflake Yes
Stripe Yes
SugarCRM Yes
Twitter Tweets Yes
Xero Yes
Zendesk (Query) Yes
Zendesk Talk No