Designing a Job for a High Availability Cluster
Designing a job for HA
Note: High Availability cluster features are available only on Matillion ETL instances launched via AWS.
When a Matillion job runs on an HA cluster and a node fails (network failure, instance crash, etc.), another node detects the failure within a few seconds and re-submits the job from the start.
Note: Using Matillion ETL on HA clusters is an Enterprise-only feature, meaning it is only available to customers on m4.large and m4.xlarge instances. To migrate an instance to a cluster, consider using the Server Migration Tool or contacting Support.
To create a new instance on a cluster, consider using the prebuilt Matillion ETL Launch Templates.
Making sure your job is suitable for execution in such an environment revolves around two concepts:
- Transaction Control
- Idempotence
Either concept, or more commonly a mixture of both, will result in durable jobs that will complete even in the event of total failure of an instance.
Transaction Control
The idea of a database transaction is that if an error occurs, data is rolled back to its state before the transaction began. A transaction can span many operations, and on Redshift this includes DDL statements as well as DML operations. Other database users see the changes made during a transaction only if all of its operations succeed. Matillion implements transactions as separate Begin, Commit and Rollback components. If you do not use any such components in your job, each individual statement behaves as if it were its own transaction.
A rollback may be triggered by failure of the instance, by a network outage or by running the Rollback component.
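The all-or-nothing behaviour described above can be sketched outside Matillion. The example below is a minimal illustration, assuming an in-memory SQLite database as a stand-in for the target database; the `target` table and the `run_in_transaction` helper are hypothetical names, and the explicit `BEGIN`, `COMMIT` and `ROLLBACK` statements only play the role of the Begin, Commit and Rollback components.

```python
import sqlite3

# In-memory SQLite stands in for the target database (an assumption made
# purely for illustration; it is not how Matillion talks to Redshift).
# isolation_level=None gives manual control over BEGIN/COMMIT/ROLLBACK.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.execute("INSERT INTO target VALUES (1, 100)")


def run_in_transaction(conn, statements):
    """Run all statements atomically.

    Returns True if the transaction committed, False if it was rolled back.
    """
    conn.execute("BEGIN")  # role of the Begin component
    try:
        for sql in statements:
            conn.execute(sql)
        conn.execute("COMMIT")  # role of the Commit component
        return True
    except sqlite3.Error:
        conn.execute("ROLLBACK")  # role of the Rollback component
        return False


# The second statement fails (duplicate primary key), so the first
# statement's update is undone as well.
failed = run_in_transaction(conn, [
    "UPDATE target SET amount = 999 WHERE id = 1",
    "INSERT INTO target VALUES (1, 5)",
])
rows_after_failure = conn.execute(
    "SELECT id, amount FROM target ORDER BY id").fetchall()

# Both statements succeed, so both changes become visible together.
succeeded = run_in_transaction(conn, [
    "UPDATE target SET amount = 200 WHERE id = 1",
    "INSERT INTO target VALUES (2, 50)",
])
rows_after_success = conn.execute(
    "SELECT id, amount FROM target ORDER BY id").fetchall()
```

After the failed attempt the table still holds its original row, and after the successful attempt both changes appear at once.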
Advantages of Transaction Control
- It’s simple to implement
- It’s handled by the target database
Disadvantages of Transaction Control
- TRUNCATE is not safe: during a transaction, Matillion implements Truncate as “DELETE FROM”, which then requires a vacuum to reclaim unused space.
- While the transaction is open, both old and new data must be kept. (Other users see the old data until a commit.)
- It only affects data in the database, not other resources such as S3.
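The last disadvantage can be illustrated with a small sketch. It assumes an in-memory SQLite database in place of the target database and a temporary local directory as a hypothetical stand-in for an S3 bucket; the `audit` table and the file name are invented for the example.

```python
import os
import sqlite3
import tempfile

# Stand-ins (assumptions for illustration only): SQLite for the target
# database, a temporary directory for an S3 bucket.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE audit (note TEXT)")

export_dir = tempfile.mkdtemp()
export_path = os.path.join(export_dir, "export.csv")

conn.execute("BEGIN")
conn.execute("INSERT INTO audit VALUES ('export started')")
with open(export_path, "w") as f:  # side effect outside the database
    f.write("id,amount\n1,100\n")
conn.execute("ROLLBACK")  # simulate the job failing before commit

# The database change is gone, but the exported file is still there.
db_rows = conn.execute("SELECT COUNT(*) FROM audit").fetchone()[0]
file_survived = os.path.exists(export_path)
```

The rollback removes the inserted row but leaves the file in place, so anything written outside the database needs a different safeguard.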
Idempotence
The concept of idempotence is that if a job fails and is re-run, it doesn’t matter where it failed, since re-running the job will get things into a consistent state.
For example, suppose a job truncates a staging table, loads a file into it, transforms the data and finally uses it to update a target table. It doesn’t really matter how far the job progressed before the final table update, because re-running the job from the start has the same overall effect. So provided the final table update is done inside a transaction (a table update actually executes multiple statements), the rest of the job does not need one.
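The staging example above can be sketched as follows. This is an illustration under stated assumptions, not Matillion's implementation: SQLite stands in for the target database, `SOURCE_FILE` is a hypothetical file, the transform (doubling `amount`) is invented, and `NodeFailure` simulates a node crashing at a chosen step. In a real crash the database would roll back the open transaction itself; here the `except` branch does it.

```python
import sqlite3

SOURCE_FILE = [(1, 10), (2, 20), (3, 30)]  # hypothetical file rows: (id, amount)


class NodeFailure(Exception):
    """Simulated crash of the node running the job."""


def run_job(conn, fail_after=None):
    """Staging steps run without a transaction; only the final update uses one."""
    def checkpoint(step):
        if fail_after == step:
            raise NodeFailure(step)

    # 1. Empty the staging table first, even if it "should" already be empty.
    conn.execute("DELETE FROM staging")
    checkpoint("truncate")
    # 2. Load the file into staging.
    conn.executemany("INSERT INTO staging VALUES (?, ?)", SOURCE_FILE)
    checkpoint("load")
    # 3. Transform the staged rows in place.
    conn.execute("UPDATE staging SET amount = amount * 2")
    checkpoint("transform")
    # 4. Final table update: several statements, so wrap them in a transaction.
    conn.execute("BEGIN")
    try:
        conn.execute("DELETE FROM target WHERE id IN (SELECT id FROM staging)")
        checkpoint("mid-update")
        conn.execute("INSERT INTO target SELECT id, amount FROM staging")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


def fresh_db():
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE staging (id INTEGER, amount INTEGER)")
    conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.execute("INSERT INTO target VALUES (1, 1), (9, 9)")
    return conn


def target_rows(conn):
    return conn.execute("SELECT id, amount FROM target ORDER BY id").fetchall()


expected = [(1, 20), (2, 40), (3, 60), (9, 9)]
results = {}
for step in ("truncate", "load", "transform", "mid-update"):
    conn = fresh_db()
    try:
        run_job(conn, fail_after=step)  # first attempt dies at this step
    except NodeFailure:
        pass
    run_job(conn)  # re-submitted from the start, as on failover
    results[step] = target_rows(conn)
```

Whichever step the first attempt fails at, the re-run leaves `target` in the same state as a single clean run (`expected`). Removing the initial `DELETE FROM staging` would break this, because a re-run after the load step would stage every row twice.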
Advantages of Idempotence
- On failure, you can re-run with confidence that nothing can get worse
- A good design principle is to design for failure
- It works for any resource, not just the target database.
Disadvantages of Idempotence
- You may need additional components that appear to be unnecessary. For example, truncating a staging table which should, in normal circumstances, already be empty seems wasteful.
- It can sometimes be quite difficult to get just right.
Knowing that a job can simply be re-run in the event of a failure is a good safety net. Even without High Availability failover (where the re-run is automatic), it is still a good design goal to strive for. It makes supporting the job in production easier because devops engineers won’t need to know what the jobs do.