Designing a Job for a High Availability Cluster
Designing a job for HA
When a Matillion ETL Job runs on a high availability (HA) cluster and a node fails (because of problems with network failure, instances crashing, etc.) and the Job is invoked by the Matillion ETL job scheduler, then within a few seconds the failure will be detected by another node and the job re-submitted from the start. Jobs that are invoked any other way (through the UI, via API or SQS) are not restarted.
Making sure your job is suitable for execution in such an environment revolves around 2 concepts:
- Transaction Control
Either concept, or more commonly a mixture of both, will result in durable jobs that will complete even in the event of total failure of an instance.
The idea of a database transaction is that if an error occurs, data is rolled-back to before the transaction began. A transaction can span many operations, and on Redshift this includes DDL statements as well as DML operations. Only if all operations succeed will any other database users see the changes made during that transaction. Matillion implements transactions as separate Begin, Commit and Rollback components - if you do not use any such components in your job, then each individual statement behaves as if it were it’s own transaction.
Rollback may occur on failure of the instance, a network outage or by running the Rollback component.
Advantages of Transaction Control
- It’s simple to implement
- It’s handled by the target database
Disadvantages of Transaction Control
- TRUNCATE is not safe - in Matillion, during a transaction, Truncate is implemented as “DELETE FROM” and will then require a vacuum to reclaim unused space.
- While the transaction is open, both old and new data must be kept. (Other users see the old data until a commit.)
- Only affects the database data, and not S3 (for example)
The concept of idempotence is that if a job fails and is re-run, it doesn’t matter where it failed since re-running the job will get things into a consistent state.
For example, if a staging table is truncated first, then a file loaded into it, then transformed and finally used to update a target table, then it doesn’t really matter how far it progressed up to the final table update because re-running the job from the start will have the same impact overall. So provided the final table update is done inside a transaction (table update actually executes multiple statements), the rest of the job does not have to.
Advantages of Idempotence
- On failure, re-run with the confidence nothing can get worse
- A good design principle is to design for failure
- It works for any resource, not just the target database.
Disadvantages of Idempotence
- You may need additional components that appear to be unnecessary. For example, truncating a staging table which should, in normal circumstances, already be empty seems wasteful.
- It can sometimes be quite difficult to get just right.
Knowing that a job can simply be re-run in the event of a failure is a good safety net, and even without High Availability failover (where the re-run it automatic) is still a good design goal to strive for. It makes supporting the job in production easier because devops engineers won’t need to know what the jobs do.