Related Topic: Parallelism with Matillion ETL for Redshift
This article addresses the limits of Matillion ETL Job queues and concurrently run Jobs. An overview of Matillion ETL's process of running Jobs is shown below:
When a Job is submitted in Matillion ETL via Message Queues, the API, or the Scheduler, it is queued via the quartz job scheduler. This is an unbound queue, meaning that there is no imposed limit to the size that this queue can grow to.
Jobs are then executed by the task runner, and a single Matillion ETL instance will run up to 16 Jobs concurrently. Multiple runs of the same Job will queue behind each other; for example, running one Job 16 times will run those Jobs sequentially, whereas running 16 different Jobs will run all of those Jobs concurrently.
This can become a problem when there are multiple instances of a Job being submitted, especially if the Jobs are long-running, because they can conflict with a scheduled Job and prevent it from running on its scheduled time. In these types of scenarios, it is recommended to use a micro-batching pattern.
An important exception to note is that Jobs that are run through the Matillion ETL user interface, and validation tasks, are submitted directly to the task runner, not via the quartz job scheduler. This means users can bypass any queued Jobs and run immediately if the concurrency limit is not reached.
In Matillion ETL, the task runner handles validation tasks, and queues these tasks behind any currently running Jobs. With this in mind, users should note that once the maximum concurrency limit is reached, any development performed on the same instance may experience long validation times. If this does occur, users may find benefit in separating their production and non-production workloads across unique instances.
When a Job is running, it is provided with a pool of threads to run its tasks. The number of threads in each pool defaults to 2 *n_processors. For more information on see our article on instance sizes.
Clustered instances are not relevant to Matillion ETL for BigQuery.
A cluster is considered two or more single instances. Each instance has its own task runner, and will pick up Jobs from a shared quartz queue. Therefore, a two-node cluster will be able to run 32 Jobs concurrently (16 * 2) with up to 16 on each node.
The distribution of tasks across nodes can be considered essentially random. However, a node that already has a full queue of 16 tasks will not pick up further tasks until there is at least one free task runner.
When a Matillion ETL Job runs on an HA cluster and a node fails (for example, due to a network failure, an instance crash, etc.) then within a few seconds the failure will be detected by another node and the Job re-submitted from the start. It is important to make your Jobs idempotent, so that they can handle these scenarios.
Jobs executed through the Matillion ETL user interface are executed by whichever node picks up the command.
Jobs executed via the scheduler, message queue, and API are submitted to the quartz scheduler, which distributes the tasks across the cluster randomly.
The queuing of Jobs with the same name applies per-node. For example, if two runs of the same Job end up on the same node, the second will queue behind the first; if two runs of the same Job end up being run on different nodes, they will run concurrently. This may cause deadlocks, and should generally be avoided by adhering to the advice given in this article.
Note: It's never safe to permit multiple copies of a parameterized Transformation job to run in parallel, because there is only one copy of each variable. We highly recommend using Shared Jobs in these cases, instead.
Sub Jobs and Shared Jobs
When using a Run Orchestration component or Shared Job, these sub Jobs do not queue, since the queuing is done at the top level. The important thing to note here is that a Run Orchestration component can reference the same job multiple times from different parent jobs; however, they will all reference the Job using the same Job ID. Because of this, only one instance can run at a time. Others will queue behind the currently running instance in the task runner. When running a Shared Job, this behaviour is different, and each instance of a shared Job has a unique ID, meaning multiple instances of the same shared Job can run concurrently.
Matillion's Multiple Environment Connections feature lets users run multiple ETL Jobs across multiple connections. For more information, read our Multiple Connections documentation.