Loading Task Information

Overview

It can be useful to take the task information from Matillion ETL and load it into a table, or preserve it in file storage, allowing the user to run diagnostic or logistical analysis of the jobs that have been run.

Task Information is available within Matillion ETL by clicking Project and then clicking Task History.

Using the Matillion API, it is possible to have the task information returned as a JSON string. The API Query component in Matillion ETL can then be used to make the necessary API call and load the information into a table.

Please Note

You must authenticate as a user with API privileges in order to work with the Matillion API.


Task Information API

The Matillion API can be used to retrieve task information as a JSON string. The format is as follows:

curl -X GET "http://<InstanceURL>/rest/v0/tasks?groupName=<GroupName>&projectName=<ProjectName>&versionName=<VersionName>&environmentName=<EnvName>&jobName=<JobName>"

For example:

curl -X GET "http://matillioninstance.server.com/rest/v0/tasks?groupName=MainGroup&projectName=BookKeeping&versionName=default&environmentName=Live&jobName=Diagnostics"

The result is a JSON string describing the currently running tasks under the specified Group, Project, Version, Environment, and Job. Query parameters can be omitted to broaden the search. For example, to gather all tasks under the project from the above example:

curl -X GET "http://matillioninstance.server.com/rest/v0/tasks?groupName=MainGroup&projectName=BookKeeping"

Note that this returns only currently running jobs. To retrieve a task history, you must ask for jobs that are not currently running by adding running=false. The following API call is amended to return the task history.

curl -X GET "http://matillioninstance.server.com/rest/v0/tasks?groupName=MainGroup&projectName=BookKeeping&running=false"

This will return a task history in the form of JSON data, but will not create a file, nor import the information into a table.
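Outside of Matillion ETL, the same call can also be scripted. The sketch below uses Python's requests library with placeholder instance details and credentials (the API requires BASIC authentication as a user with API privileges, as noted above):

import json

import requests

# Placeholder instance URL and credentials -- substitute your own
BASE_URL = "http://matillioninstance.server.com/rest/v0/tasks"
AUTH = ("api_user", "api_password")

params = {
    "groupName": "MainGroup",
    "projectName": "BookKeeping",
    "running": "false",  # omit entries from this dict to broaden the search
}

response = requests.get(BASE_URL, params=params, auth=AUTH)
response.raise_for_status()

tasks = response.json()  # a list of task dictionaries
print(json.dumps(tasks, indent=2))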

To create that file or table within Matillion ETL itself, we can make use of the API Query component.

You can do this by creating a new Orchestration Job and then dragging the API Query component from the Components panel onto the canvas.



Example

In this example, the API Query component is used to import Matillion ETL task information, filter out any failed jobs, convert the dates into a more readable format, and finally save the remaining entries into a table that is then stored in an S3 bucket.

Important Information

  • To use the Matillion API, you must provide the intended "AuthScheme", "User", and "Password" for the API call.
  • Add the following entries to the Connection Options property of the API Query component to connect with the API.
    • AuthScheme = BASIC
    • User = <your Matillion ETL instance username>
    • Password = <your Matillion ETL instance password>
  • In versions of Matillion ETL where the REST driver was updated from the 2016 driver to the 2019 driver, users may encounter issues where data is not flattened, or is displayed as a single row rather than as multiple values. To fix this, add the connection option BackwardscompatibilityMode = true.

The API Query component is set up to use the table "task_table", which has been created by the Create/Replace Table component. The component is set up to use the Matillion API profile included in Matillion ETL, and the data source is specified as "Run History Details" (details can be found through Project → Manage API Profiles → Manage Query Profiles).



Extracting Task History Via API

This component uses an API call equivalent to the one below...

http://docker.dc.matillion.com:32815/rest/v0/tasks?projectName=DelMe&running=false

...to deliver a task history in JSON form, such as the snippet below:

{
  "type": "VALIDATE_ORCHESTRATION",
  "jobID": 1057,
  "jobName": "Regression Pack",
  "jobRevision": 2,
  "jobTimestamp": 1485246350127,
  "componentID": 1062,
  "componentName": "Orc/Trans Regression suite",
  "state": "SUCCESS",
  "rowCount": -1,
  "startTime": 1485246445284,
  "endTime": 1485246445313,
  "message": ""
},
{
  "type": "VALIDATE_ORCHESTRATION",
  "jobID": 3511,
  "jobName": "Run Tests",
  "jobRevision": 2,
  "jobTimestamp": 1485246351156,
  "componentID": 3540,
  "componentName": "Start 0",
  "state": "SUCCESS",
  "rowCount": -1,
  "startTime": 1485246445327,
  "endTime": 1485246445338,
  "message": ""
},

Raw JSON is not particularly accessible when you want to look over a large number of jobs by eye. Thankfully, the API Query component will take this JSON and reformat it into a table according to the specified profile. These profiles can be found through Project → Manage API Profiles → Manage Query Profiles.

As indicated by the component's properties, the example is using the Matillion API and the Run History Details RSD file contained within. This RSD defines the conversion of the JSON data into a table format.



Manipulating The Task History

The API Query component stores the job data in a table that a Transformation Job can use. In the Transformation Job, the data is filtered to keep only successful runs, and two SQL components then reformat the start and end times. Finally, these data streams are joined and the desired columns are selected and output to a table.



Explanation

The "task_table" contains the returned table from the API Query component—task data that has been taken from the JSON string that the Matillion API returns.

Using a Filter component

It can be seen that several jobs have failed, and for this report we are concerned only with the successful runs. The best way to remove the failed rows is to add a Filter component and set it up to filter out any rows where the "state" column is not "SUCCESS".

This filter returns only successful rows from task_table.

Using SQL to reformat Task History

Now we can use SQL components to convert the dates to a readable format, with one of the following SQL statements in each component:

Snowflake:

SELECT DATEADD(MS, "starttime", 0::TIMESTAMP) AS startdate, * FROM task_table;
SELECT DATEADD(MS, "endtime", 0::TIMESTAMP) AS enddate, * FROM task_table;

Amazon Redshift:

SELECT TIMESTAMP 'epoch' + CONVERT(BIGINT, starttime/1000) * INTERVAL '1 second' AS startdate, * FROM task_table;
SELECT TIMESTAMP 'epoch' + CONVERT(BIGINT, endtime/1000) * INTERVAL '1 second' AS enddate, * FROM task_table;

Note that the CONVERT syntax above is Redshift-specific; on Google BigQuery, the same conversion can be written with TIMESTAMP_MILLIS(starttime) and TIMESTAMP_MILLIS(endtime).
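All of these statements simply treat the stored values as epoch milliseconds. As a quick sanity check of the arithmetic, the "startTime" value from the sample JSON above can be converted in Python:

from datetime import datetime, timezone

start_time_ms = 1485246445284  # "startTime" from the sample JSON above

# Dividing by 1000 yields epoch seconds, matching the SQL above
start_date = datetime.fromtimestamp(start_time_ms / 1000, tz=timezone.utc)
print(start_date)  # 2017-01-24 08:27:25.284000+00:00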

The Join component is used to take data from these sources (termed A and B in the Join component) and return a single table.

Joining two Task History tables

The two tables are joined on the "id" and "componentname" columns to ensure that rows from A and B refer to the same task, so that the "enddate" from B can be matched to the correct row in A. The newly formatted columns "startdate" and "enddate" are mapped over the old timestamps "starttime" and "endtime" respectively, and any unwanted columns can be omitted.

This data is then written to "task_table" using the Rewrite Table component. The data can be checked using the Sample tab of either the Join or Rewrite Table components.

From the sample, the "starttime" and "endtime" appear to have been reformatted correctly and the desired columns are present. This data will remain in "task_table", which will be written to an S3 bucket by the S3 Unload component in the Orchestration Job.
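For reference, the same reshaping (filter, date conversion, and column selection) can be prototyped outside Matillion ETL. The sketch below uses pandas and assumes the task history has already been fetched into a tasks list, as in the earlier Python example; it illustrates the logic of the Transformation Job rather than replacing it:

import pandas as pd

# `tasks` is the list of task dictionaries fetched earlier
df = pd.DataFrame(tasks)

# Filter component equivalent: keep only successful runs
df = df[df["state"] == "SUCCESS"]

# SQL component equivalent: convert epoch milliseconds to timestamps
df["startdate"] = pd.to_datetime(df["startTime"], unit="ms")
df["enddate"] = pd.to_datetime(df["endTime"], unit="ms")

# Join/column-mapping equivalent: keep only the desired columns
report = df[["jobID", "jobName", "componentName", "state", "startdate", "enddate"]]
print(report.head())

Note that in pandas both conversions can be applied to a single DataFrame, so no join is needed; in the Transformation Job, each SQL component produces its own stream, which is why the Join component is required.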
