Triggering ETL from a Cloud Storage Event via Cloud Functions


Overview

This example shows how the arrival of a new object in Cloud Storage can automatically trigger a Matillion ETL job that loads the file, transforms it, and appends the transformed data to a fact table.

This approach makes use of the following:

  1. Google Cloud Storage triggers.
  2. Google Cloud Functions (beta).
  3. Matillion ETL REST API.


Files Landing in Google Cloud Storage

A file could be uploaded to a bucket from a third-party service, copied in using gsutil, or transferred via the Google Cloud Storage Transfer Service.

Trigger a Google Cloud Function

The diagram below outlines the basic architecture.

  1. A file gets written to the Cloud Storage Bucket.

  2. A Cloud Storage event is raised, which in turn triggers a Cloud Function. The function is passed metadata about the event, including the object path.

  3. The Cloud Function issues an HTTP POST to invoke a job in Matillion ETL, passing various parameters including the job name and the name/path of the file that raised the event.

  4. Matillion ETL launches the appropriate Orchestration job and initialises a variable with the file path passed in the API call. The job loads data from the file into a staging table in BigQuery.

  5. Matillion ETL then launches a Transformation job to transform the staged data and move it into the appropriate tables in the data warehouse.
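The metadata delivered in step 2 can be unpacked along these lines. This is a minimal sketch, assuming the beta-era background-function event shape (an `event.data` object carrying `bucket` and `name`); the helper name is illustrative:

```javascript
// Illustrative sketch: derive the full Cloud Storage path of the object
// that raised the event, from the metadata passed to a background
// Cloud Function. The event shape (event.data.bucket / event.data.name)
// is an assumption based on the beta background-function payload.
function fileFromEvent(event) {
  const data = event.data || {};
  return 'gs://' + data.bucket + '/' + data.name;
}

// Example: an object "landing/flights.csv" created in bucket "mtln-trigger"
console.log(fileFromEvent({
  data: { bucket: 'mtln-trigger', name: 'landing/flights.csv' }
}));
```

The resulting `gs://` path is what the function forwards to Matillion ETL in step 3.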
     


Google Cloud Function

Google Cloud Functions are written in JavaScript and execute in a Node.js runtime. The service is still in beta but is well suited to our use case. We will use a background Cloud Function to issue an HTTP POST and invoke a job in Matillion ETL.
 


Prerequisites

  • Access to a Google Cloud Platform Project with billing enabled.
  • Cloud Functions API is enabled for this project.
  • We also need two buckets:
    • Trigger bucket - Raises cloud storage events when an object is created.
    • Source bucket - Holds the code and other artifacts for the cloud functions.
  • Credentials of a Matillion ETL user with API privilege.


Prepare the code

Download the function code archive (zip) attached to this article. The file index.js contains parameters we need to adjust before creating our Cloud Function. These parameters identify the Matillion ETL API endpoint, the credentials used to connect, and details of the job to launch. Adjust them accordingly and re-package index.js and package.json into a zip file. We will upload this archive in step 5 of the next section.

//Matillion ETL service parameters
const mtln_url = "http://11.111.111.111/rest/v0/tasks";
const mtln_api_user = "api-user";    //user with API privilege
const mtln_api_pass = "user-password";  // password for api user

Here 11.111.111.111 is a placeholder; replace it with your own Matillion ETL instance address.

// Matillion ETL job details
const groupName = 'Sales Demo';
const projectName = 'Your Project Name';
const versionName = 'default';
const environmentName = 'DEV';
const jobName = 'RunETL';
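Putting these parameters together, the function's POST request could be assembled roughly as follows. This is a sketch, not the attached index.js: the query-string layout and the shape of the request body (a plain JSON mapping for the file_to_load variable) are assumptions to verify against your Matillion ETL version's API documentation:

```javascript
// Sketch: assemble the HTTP request that launches the Matillion ETL job.
// The query-string layout and body shape are assumptions about the v0
// tasks endpoint; check your instance's API documentation.
const mtln_url = "http://11.111.111.111/rest/v0/tasks";
const mtln_api_user = "api-user";
const mtln_api_pass = "user-password";
const groupName = 'Sales Demo';
const projectName = 'Your Project Name';
const versionName = 'default';
const environmentName = 'DEV';
const jobName = 'RunETL';

function buildTaskRequest(fileToLoad) {
  // Identify the job to run via query-string parameters
  const query =
    '?groupName=' + encodeURIComponent(groupName) +
    '&projectName=' + encodeURIComponent(projectName) +
    '&versionName=' + encodeURIComponent(versionName) +
    '&environmentName=' + encodeURIComponent(environmentName) +
    '&jobName=' + encodeURIComponent(jobName);
  return {
    method: 'POST',
    url: mtln_url + query,
    headers: {
      // HTTP Basic authentication for the API user
      'Authorization': 'Basic ' +
        Buffer.from(mtln_api_user + ':' + mtln_api_pass).toString('base64'),
      'Content-Type': 'application/json'
    },
    // Scalar variable the job will receive as file_to_load
    body: JSON.stringify({ file_to_load: fileToLoad })
  };
}

console.log(buildTaskRequest('gs://mtln-trigger/landing/flights.csv').url);
```

The returned object can then be handed to any Node.js HTTP client to fire the request.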


Creating the Cloud Function

  1. Go to Cloud Functions Overview page in the Cloud Platform Console.
    Make sure that the project for which you enabled Cloud Functions is selected.

  2. Click Create function.

  3. Name your function, e.g. mtln_file_trigger_handler.

  4. In the Trigger field, select Cloud Storage Bucket and select a bucket that should invoke this function every time an object is created.

  5. Select ZIP upload under Source Code and upload the archive created in the previous section.

  6. Select the Stage Bucket that will hold the function dependencies.

  7. Set Function to Execute to mtln_file_trigger_handler.

  8. Click Create.
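The archive uploaded in step 5 also needs a package.json alongside index.js so the runtime can resolve the function's dependencies. A minimal example might look like the following; the `request` dependency is an assumption, standing in for whichever HTTP client the attached code actually uses:

```json
{
  "name": "mtln-file-trigger",
  "version": "1.0.0",
  "main": "index.js",
  "dependencies": {
    "request": "^2.88.0"
  }
}
```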

Here’s what the Create Function screen may look like:

Now that the function is ready, let's look at the job workflow. You can import the JSON file using the Project→Import menu item.

Trigger an ETL job to extract, load and transform the data

Here is the Matillion ETL job that will load the data each time a file lands. It maintains the target table and, on each run, truncates it and loads the latest file into it. It then runs a data transformation on the loaded data which adds some calculated fields, looks up details of the airline and airport, and finally appends the results to the final fact table.

The exported job and data files are available at the bottom of this page.

Each time this runs we want to load a different file, so we can define a variable for that. The function in index.js passes a variable named “file_to_load”, so we should define that variable within Matillion ETL and give it a default value we can use to test the job. To do this:

  1. Select Project→ Edit Environment Variables.

  2. Create an environment variable as below.

This is referenced in the component Load Latest File (a Cloud Storage Load Component) as the Google Storage URL Location parameter. In this case, the entire path to the file is provided by the Cloud Function.

All variables must have a default value so the job can be tested in isolation.

Now you are ready to add some files to the bucket and trigger the job. You can see the job executing in the task panel or via Project→Task History.