Staging All Tables From A Database

Overview

A common request is to load all data from a cloud or on-premises database. This article looks at loading all data from a MySQL database using Matillion ETL for BigQuery, but the concept is the same for all databases and warehouses. This example takes data from a database called classicmodels, which contains a series of tables holding data on classic car orders.

The core idea here is that although the Database Query component will only load one table at a time, a Table Iterator can be used to load all tables from a given database.



Method

The first step is to obtain the names of all tables in the source database. This can either be a manually maintained list, or it can be obtained by querying the information schema. This query can be put in a Database Query component to write the results back to a temporary BigQuery table:
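A minimal sketch of the kind of query that could be used here, assuming a MySQL source and the classicmodels schema (adjust the schema name to match the source database):

SELECT table_name
FROM   information_schema.tables
WHERE  table_schema = 'classicmodels';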



The result of this is a BigQuery table containing the names of all tables in the classicmodels database. This BigQuery table can be linked to a Table Iterator component in Matillion, which can be used to loop through those tables. In the Table Iterator, the table_name column needs to be mapped to a variable:

The Table Iterator is linked to another Database Query component with a simple SELECT * FROM query that pulls all data from the table currently being iterated. The name of the table this is written to in BigQuery should include the variable, so that each iteration creates its own table rather than continually overwriting a single one:
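As a rough sketch, assuming the iterator maps table_name into a variable named table, the SQL in the Database Query component would be:

SELECT * FROM ${table}

with the target table name in the component set to something like stg_${table} (the stg_ prefix is purely illustrative).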

This will create a series of new tables containing the contents of the classicmodels database, which can then be used in a transformation job.

Finally, we recommend using a Delete Tables component to tidy up the classic_models_table used by the Table Iterator:



Incremental Loads

Incremental loads are often more easily dealt with by using one of Matillion ETL's Incremental Load Generators.

The next step after taking a one-off copy is to perform an incremental load on all tables. This requires checking the last modified date on each table already held in BigQuery and feeding it into a variable to be used in the Database Query component.

This can be done using a simple Python script within a Python script component:

# 'table' is the job variable populated by the Table Iterator for the current table
print(table)

# Query the copy of the table already held in BigQuery for its latest modification date
cursor = context.cursor()
cursor.execute('select max(lastmodified) from ' + table)
lastmodified = cursor.fetchone()[0]

print(lastmodified)

# Store the result in the LastModified job variable for use by the next component
context.updateVariable('LastModified', lastmodified)


This reads the maximum last modified date from the table named in the ${table} variable and stores it in a job variable called ${LastModified}.


This can then be used in the RDS Query component to pull through only the recently modified data:
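As a sketch, assuming the source tables have a lastmodified column and that the value held in ${LastModified} compares cleanly against it, the query becomes:

SELECT *
FROM   ${table}
WHERE  lastmodified > '${LastModified}'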



This nested Orchestration job can then be called from within the iterator:

If desired, the staged tables can be merged into persistent tables in BigQuery using our slowly changing dimensions example here.