Making Multiple API Queries
Matillion’s API Query Component is ideal for pulling data from a REST or SOAP API however there are cases where using a python script to make the API call may be more efficient.
Suppose we have a list of addresses, in this case all English football stadiums, for which we require the corresponding longitude and latitude. This can be achieved using the Google Maps API which takes an address and returns the longitude and latitude in a JSON format. However a limitation of this Google Maps API is that it only allows one address per API query. So if you have 100 addresses you will need to make 100 API queries.
This can be achieved in Matillion by configuring the API profile and using the API Query component with a table iterator. However it will create 100 individual Redshift tables with one row of data in each. These tables will need to be merged on the Redshift database using a Matillion Transformation job.
An example of this job is below:
The API query component is using the rsd file attached as an API profile which is configured to return the longitude and latitude for a given address which is input. The table iterator is iterating around each address and running the API Query and writing the longitude and latitude back to an individual table per iteration. These tables are then merged together using a transformation job using a multi table input:
The final step is to delete all of the temporary tables created.
This job is clearly inefficient as the API Query component will create many files and then load each of these files into an individual table in Redshift which results in many tables in Redshift which then need to be deleted.
Another option is to do the API calls in python and append the results to one file on the Matillion EC2 instance and then write this file to Redshift so only one table is created in Redshift. An example of this job is below:
This job starts by creating a blank file and then iterating round the table with the addresses and making the API call via python. The results from the API call are then written back to a csv file. The python from the API call is here:
A bash script component then copies the local file up to S3 so it can be loaded into Matillion using the S3 Load component.
Both jobs achieve the same output but using the python script saves the load on the Redshift server and is about 30s quicker. The below task history shows the job runs. The top task is using the Matillion API Query Component, the second task is using the custom Python:
The Matillion API Query component can manage almost all API calls, however there are some exceptions where it may be better to use a Python script to do the API call.