Using R with Matillion ETL for Redshift

Overview

Matillion ETL for Redshift runs as an EC2 Linux instance in your AWS VPC. You have full administrative access to the EC2 instance and can install additional software, such as the statistical package R.

From within Matillion you can use the Bash Script component to run operating system commands, which means you can make use of software you have installed yourself.

This document demonstrates how to do this, using a sample dataset of fuel efficiency. A demonstration job is also attached to this tutorial; it can be downloaded and imported into Matillion ETL.


Installing R

You’ll need administrative privileges to perform the software installation, so first connect to your EC2 instance using SSH, and then issue the following commands:

sudo su -
yum install R

After successful installation, launch R as root by entering the command “R”. From the interactive interface, install the RJDBC package by issuing the following R command:

install.packages("RJDBC", dependencies=TRUE)
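As a quick check, you can load the package from the same interactive session; if it loads without error, the installation succeeded:

# Quick sanity check - should attach the package without errors
library(RJDBC)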

After installing RJDBC, you can leave the interactive R shell by typing q() and then n (to quit without saving the workspace).


Sample data set

The sample data is held in a publicly accessible Matillion S3 bucket s3://matillion-public, in an Excel 2010 file named EfficiencyExcel2010.xlsx.

The file contains only one worksheet, which is formatted with a header line. Every row contains data from one refuelling. Partway through the time series, Volkswagen performed their “defeat device” intervention, marked in the screenshot below at row 66.

The question is: did the Volkswagen intervention result in reduced fuel efficiency? To find out, we’ll use Matillion to load the data:

  • First into Redshift.

  • Then into R, in order to take advantage of one of R’s built-in statistical functions.


Loading the data

The source data is held in S3 in Excel format, so the first step is an Excel Query component in an Orchestration job.

When creating the Excel Query component, you can select all the columns, including RowId. This column will be useful later to differentiate between data before and after the technical intervention.



Preparing the data for analysis

Matillion uses Redshift’s power to prepare data for analysis by pushing SQL statements down onto the database. There are only about a hundred data points in this example, so no aggregation is needed; however, the same pattern is applicable to a wide variety of analytical use cases.

This Matillion Transformation job takes the input data and splits it into two data sets (“before” and “after”), ready for consumption by the analytics package.

The “before” and “after” data sets are implemented as views, so the pushdown SQL won’t actually run until the data is queried.

The main problem is how to split the records into “before” and “after”. The data set includes a marker (at row 66) to indicate the first refuel after the intervention, so this row is picked out in an aggregation step:

The resulting single record is joined to the same data set again to add the RowId of the intervention. No actual join condition is needed, so you can use a constant expression such as 1=1 as the Join Expression.

Downstream of that, a Calculator component adds a new column containing the text ‘Before’ or ‘After’, based on a CASE statement.

The data is then replicated and two complementary filters define the contents of the two named views.


Using R

Once R has been installed on the EC2 instance, it can be accessed from a Bash Script component using the “Here Document” syntax.

To run R commands, it looks like this:

cat <<EOF | R -q --vanilla

# Your commands here

EOF
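
For example, a minimal Bash Script component body to verify that R is reachable might look like this (just a sanity check that prints the R version):

cat <<EOF | R -q --vanilla
# Confirm R runs from within the Bash Script component
print(R.version.string)
EOF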

The technique will be to ingest the “before” and “after” data sets from Redshift using JDBC, and then to use an independent-sample t-test to check whether there is a significant difference between the two.

Matillion comes with a pre-installed JDBC driver that can communicate with Redshift. At the time of writing, this is:

/usr/share/emerald/WEB-INF/lib/postgresql-9.4-1206-jdbc41.jar

Load the RJDBC package, then use its JDBC() function to register this driver:

library(RJDBC)
jdrv <- JDBC(driverClass="org.postgresql.Driver",
             classPath="/usr/share/emerald/WEB-INF/lib/postgresql-9.4-1206-jdbc41.jar",
             identifier.quote="'")

Then set up environment variables for the Redshift JDBC endpoint, username and password. You can copy and paste the JDBC URL from the AWS Redshift console. Note that in this case we’re using the Postgres JDBC driver, so your URL should begin with jdbc:postgresql:// instead of jdbc:redshift://.
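
Putting this together, establishing the connection might look like the sketch below; the endpoint, database, username and password are placeholders to replace with your own values (for example by expanding the environment variables you set up):

# Placeholder connection details - substitute your own
conn <- dbConnect(jdrv,
                  "jdbc:postgresql://<your-cluster-endpoint>:5439/<your-database>",
                  "<username>", "<password>")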

Once a JDBC database connection has been established, you can create two data frames containing the records for comparison.

bef <- dbGetQuery(conn, "SELECT mpg FROM v_eff_before")
aft <- dbGetQuery(conn, "SELECT mpg FROM v_eff_after")

Note that it’s best to keep the size of these data frames as small as possible, since the runtime memory is limited. If you have large amounts of input data, make Redshift do the heavy lifting by performing all the necessary aggregations and filters inside your view definitions.
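
For example, rather than pulling every raw row back into R, you could push an aggregation down into the query itself. The sketch below is illustrative only; the refuel_date column is hypothetical:

# Aggregate inside Redshift so only summary rows travel to R (refuel_date is a hypothetical column)
monthly <- dbGetQuery(conn, "SELECT DATE_TRUNC('month', refuel_date) AS mth, AVG(mpg) AS avg_mpg FROM v_eff_before GROUP BY 1 ORDER BY 1")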


Testing for statistical significance

Once the data frames are loaded, there’s a simple R function to perform the independent-sample t-test. Note the backslash before each $ below: because this code runs inside a bash “Here Document”, the dollar signs must be escaped so that bash passes them through to R rather than expanding them as shell variables.

t.test(as.numeric(bef\$mpg), as.numeric(aft\$mpg))

Output from this function call includes the confidence intervals and the p-value. The p-value is the probability (between 0 and 1) of seeing a difference at least as large as the one observed between the two sample means if there were really no difference between the two data sets.

In this particular example, the p-value of 0.7428 is far above any conventional significance threshold (such as 0.05), so we cannot reject the null hypothesis: Volkswagen’s intervention had no detectable impact upon fuel efficiency.
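
For reference, here is how the whole analysis might be assembled into a single Bash Script component. This is a sketch built from the steps above: the endpoint and credentials are placeholders, and every dollar sign in the R code is escaped with a backslash so that bash passes it through to R unchanged.

cat <<EOF | R -q --vanilla
library(RJDBC)

# Register the Postgres JDBC driver shipped with Matillion
jdrv <- JDBC(driverClass="org.postgresql.Driver",
             classPath="/usr/share/emerald/WEB-INF/lib/postgresql-9.4-1206-jdbc41.jar",
             identifier.quote="'")

# Placeholder connection details - substitute your own
conn <- dbConnect(jdrv,
                  "jdbc:postgresql://<your-cluster-endpoint>:5439/<your-database>",
                  "<username>", "<password>")

# Pull the two small result sets defined by the views
bef <- dbGetQuery(conn, "SELECT mpg FROM v_eff_before")
aft <- dbGetQuery(conn, "SELECT mpg FROM v_eff_after")

# Independent-sample t-test; \$ is escaped so bash leaves it for R
t.test(as.numeric(bef\$mpg), as.numeric(aft\$mpg))

dbDisconnect(conn)
EOF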