Helping with the GDPR


Overview

From 25/5/2018, any company operating in the EU will be required to comply with the new GDPR data protection legislation if it stores or processes personal information relating to an individual.

If you’re a Matillion ETL user, your installation will be running within your AWS infrastructure and so should be considered in the same way as your other EC2 instances - but is there any way Matillion ETL can help with your wider GDPR responsibilities?

Due to the ability to integrate Matillion ETL with services such as S3, Redshift, Snowflake, BigQuery and SNS, the answer is yes - and indeed we use some of these processes internally within Matillion.

This is of course not an exhaustive list, and you should consult legal experts if you’re in any doubt about your GDPR compliance. The examples below are part of a wider GDPR policy and set of processes, and are shown here only as an illustration of the kind of checks that are possible.


Technical Processes Used:

  1. Querying AWS resources such as Redshift snapshots and S3 Buckets from within Matillion ETL.

  2. Querying the data warehouse system tables from within Matillion ETL.

  3. Getting data from Google Spreadsheets and comparing it with information from AWS.

 

Example Process

One of the main GDPR principles is ensuring that data is kept up to date and not retained for longer than required - as well as being a legal requirement, this is just good data practice. When using Matillion ETL to create a data warehouse in Redshift, Snowflake or BigQuery, some areas where data may be stored and become stale include:

  • In snapshots/backups of the data warehouse.

  • In testing databases used during development.

  • In Buckets (where CSV files or other files may be stored as required by your processes).

We’ll look at how Matillion ETL can help with reporting on the above, to aid your GDPR processes. These processes report via SNS, so you’ll need an SNS topic set up in advance.

 

Backups

Redshift will automatically take snapshots (backups) of your clusters multiple times a day; however, these will expire after a set time (configured in the AWS Console). This means that, as long as you delete any data that must be removed as part of GDPR, even the automatic snapshots will eventually expire and the data will be removed from your backups.

Manual snapshots, however, will never expire and must be deleted manually, which means keeping track of them.

In the attached file you can find the example job Report Redshift Snapshots, which will query your Redshift clusters and alert you via an SNS topic to all of your manual snapshots:

 

See the instructions in the comment for the changes you’ll need to make to get it to work for your environment.

The majority of the logic is handled within the Python component, which builds up an output message (a list of snapshots); this is then passed into the SNS component to be sent to the SNS topic and on to the recipients.
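
As a rough illustration of that approach, the sketch below uses boto3 to list the manual snapshots and build the report. It is a minimal example rather than the attached job itself - in the attached job the message is handed to the SNS component, whereas here it is published directly - and the region and topic ARN are placeholders for your own values:

    # Minimal sketch: list manual Redshift snapshots and publish them to an SNS topic.
    # Assumes boto3 is available to the Python component and the instance credentials
    # allow redshift:DescribeClusterSnapshots and sns:Publish.
    # The region and topic ARN below are placeholders.
    import boto3

    region = 'eu-west-1'                                   # placeholder region
    topic_arn = 'arn:aws:sns:eu-west-1:123456789012:gdpr'  # placeholder SNS topic ARN

    redshift = boto3.client('redshift', region_name=region)
    sns = boto3.client('sns', region_name=region)

    # Manual snapshots never expire, so list them all for review
    # (first page of results only, for brevity).
    snapshots = redshift.describe_cluster_snapshots(SnapshotType='manual')['Snapshots']

    lines = ['Manual Redshift snapshots currently held:']
    for snap in snapshots:
        lines.append('%s (cluster %s, created %s)' % (snap['SnapshotIdentifier'],
                                                      snap['ClusterIdentifier'],
                                                      snap['SnapshotCreateTime']))

    # Publish the report so subscribers to the topic receive it by email.
    sns.publish(TopicArn=topic_arn,
                Subject='GDPR check: manual Redshift snapshots',
                Message='\n'.join(lines))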

The received email looks like:

 

Snowflake has two main methods of backing up data, both of which potentially hold onto customer data: Time Travel and Fail-safe.

Even after removing data (for example, by dropping a table), Time Travel will allow you to restore it for up to 90 days. The Time Travel data retention time can be changed (and turned off) using the DATA_RETENTION_TIME_IN_DAYS parameter, detailed here.

Fail-safe is, unfortunately, not configurable but will only hold data for 7 days (beginning from when Time Travel's retention ends).
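
As an illustration of how that retention can be inspected and reduced, here is a minimal sketch using the snowflake-connector-python package. This is not part of the attached jobs, and the account, credentials and table name are placeholders for your own environment:

    # Minimal sketch: inspect and reduce Time Travel retention on a Snowflake table.
    # Assumes the snowflake-connector-python package is installed; the account,
    # credentials and table name are placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(account='my_account',    # placeholder account
                                       user='my_user',          # placeholder user
                                       password='my_password')  # placeholder password
    cur = conn.cursor()
    try:
        # Show the current Time Travel retention for a table.
        cur.execute("SHOW PARAMETERS LIKE 'DATA_RETENTION_TIME_IN_DAYS' "
                    "IN TABLE my_db.public.customers")
        print(cur.fetchall())

        # Reduce retention to 1 day (setting it to 0 disables Time Travel for the table).
        cur.execute("ALTER TABLE my_db.public.customers "
                    "SET DATA_RETENTION_TIME_IN_DAYS = 1")
    finally:
        cur.close()
        conn.close()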

 

Testing Databases

Perhaps during development you use one or more testing databases in your data warehouse. If these use customer data, this could be a GDPR concern if the environments aren’t cleared up after development - old data could lie around for months or years.

As such, it’s good practice to always clear down test environments after development - and an automated process that alerts you to which environments are populated can help with this.

 

 

This example job 2) Report Databases with non-empty tables (and the associated Transform job 2_1) Create Report - non-empty tables) will query the data warehouse system tables and report any databases with non-empty tables, so you can ensure only the required environments are populated.

See the instructions in the comment for the changes you’ll need to make to get it to work for your environment.

Note: the Matillion ETL environment this runs within will need sufficient data warehouse privileges to query the system tables for the environments needed - you may wish to run this as a root or admin user depending on your requirements.
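
As an indication of the kind of system-table query involved, the sketch below checks a single Redshift database for non-empty tables using SVV_TABLE_INFO. This is an illustration only - the attached jobs do this with Matillion components rather than a direct connection - and the host, credentials and database name are placeholders:

    # Minimal sketch: report non-empty tables in the current Redshift database.
    # Illustration only - the attached jobs do this with Matillion components.
    # Assumes psycopg2 is installed; host, credentials and database are placeholders.
    import psycopg2

    conn = psycopg2.connect(host='my-cluster.abc123.eu-west-1.redshift.amazonaws.com',
                            port=5439,
                            dbname='test_db',        # placeholder testing database
                            user='my_user',
                            password='my_password')

    with conn.cursor() as cur:
        # SVV_TABLE_INFO lists user tables in the current database with row counts.
        cur.execute('SELECT "schema", "table", tbl_rows '
                    'FROM svv_table_info WHERE tbl_rows > 0 ORDER BY tbl_rows DESC')
        for schema, table, rows in cur.fetchall():
            print('%s.%s still holds %s rows' % (schema, table, rows))

    conn.close()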

The received email looks like:

 

Bucket File checking

In many cases you may be storing reference data in Buckets to be used by Matillion ETL jobs - perhaps this is business data that arrives in CSV form or lookups provided by external companies.

A requirement of GDPR is to know which data is stored where so it is likely that you have a log of this business data, with details of at least:

  1. The name of the file

  2. The Bucket it resides in

However, there will be a need to check that this list remains up to date, and also to ensure that any new files that appear are identified. If this information is stored in a Google Spreadsheet, Matillion ETL can be set up to access it using the Google Sheets Query component.

In this example we’ve created a Google Spreadsheet, Matillion S3 Business Data Tracking Example, with two sheets:

Sheet S3_buckets shows the bucket or buckets in scope for the check:

Sheet S3_files shows the known files that are expected to exist in those buckets:

We have also populated a Bucket, matillion-example, with the two known files and an extra unlisted file (highlighted):

The provided example job 3) Bucket to Spreadsheet Comparison Example will:

  1. Read the list of known Buckets for checking from the Google sheet

  2. Read the list of known files for those Buckets from the Google sheet

  3. Query S3 directly using Python to determine all of the files that exist (a minimal sketch of this and the following step is shown after the list)

  4. Send an email of any files found in the monitored Buckets that do not exist on the spreadsheet - so these can either be added (to keep your GDPR data log up to date) or removed if they are not required.
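
For illustration, a minimal Python sketch of steps 3 and 4 is shown below, using boto3 to list the Bucket contents. In the attached job the list of known files comes from the Google sheet and the alert is sent via the SNS component; here the bucket name, known files and topic ARN are hard-coded placeholders:

    # Minimal sketch of steps 3 and 4: list the files in a monitored Bucket and
    # report any that are not on the known-files list. Assumes boto3 is available;
    # the known-files list and SNS topic ARN are placeholders - in the attached job
    # the known files are read from the Google sheet instead.
    import boto3

    bucket = 'matillion-example'                             # the example Bucket
    known_files = {'customers.csv', 'postcode_lookup.csv'}   # placeholder known files
    topic_arn = 'arn:aws:sns:eu-west-1:123456789012:gdpr'    # placeholder SNS topic ARN

    s3 = boto3.client('s3')
    sns = boto3.client('sns')

    # List every object currently in the Bucket (paginated to cover large Buckets).
    found = set()
    for page in s3.get_paginator('list_objects_v2').paginate(Bucket=bucket):
        for obj in page.get('Contents', []):
            found.add(obj['Key'])

    # Anything in the Bucket but missing from the data log needs investigating.
    unlisted = sorted(found - known_files)
    if unlisted:
        sns.publish(TopicArn=topic_arn,
                    Subject='GDPR check: unlisted files in ' + bucket,
                    Message='\n'.join(unlisted))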

  


As you’ve hopefully seen from the screenshots, the S3 bucket we’re checking has an extra file that isn’t logged on our known-data spreadsheet. When the email arrives it looks like this: