Accessing files in S3 using Pre-signed URLs


Overview

Typically, a file in S3 can only be accessed by its owner, unless it has been made public or shared with other IAM users. However, using the AWS SDK, a user with access to the file can generate a pre-signed URL which allows anyone to access/download it. This is ideal for software applications or processes that need brief access to the file to consume its contents.

Please be cautious when sharing a pre-signed URL.

By default, a pre-signed URL is valid for 3600 seconds (1 hour). We’d recommend a much shorter duration, depending on how long your process or user needs to consume the file.

See the Amazon documentation on using pre-signed URLs.


Generate a pre-signed URL

We’ll use the Python Script component and the supported boto3 package to generate a pre-signed URL. The Python Script component already has access to the AWS credentials assigned to the instance; boto3 will use these to generate the URL for a resource/file in S3.

The Python snippet below generates a URL (_uri) and assigns it to the project variable s3_uri, which can then be used in the job to access the file. Initialise the variables bucket, file_key and uri_duration as appropriate.

import boto3

bucket = 'mtln-public-data'      # name of the S3 bucket
file_key = 'Samples/books.xml'   # key, including any folder paths
uri_duration = 10                # expiry duration in seconds (default 3600)

# The component's assigned AWS credentials are picked up automatically
s3_client = boto3.client('s3')

# Generate a pre-signed GET URL for the object
_uri = s3_client.generate_presigned_url(
    'get_object',
    Params={'Bucket': bucket, 'Key': file_key},
    ExpiresIn=uri_duration
)

# Make the URL available to the rest of the job
context.updateVariable('s3_uri', _uri)


Please note, the URL contains enough information to allow anyone access to the file. By default, a pre-signed URL is valid for 3600 seconds, but the script above limits this to just 10 seconds via the uri_duration variable; change it if necessary.
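
While it is valid, the URL can be fetched like any ordinary HTTPS link, with no AWS credentials. As a quick check, here is a minimal sketch using only the Python standard library, assuming it runs straight after the snippet above so _uri is still in scope:

import urllib.request

# No AWS credentials are needed here; the signature is embedded in the URL
with urllib.request.urlopen(_uri) as response:
    body = response.read()

print(len(body), 'bytes downloaded')

Once uri_duration has elapsed, the same request fails with an HTTP 403 error.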


Example

Below is a workflow which loads an XML/JSON file from S3 into Amazon Redshift. The Python script generates a pre-signed URL for the file and the API Query component loads the file into Redshift.

Files are attached and available for download at the bottom of this article: the Matillion ETL job file rs_presigned_url_job.json, plus the airports.json and airports.rsd files.

To run this example:

  1. Download and import the job into Matillion ETL.

  2. Create a new API Profile called Samples and then a data source called Airports. Copy the RSD from the attached file, airports.rsd.

  3. Host the airports.json file in an accessible bucket (one way to upload it is sketched after this list).

  4. Modify the Python script, changing the bucket and file_key variables to point at the file.

  5. Run the job.
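
For step 3, here is a minimal sketch of uploading the file with boto3. The bucket name my-example-bucket is a placeholder for a bucket you control, and the script assumes airports.json is in the current working directory:

import boto3

bucket = 'my-example-bucket'  # placeholder; replace with a bucket you control

s3_client = boto3.client('s3')

# Upload the attached sample file under the Samples/ prefix
s3_client.upload_file('airports.json', bucket, 'Samples/airports.json')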


Sample output

File in S3:

s3://mtln-public-data/Samples/airports.json

Pre-signed URL:

https://mtln-public-data.s3.amazonaws.com/Samples/airports.json?AWSAccessKeyId=AKIAJSVY7VZTAUN42OMQ&Expires=1499951483&Signature=uBh3ozU8Z4pI%2B8BM3CcE29xqH%2FY%3D
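
The query string carries everything S3 needs to validate the request. With the older signature style shown above, Expires is a Unix timestamp; as a minimal sketch, it can be decoded to see exactly when the URL stops working:

from urllib.parse import urlparse, parse_qs
from datetime import datetime, timezone

# The sample pre-signed URL from above
url = 'https://mtln-public-data.s3.amazonaws.com/Samples/airports.json?AWSAccessKeyId=AKIAJSVY7VZTAUN42OMQ&Expires=1499951483&Signature=uBh3ozU8Z4pI%2B8BM3CcE29xqH%2FY%3D'

params = parse_qs(urlparse(url).query)
expires = int(params['Expires'][0])
print(datetime.fromtimestamp(expires, tz=timezone.utc))  # expiry time in UTC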