S3 Manifest Builder

This article applies to the following platform: Redshift.

Builds and uploads a manifest file that describes a set of files (S3 objects) to load. The files are selected by testing object names against a regular expression.

If the pattern is simply a prefix, this component is not required, because the S3 Load component can use a prefix directly. For more complicated file-naming schemes, however, use this component to build a manifest file first, and then use an S3 Load with the manifest option enabled and a path pointing to the generated manifest file.


Properties

Property | Setting | Description
Name | Text | The descriptive name for the component.
Source S3 Path | S3 Tree | The S3 bucket (and optionally the path within it) to search.
Manifest Path | S3 Tree | The S3 location to write the manifest file to.
Manifest File name | Text | The file name for the manifest.
Filter Regular Expression | Regular Expression | All names found from Source S3 Path are tested against this regular expression, and if they match they are included in the manifest. This requires an exact match, so if you are searching for text anywhere within the name you will need to include wildcards before and after the text.
Mandatory | Select | Yes writes the Mandatory option into the manifest, meaning the subsequent S3 Load will fail if the file is not present. No makes the file optional, and no error will be raised if the file is missing.
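
The exact-match rule for the filter can be illustrated with a minimal Python sketch. It assumes the component's matching behaves like a whole-string match (Python's `re.fullmatch`); the component's actual regex engine is not stated here, and the object names are hypothetical.

```python
import re

# Hypothetical object names, used only for illustration.
names = ["crime_2014.csv.gz", "crime_2015.csv.gz", "readme.txt"]


def matches_filter(pattern: str, name: str) -> bool:
    """Return True only if the pattern matches the entire name."""
    return re.fullmatch(pattern, name) is not None


# A bare "gz" must match the whole name, so nothing is selected.
bare = [n for n in names if matches_filter("gz", n)]

# Wildcards before and after the text allow it to appear anywhere.
wrapped = [n for n in names if matches_filter(".*gz.*", n)]

print(bare)
print(wrapped)
```

Under this assumption, the bare pattern selects nothing, while the wildcard-wrapped pattern selects both gzipped files.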

Variable Exports

This component makes the following values available to export into variables:

Source | Description
Matched Objects | The number of objects matching the filter pattern and written to the manifest.

Example

In this example, we want to load multiple data files into a table. The file names have little in common, which makes it difficult to load them all at once using S3 Load with a prefix. To overcome this, we build a manifest of the files using the S3 Manifest Builder component. S3 Load can then read this manifest to load multiple files at once. The Orchestration job is shown below.

To begin, we create a new table and assign it column names according to the data we are expecting using the Create/Replace Table component. Next, we use the S3 Manifest Builder component to create a new manifest. Below, we show how the properties for this component are configured.

The Source S3 Path is the path of the files we want to load. The Manifest Path is the path the manifest file is saved to. In our case these are the same path, but they need not be. We name the manifest file 'crime.manifest'; the name is not important, but the '.manifest' extension is required. Finally, we use a regular expression to find the files we want to load. In this case, we want to load all gzipped files, so any file name containing 'gz' qualifies. Because the filter requires an exact match, the expression needs wildcards on both sides of 'gz' (for example, .*gz.*). The Mandatory option is a matter of preference.
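
To make the result more concrete, here is a rough Python sketch of the kind of manifest such a configuration would produce. It follows the Redshift manifest layout (an `entries` list of objects with `url` and `mandatory` fields). The bucket name, the keys, and the `build_manifest` helper are hypothetical, and matching against the full key is an assumption; this is not the component's actual implementation.

```python
import json
import re


def build_manifest(bucket, keys, pattern, mandatory):
    """Build a Redshift-style manifest dict from keys that fully match pattern."""
    regex = re.compile(pattern)
    entries = [
        {"url": f"s3://{bucket}/{key}", "mandatory": mandatory}
        for key in keys
        if regex.fullmatch(key)  # whole-name match, as the filter requires
    ]
    return {"entries": entries}


# Hypothetical bucket and keys, used only for illustration.
keys = ["data/crime_a.csv.gz", "data/stats-2015.gz", "data/notes.txt"]
manifest = build_manifest("example-bucket", keys, ".*gz.*", mandatory=True)

# The number of entries corresponds to the Matched Objects export.
matched_objects = len(manifest["entries"])

print(json.dumps(manifest, indent=2))
```

In this sketch the text file is skipped, and each remaining entry carries the same `mandatory` flag, mirroring the single Mandatory setting on the component.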

Now we are ready to move on to the S3 Load component, which is configured as follows:

Some properties have been excluded from the image for the sake of clarity. Importantly, the Manifest property has been set to Yes, and the S3 URL Location and S3 Object Prefix point to the manifest file rather than to the .gz files to be loaded. Note also that the Compression Method property is set to Gzip, because the files listed in the manifest are gzip-compressed.
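
For readers who think in terms of the underlying Redshift command, these settings correspond roughly to a COPY statement with the MANIFEST and GZIP options. The Python sketch below only assembles such a statement as text. The table name, bucket, and IAM role are hypothetical placeholders, and the SQL that S3 Load actually generates is not shown in this article, so treat this as an approximation.

```python
def copy_from_manifest(table, manifest_url, iam_role):
    """Assemble a Redshift COPY statement that loads gzipped files via a manifest."""
    return (
        f"COPY {table}\n"
        f"FROM '{manifest_url}'\n"  # points at the manifest, not the data files
        f"IAM_ROLE '{iam_role}'\n"
        "MANIFEST\n"                # treat the FROM target as a manifest file
        "GZIP;"                     # the listed files are gzip-compressed
    )


# Hypothetical names, used only for illustration.
sql = copy_from_manifest(
    table="crime_data",
    manifest_url="s3://example-bucket/data/crime.manifest",
    iam_role="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(sql)
```

The key point is that the FROM location names the manifest file itself, which is why the S3 Load properties point at 'crime.manifest' rather than at the data files.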

With S3 Load configured, the job can now be run. Because manifest-based loads typically involve a lot of data, the load might take some time.

When complete, we can sample the table we loaded data into using the Table Input component in a Transformation job.