S3 Load Generator (Redshift)


Overview

S3 Load Generator is a tool that helps users load delimited data from objects in an Amazon S3 (Simple Storage Service) bucket.

Unlike most components, the Load Generator does not appear as a standalone component when dragged onto the job canvas. Instead, it takes the form of a tool that allows users to load and view files on the fly, altering load component properties and observing their effects without the need for a separate Transformation job. The generator can also guess the schema of a table, saving the end user much of the work.

Note: This component requires working AWS credentials with "read" access to the bucket containing the source data file(s). This is easily achieved by attaching an IAM role to the instance when launching Matillion ETL for Redshift; however, it can also be managed by editing an Environment. See the example at the bottom of this article.

Furthermore, Matillion requires use of a policy that contains the s3:ListBucket action, such as the policy provided in the Managing Credentials documentation.
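
For reference, a minimal read-only policy of this kind might look like the following sketch. This is an illustrative example rather than the exact policy from the Managing Credentials documentation, and the bucket name is a placeholder:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": ["s3:ListBucket"],
          "Resource": "arn:aws:s3:::my-example-bucket"
        },
        {
          "Effect": "Allow",
          "Action": ["s3:GetObject"],
          "Resource": "arn:aws:s3:::my-example-bucket/*"
        }
      ]
    }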

 

Layout


When a user drags the S3 Load Generator onto the canvas, a three-page setup wizard appears.

A file must first be selected from an S3 bucket. The file must contain delimited data (a .csv file, for example) and the user must have permission to access it. If the file is compressed, the correct compression method must be selected; supported compression methods are gzip and bzip2.

If the selected file is viable, users can select the number of rows they wish to sample and click the Get Sample button. Here, the tool will load the file's data and attempt to guess its schema. Raw data for the file is displayed in the large panel below.

 

On page 2, column data is displayed in the lower-right panel and is available for editing. Schema configuration properties for the table are displayed in the lower-left panel; the available properties are detailed in the table in the "Properties" section below.

Selecting the Guess Schema button instructs the Load Generator to attempt to guess as many of these properties as possible. However, all of the properties remain available for manual editing by the user, as does the column data in the lower-right panel.

 

When the user is satisfied with the settings, the resulting table output can be viewed by selecting the 'Test' button at the bottom of page 3 of the setup wizard. Running a test creates the requested table on the Redshift cluster and shows a sample for the user to inspect. If the user is unhappy with the output, they can return to the 'Configuration' panel on page 2 and make alterations until they are satisfied with the result.
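
Conceptually, a test run amounts to creating the configured table, loading it with a Redshift COPY, and selecting a sample from it. The sketch below illustrates that sequence only; the table name, column definition, S3 path and IAM role are placeholders, not the exact SQL Matillion generates.

    -- Create the table as configured in the wizard (placeholder definition)
    CREATE TABLE "my_test_table" ("col_1" VARCHAR(255));

    -- Load it using the wizard's load options (placeholder path and role)
    COPY "my_test_table"
    FROM 's3://my-example-bucket/data.txt'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    DELIMITER '\t';

    -- Inspect a sample of the result
    SELECT * FROM "my_test_table" LIMIT 50;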

 

Properties

For more information on the settings in this component, see the Amazon Redshift COPY syntax documentation.

Each property is listed below as "Property (Setting type): Description".

Table Name (Text): The descriptive name of the table.
Data Type (Select): CSV: a Comma Separated Values file, delimited by commas. Delimited: any file containing delimited data.
Delimiter (Text): Only available if 'Data Type' is 'Delimited'. The delimiter that separates columns; the default is a comma (,). A [TAB] character can be specified as "\t".
CSV Quoter (Text): Specifies the character to be used as the quote character when using the CSV option.
Region (Select): The Amazon S3 region hosting the S3 bucket. This is not normally required and can be left as "None" if the bucket is in the same region as your Redshift cluster.
Replace Invalid Characters (Text): If there are any invalid Unicode characters in the data, this parameter specifies the single-character replacement for them. Defaults to '?'.
Max Errors (Text): The maximum number of individual parsing errors tolerated before the whole load fails; erroneous values up to this count are substituted with nulls. Defaults to 0, although the Amazon default is 1000.
Date Format (Text): Defaults to 'auto'; can be used to manually specify a date format.
Time Format (Text): Defaults to 'auto'; can be used to manually specify a time format.
Ignore Header Rows (Text): The number of rows at the top of the file to ignore. Defaults to 0.
Null As (Text): Replaces the specified string with null in the output table. Use this if your data has a particular representation of missing data.
Accept Any Date (Checkbox): If enabled, invalid dates such as '45-65-2018' are not considered an error but are loaded as null values.
Ignore Blank Lines (Checkbox): If enabled, any blank lines in the input file are ignored.
Truncate Columns (Checkbox): If enabled, any data in the input file that is too long to fit into the specified target column width is truncated to fit instead of causing an error.
Fill Record (Checkbox): If enabled, Matillion ETL allows data files to be loaded when contiguous columns are missing at the end of some of the records; the remaining columns are set to null.
Trim Blanks (Checkbox): If enabled, Matillion ETL removes trailing and leading whitespace from the input data.
Empty As Null (Checkbox): If enabled, empty columns in the input file become NULL.
Blank As Null (Checkbox): If enabled, blank columns in the input file become NULL.
Comp Update (Checkbox): Controls whether compression encodings are automatically applied during a COPY. Usually worthwhile, as it optimises the compression used when storing the data.
Stat Update (Checkbox): Governs automatic computation and refresh of optimiser statistics at the end of a successful COPY command.
Round Decimals (Checkbox): If enabled, any decimals are rounded to fit into the target column when the input data has more decimal places than the column defines.
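
Matillion ETL assembles the COPY statement from these properties automatically, but it can help to see how they map onto Amazon Redshift COPY options. The hand-written sketch below is illustrative only (the table name, S3 path and IAM role are placeholders) and shows a selection of the options described above:

    COPY "my_table"
    FROM 's3://my-example-bucket/data.txt.gz'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    REGION 'eu-west-1'        -- Region
    GZIP                      -- the file's compression method
    DELIMITER '\t'            -- Delimiter
    IGNOREHEADER 1            -- Ignore Header Rows
    MAXERROR 0                -- Max Errors
    DATEFORMAT 'auto'         -- Date Format
    TIMEFORMAT 'auto'         -- Time Format
    NULL AS 'N/A'             -- Null As
    ACCEPTINVCHARS '?'        -- Replace Invalid Characters
    ACCEPTANYDATE             -- Accept Any Date
    IGNOREBLANKLINES          -- Ignore Blank Lines
    TRUNCATECOLUMNS           -- Truncate Columns
    FILLRECORD                -- Fill Record
    TRIMBLANKS                -- Trim Blanks
    EMPTYASNULL               -- Empty As Null
    BLANKSASNULL              -- Blank As Null
    COMPUPDATE ON             -- Comp Update
    STATUPDATE ON             -- Stat Update
    ROUNDEC;                  -- Round Decimals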

 

Example

In the following example, the S3 Load Generator is used to load the contents of a small file into a table. First, the Load Generator is given the path to the file; since the file is not compressed, the Compression setting is left as "None". After taking a sample of the data, the large panel displays raw data from the file (up to the number of lines specified by the 'Row Limit'), and we can see that this dataset is a list of US states and their airport codes, delimited by whitespace.

 

On page 2, clicking 'Guess Schema' prompts the Load Generator to attempt to complete the properties for this file automatically. In the properties panel (lower left), the Load Generator has identified the whitespace as tab characters [\t].

However, in the lower-right panel, we can see that only one column of our two-column dataset has been guessed by the wizard.

 

On the final page of the setup wizard, clicking the 'Test' button returns an error for our example. Although the data appears to be delimited correctly, we have already identified that the problem lies with the columns: the Load Generator guessed only one column name (incorrectly, in this particular case) and did not provide the correct name. Now is our chance to go back through the wizard and make the necessary corrections.

 

To correct the problems, we return to the Guess Schema section of the wizard and add a second column by clicking the + button. This is also a good moment to double-check that the 'Ignore Header Rows' parameter is set to 0, meaning the file is assumed to have no header rows at all, so that we can define the column names manually. Next, in the lower-right panel, we select each field under 'Name' and give the two columns more appropriate names; in this case, a column called "state" and a column called "airport code".

 

Returning to the Test page, the S3 Load Generator updates the sample table with our new properties. The test result provides a sample two-column table, just as we wanted. The columns are labelled appropriately, and we can now go on to use this table with other Matillion ETL for Redshift jobs and components.

 

Clicking 'OK' returns the user to the job interface and creates two linked components: Create/Replace Table and Load. Each component is parameterised by the S3 Load Generator, and the pair can be run as a job by linking them to a Start component.
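
For the worked example above, the SQL issued by those two components would be roughly equivalent to the following sketch (the table name, column types, S3 path and IAM role are illustrative placeholders):

    CREATE TABLE "airports" (
        "state"        VARCHAR(256),
        "airport code" VARCHAR(256)
    );

    COPY "airports"
    FROM 's3://my-example-bucket/airports.txt'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    DELIMITER '\t'      -- tab delimiter, as guessed on page 2
    IGNOREHEADER 0      -- the file has no header rows
    MAXERROR 0;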