Storage Load Generator

Overview

S3 Load Generator is a tool that helps users load delimited data from public objects in an S3 bucket (Amazon Simple Storage Service).

Cloud Storage Load Generator is a tool that helps users load delimited data from public objects in Google Cloud Storage.

Unlike most components, the Load Generator does not appear as a standalone component when dragged onto the job interface. Instead, Load Generator takes the form of a tool that allows users to load and view files on the fly, altering Load component properties and seeing their effects without the need for a separate Transformation job. The generator can also guess the schema of a table, sparing the end user much of the work.

Note: The generated Load component requires working AWS credentials with read access to the bucket containing the source data file(s). This is easily achieved by attaching an IAM role to the instance when launching Matillion ETL for Redshift; however, it can also be managed manually by editing an Environment. See the example at the bottom of this article.

Furthermore, Matillion requires the use of a policy that contains the s3:ListBucket action, such as the policy provided in the Managing Credentials documentation.
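
As an illustration, the snippet below sketches how to confirm that a set of AWS credentials can both list the bucket and read the source object, mirroring the read-access and s3:ListBucket requirements above. It uses the boto3 library; the bucket and key names are hypothetical placeholders.

    import boto3
    from botocore.exceptions import ClientError

    # Picks up the instance IAM role or any locally configured credentials.
    s3 = boto3.client("s3")

    bucket = "my-source-bucket"  # hypothetical bucket name
    key = "data/states.csv"      # hypothetical object key

    try:
        s3.list_objects_v2(Bucket=bucket, MaxKeys=1)  # requires s3:ListBucket
        s3.head_object(Bucket=bucket, Key=key)        # requires read access to the object
        print("Credentials can list the bucket and read the source file")
    except ClientError as err:
        print("Access problem:", err.response["Error"]["Code"])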


Layout

A file must first be selected from an S3 bucket (or, for Cloud Storage Load Generator, from a Google Cloud Storage bucket). This file must be delimited (including .csv) and the user must have permission to access the file. If the file is compressed, the correct compression method must be selected; the supported compression methods are gzip and bzip2 (illustrated below).
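
For context, gzip and bzip2 files can be told apart by their leading magic bytes. The following hypothetical helper (not part of Load Generator) picks a matching reader in the same way the compression-method setting does; it is a minimal sketch, assuming a local file path.

    import bz2
    import gzip

    def open_maybe_compressed(path):
        # Choose a reader based on the file's magic bytes, mirroring the
        # gzip/bzip2 choice offered in the Load Generator interface.
        with open(path, "rb") as fh:
            magic = fh.read(3)
        if magic[:2] == b"\x1f\x8b":  # gzip signature
            return gzip.open(path, "rt")
        if magic == b"BZh":           # bzip2 signature
            return bz2.open(path, "rt")
        return open(path, "rt")       # plain, uncompressed text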

If the selected file is viable, the user can choose the number of rows they wish to sample and select 'Get Sample' for the tool to load the file's data and attempt to guess its schema. Raw data from the file is displayed in the uppermost panel, while its column data is displayed in the lower-right panel and is available for editing. Schema properties for the table are displayed in the lower-left panel; the available properties are detailed in the table below.
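
To give a feel for what 'Get Sample' and 'Guess Schema' do, here is a rough sketch of the same kind of inference using Python's csv module; Load Generator's actual heuristics are its own, and the sample rows are hypothetical.

    import csv
    import io

    def sample_and_guess(text, row_limit=100):
        # Guess the delimiter and quoting, guess whether the first row is a
        # header, then return up to row_limit parsed rows.
        dialect = csv.Sniffer().sniff(text)
        has_header = csv.Sniffer().has_header(text)
        rows = list(csv.reader(io.StringIO(text), dialect))[:row_limit]
        return dialect.delimiter, has_header, rows

    delimiter, has_header, rows = sample_and_guess("Alabama\tBHM\nAlaska\tANC\nArizona\tPHX\n")
    print(repr(delimiter), has_header, rows[0])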

For more information on all the settings in this component, see the Amazon Redshift COPY syntax documentation. A sketch of how some of these settings map onto a COPY statement is given below the table.

Property | Setting | Description
Data Type | Select | Delimited: any file containing delimited data. CSV: a Comma Separated Values file, delimited by commas.
Delimiter | Text | (Only available if 'Data Type' is 'Delimited'.) The delimiter that separates columns. The default is a comma. A [TAB] character can be specified as "\t".
Region | Select | The Amazon S3 region hosting the S3 bucket. This is not normally required and can be left as "None" if the bucket is in the same region as your Redshift cluster.
Encoding | Select | The encoding of the data. Defaults to UTF-8.
Remove Quotes | Select | Whether to remove any quotes surrounding data values.
Replace Invalid Characters | Text | If there are any invalid Unicode characters in the data, this parameter specifies the single-character replacement for them. Defaults to '?'.
Maximum Errors | Text | The maximum number of individual parsing errors that cause the whole load to fail. Errors up to this number will be substituted as null values. This value defaults to 0, but the Amazon default is 1000.
Ignore Unknown Values | Select | Yes: accept rows that contain values that do not match the schema; unknown values are ignored. Note: extra values at the end of a line are ignored for CSV files. No: omit any rows with invalid values.
Quote Characters | Text | The quote character used to enclose records. The default (field left blank) is a double quote ("). To indicate no quote character at all, use an empty string.
Quoted Newlines | Select | Tick to allow a CSV value to contain a newline character when the value is encased in quotation marks. Otherwise, a newline character is always considered a new row, regardless of quotation.
Allow jagged rows | Select | Tick to treat missing values as 'null' but accept the row. Otherwise, rows with missing data are treated as bad records. Note: a bad record will count toward the 'Maximum Errors' count.
Date Format | Text | Defaults to 'auto'; this can be used to manually specify a date format.
Time Format | Text | Defaults to 'auto'; this can be used to manually specify a time format.
Ignore Header Rows | Text | The number of rows at the top of the file to ignore. Defaults to 0.
Accept Any Date | Select | If enabled, invalid dates such as '45-65-2018' are not considered an error, but are loaded as the null value.
Ignore Blank Lines | Select | If set, any blank lines in the input file are ignored.
Truncate Columns | Select | If set, any instance of data in the input file that is too long to fit into the specified target column width is truncated to fit instead of causing an error.
Fill Record | Select | Allows data files to be loaded when contiguous columns are missing at the end of some of the records. The remaining columns are set to null.
Trim Blanks | Select | Removes trailing and leading whitespace from the input data.
Null As | Text | Replaces the specified string with null in the output table. Use this if your data has a particular representation of missing data.
Empty As Null | Select | If set, empty columns in the input file become NULL.
Blanks As Null | Select | If set, blank columns in the input file become NULL.
Comp Update | Select | Controls whether compression encodings are automatically applied during a COPY. This is usually a good idea to optimise the compression used when storing the data.
Stat Update | Select | Governs the automatic computation and refresh of optimizer statistics at the end of a successful COPY command.
Escape | Select | When this option is specified, the backslash character (\) in input data is treated as an escape character.
Round Decimals | Select | If set, any decimals are rounded to fit into the column when the number of decimal places in the input data is larger than that defined for the target column.
Selecting the 'Guess Schema' button will cause Load Generator to attempt to guess as many of these properties as possible. They are all available for manual editing by the user, as is the column data in the lower-right panel.
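
For a Redshift target, many of the properties above correspond directly to options of the COPY command. The sketch below shows roughly how a handful of them might translate; the table name, S3 path, and IAM role ARN are hypothetical placeholders, and the real Load component assembles the statement itself.

    # Roughly the shape of COPY statement the properties above configure.
    copy_sql = """
    COPY public.airport_codes
    FROM 's3://my-source-bucket/data/states.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MatillionRole'
    DELIMITER '\\t'      -- 'Delimiter'
    IGNOREHEADER 0       -- 'Ignore Header Rows'
    MAXERROR 0           -- 'Maximum Errors'
    ACCEPTINVCHARS '?'   -- 'Replace Invalid Characters'
    DATEFORMAT 'auto'    -- 'Date Format'
    TIMEFORMAT 'auto'    -- 'Time Format'
    EMPTYASNULL          -- 'Empty As Null'
    BLANKSASNULL         -- 'Blanks As Null'
    TRUNCATECOLUMNS      -- 'Truncate Columns'
    TRIMBLANKS           -- 'Trim Blanks'
    COMPUPDATE ON        -- 'Comp Update'
    STATUPDATE ON        -- 'Stat Update'
    """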

When the user is satisfied with the settings, the resulting table output can be viewed by selecting the 'Test' tab at the lower-left of the window. The Test tab will create the requested table on the Redshift cluster and show a sample for the user to inspect. If the user is unhappy with the output, they can return to the 'Configure' tab to make alterations until they are satisfied with the result.


Example

In the following example, Load Generator is used to load the contents of a small file into a table. First, Load Generator is given the path to the file, which is not compressed. When a sample is taken, the upper panel displays raw data from the file (up to the number of lines specified by the 'Row Limit') and we can see this is a list of US States and their airport codes, delimited by whitespace.

Clicking 'Guess Schema' will allow Load Generator to autocomplete the properties for this file. In the properties panel (lower-left), Load Generator has identified the whitespace as tab characters. In the lower-right panel, columns have been automatically identified and given appropriate properties.

Moving to the 'Test' tab, the lower panel shows a sample of the table created by Load Generator. Although the data appears to be delimited correctly, there is a problem with the column names: Load Generator has assumed the first line of the file was used to name the columns, which was not true in this particular case.

To fix this, we return to the 'Configure' tab and set 'Ignore Header Rows' to 0, meaning we assume the file has no header rows at all. Next, in the lower-right panel, we select each field under 'Name' and give the columns more appropriate names.
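
In code terms, the effect of these two changes is to parse every line, including the first, as data, and to supply our own column names. The values below are hypothetical stand-ins for the example file.

    import csv

    # Hypothetical rows mirroring the example file: states and airport codes.
    raw = "Alabama\tBHM\nAlaska\tANC\nArizona\tPHX"

    # 'Ignore Header Rows' = 0: every line, including the first, is data.
    names = ["state", "airport_code"]  # the manually chosen column names
    rows = [dict(zip(names, rec)) for rec in csv.reader(raw.split("\n"), delimiter="\t")]
    print(rows[0])  # {'state': 'Alabama', 'airport_code': 'BHM'}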


Returning to the 'Test' tab, Load Generator will update the sample table with our new properties. Sure enough, we now have the table we want; the first line is no longer recognised as a header, so 'Alabama' now makes it into the table. Meanwhile, the columns are labelled appropriately and we can go on to use this table with other Matillion ETL for Redshift jobs and components.

Clicking 'OK' will return to the job interface and create two linked components: Create/Replace Table and Load. Each component is parameterised by Load Generator, and the pair can be run as a job by linking them to a Start component.
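
Conceptually, the two generated components behave like a CREATE TABLE followed by a COPY, run in sequence against the Redshift cluster. The following is a minimal sketch using psycopg2, assuming placeholder connection details and names; it illustrates the two statements, not the mechanism Matillion itself uses.

    import psycopg2

    copy_sql = """
        COPY public.airport_codes
        FROM 's3://my-source-bucket/data/states.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MatillionRole'
        DELIMITER '\\t' IGNOREHEADER 0
    """

    # Placeholder connection details for a hypothetical Redshift cluster.
    conn = psycopg2.connect(host="example-cluster.redshift.amazonaws.com",
                            port=5439, dbname="dev", user="etl_user",
                            password="placeholder")
    conn.autocommit = True
    with conn.cursor() as cur:
        # Create/Replace Table component: (re)create the target table.
        cur.execute("DROP TABLE IF EXISTS public.airport_codes")
        cur.execute("CREATE TABLE public.airport_codes "
                    "(state VARCHAR(64), airport_code VARCHAR(8))")
        # Load component: COPY the delimited file from S3.
        cur.execute(copy_sql)
    conn.close()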